2023-06-17 16:35:02,077 INFO [train.py:1064] (3/4) Training started
2023-06-17 16:35:02,077 INFO [train.py:1074] (3/4) Device: cuda:3
2023-06-17 16:35:03,769 INFO [lexicon.py:168] (3/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:35:03,963 INFO [train.py:1085] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0423201227-84b4557756-8lx4n', 'IP address': '10.177.6.147'}, 'world_size': 4, 'master_port': 12537, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small_causal'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:35:03,964 INFO [train.py:1087] (3/4) About to create model
2023-06-17 16:35:04,433 INFO [train.py:1091] (3/4) Number of model parameters: 32669302
2023-06-17 16:35:08,650 INFO [train.py:1106] (3/4) Using DDP
2023-06-17 16:35:10,581 INFO [asr_datamodule.py:390] (3/4) About to get train cuts
2023-06-17 16:35:10,598 INFO [asr_datamodule.py:398] (3/4) About to get dev cuts
2023-06-17 16:35:10,600 INFO [asr_datamodule.py:211] (3/4) About to get Musan cuts
2023-06-17 16:35:12,861 INFO [asr_datamodule.py:216] (3/4) Enable MUSAN
2023-06-17 16:35:12,862 INFO [asr_datamodule.py:239] (3/4) Enable SpecAugment
2023-06-17 16:35:12,862 INFO [asr_datamodule.py:240] (3/4) Time warp factor: 80
2023-06-17 16:35:12,862 INFO [asr_datamodule.py:250] (3/4) Num frame mask: 10
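Note on the objective: 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0 and 'simple_loss_scale': 0.5 above configure k2's pruned RNN-T loss, whose simple_loss and pruned_loss terms appear in every per-batch line below, and 'warm_step': 2000 controls how they are blended early on. A minimal sketch of that blending (an illustrative ramp in the spirit of the icefall recipe, not its exact code; simple_loss and pruned_loss stand in for the k2 loss tensors):

```python
import torch

def combine_losses(simple_loss: torch.Tensor,
                   pruned_loss: torch.Tensor,
                   batch_idx_train: int,
                   warm_step: int = 2000,
                   simple_loss_scale: float = 0.5) -> torch.Tensor:
    # Early in training the cheap smoothed "simple" loss dominates and the
    # pruned loss is down-weighted; the weights ramp linearly until
    # batch_idx_train reaches warm_step, after which the fixed scales apply.
    t = min(batch_idx_train / warm_step, 1.0)
    simple_scale = 1.0 - t * (1.0 - simple_loss_scale)  # 1.0 -> 0.5
    pruned_scale = 0.1 + 0.9 * t                        # 0.1 -> 1.0
    return simple_scale * simple_loss + pruned_scale * pruned_loss
```

The logged loss= value is this weighted sum per frame, which is why it sits between the logged simple_loss and pruned_loss columns.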
2023-06-17 16:35:12,862 INFO [asr_datamodule.py:263] (3/4) About to create train dataset
2023-06-17 16:35:12,862 INFO [asr_datamodule.py:289] (3/4) Using DynamicBucketingSampler.
2023-06-17 16:35:16,255 INFO [asr_datamodule.py:305] (3/4) About to create train dataloader
2023-06-17 16:35:16,256 INFO [asr_datamodule.py:336] (3/4) About to create dev dataset
2023-06-17 16:35:16,791 INFO [asr_datamodule.py:354] (3/4) About to create dev dataloader
2023-06-17 16:37:08,029 INFO [train.py:996] (3/4) Epoch 1, batch 0, loss[loss=10.66, simple_loss=9.679, pruned_loss=9.8, over 21848.00 frames. ], tot_loss[loss=10.66, simple_loss=9.679, pruned_loss=9.8, over 21848.00 frames. ], batch size: 98, lr: 2.25e-02, grad_scale: 1.0
2023-06-17 16:37:08,030 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-17 16:37:25,621 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=10.9, simple_loss=9.897, pruned_loss=10.04, over 1796401.00 frames.
2023-06-17 16:37:25,622 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23047MB
2023-06-17 16:37:31,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=0.0, ans=0.5
2023-06-17 16:37:33,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=0.0, ans=7.5
2023-06-17 16:37:38,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=0.0, ans=0.5
2023-06-17 16:37:54,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=146.64 vs. limit=7.545
2023-06-17 16:37:59,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=120.0, ans=0.049625
2023-06-17 16:38:01,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=254.54 vs. limit=7.59
2023-06-17 16:38:01,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=196.09 vs. limit=7.545
2023-06-17 16:38:02,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=132.75 vs. limit=5.03
2023-06-17 16:38:07,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120.0, ans=0.494375
2023-06-17 16:38:13,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=255.82 vs. limit=7.59
2023-06-17 16:38:13,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=43.75 vs. limit=4.048
2023-06-17 16:39:11,362 INFO [train.py:996] (3/4) Epoch 1, batch 50, loss[loss=1.39, simple_loss=1.236, pruned_loss=1.379, over 21479.00 frames. ], tot_loss[loss=4.201, simple_loss=3.879, pruned_loss=3.169, over 961062.51 frames. ], batch size: 211, lr: 2.48e-02, grad_scale: 0.5
2023-06-17 16:39:22,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=207.19 vs. limit=7.6125
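The DynamicBucketingSampler announced above is why the logged batch size jumps around (98, then 211, then 351, ...): batches are capped by total audio duration ('max_duration': 900 seconds, per the config), not by a fixed utterance count. A sketch of how a sampler with the logged settings would be built with lhotse (the manifest path is a placeholder, not taken from this log):

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# "data/fbank/train_cuts.jsonl.gz" is a hypothetical manifest path.
cuts = CutSet.from_file("data/fbank/train_cuts.jsonl.gz")
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=900.0,  # total seconds of audio per batch ('max_duration': 900)
    num_buckets=30,      # cuts of similar duration share a bucket ('num_buckets': 30)
    shuffle=True,        # ('shuffle': True)
)
```

Bucketing by duration keeps padding waste low, so short-utterance batches hold many cuts and long-utterance batches hold few.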
2023-06-17 16:39:22,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=38.09 vs. limit=4.12
2023-06-17 16:39:29,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=360.0, ans=0.2964
2023-06-17 16:39:32,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=170.42 vs. limit=7.77
2023-06-17 16:39:40,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360.0, ans=0.483125
2023-06-17 16:39:42,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=360.0, ans=5.09
2023-06-17 16:39:44,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=12.51 vs. limit=3.054
2023-06-17 16:40:43,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=77.25 vs. limit=7.7025
2023-06-17 16:40:52,300 INFO [train.py:996] (3/4) Epoch 1, batch 100, loss[loss=1.401, simple_loss=1.215, pruned_loss=1.489, over 21759.00 frames. ], tot_loss[loss=2.635, simple_loss=2.401, pruned_loss=2.179, over 1693739.24 frames. ], batch size: 351, lr: 2.70e-02, grad_scale: 1.0
2023-06-17 16:40:56,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 2.341e+02 3.851e+02 6.975e+03 2.847e+04, threshold=7.702e+02, percent-clipped=0.0
2023-06-17 16:41:23,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=187.68 vs. limit=7.7475
2023-06-17 16:41:23,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=3.099
2023-06-17 16:41:24,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=660.0, ans=0.17525000000000002
2023-06-17 16:41:28,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=720.0, ans=0.46625
2023-06-17 16:41:28,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=34.61 vs. limit=7.77
2023-06-17 16:41:30,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=720.0, ans=0.8748
2023-06-17 16:41:39,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=107.32 vs. limit=7.77
2023-06-17 16:41:40,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=720.0, ans=0.46625
2023-06-17 16:42:14,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=43.94 vs. limit=7.7925
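In the [optim.py:471] lines, the five numbers are the min/25%/median/75%/max of recently observed gradient norms, and the clipping threshold tracks Clipping_scale times the median: above, 2.0 x 3.851e+02 = 7.702e+02, exactly the logged threshold. A hedged reconstruction of that bookkeeping (the history length and the use of torch.quantile are assumptions; icefall's ScaledAdam tracks these statistics its own way):

```python
from collections import deque
import torch

class MedianClipper:
    """Clip gradients to clipping_scale * running median of recent grad norms.

    Illustrative reconstruction of the [optim.py:471] log lines only."""
    def __init__(self, clipping_scale: float = 2.0, history: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)

    def clip_(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        # max_norm=inf measures the total norm without actually clipping.
        total = torch.nn.utils.clip_grad_norm_(params, float("inf"))
        self.norms.append(float(total))
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # 2.0 * median
        torch.nn.utils.clip_grad_norm_(params, threshold)
        return q.tolist(), threshold

# Tiny usage example with a dummy parameter.
p = torch.nn.Parameter(torch.randn(10))
p.grad = torch.randn(10)
quartiles, threshold = MedianClipper().clip_([p])
```

percent-clipped then reports how often the measured norm exceeded that threshold; the huge early max (2.847e+04) against a median of 3.851e+02 is why a robust quantile, rather than the mean, sets the threshold.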
2023-06-17 16:42:22,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=840.0, ans=5.525
2023-06-17 16:42:26,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=840.0, ans=0.395
2023-06-17 16:42:32,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=840.0, ans=0.460625
2023-06-17 16:42:33,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.18 vs. limit=8.13
2023-06-17 16:42:37,157 INFO [train.py:996] (3/4) Epoch 1, batch 150, loss[loss=1.189, simple_loss=1.011, pruned_loss=1.285, over 21610.00 frames. ], tot_loss[loss=2.021, simple_loss=1.816, pruned_loss=1.789, over 2270275.71 frames. ], batch size: 230, lr: 2.93e-02, grad_scale: 1.0
2023-06-17 16:42:46,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.64 vs. limit=8.175
2023-06-17 16:43:07,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=30.94 vs. limit=7.86
2023-06-17 16:43:16,048 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 16:43:41,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1080.0, ans=0.44937499999999997
2023-06-17 16:43:51,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=81.52 vs. limit=7.905
2023-06-17 16:44:01,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080.0, ans=0.44937499999999997
2023-06-17 16:44:04,789 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 16:44:20,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1140.0, ans=0.15725
2023-06-17 16:44:26,068 INFO [train.py:996] (3/4) Epoch 1, batch 200, loss[loss=0.8868, simple_loss=0.7575, pruned_loss=0.879, over 15594.00 frames. ], tot_loss[loss=1.694, simple_loss=1.507, pruned_loss=1.543, over 2698189.69 frames. ], batch size: 60, lr: 3.15e-02, grad_scale: 2.0
2023-06-17 16:44:29,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 7.013e+01 1.220e+02 1.520e+02 2.087e+02 3.052e+02, threshold=3.040e+02, percent-clipped=0.0
2023-06-17 16:44:34,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.53 vs. limit=8.4
2023-06-17 16:44:40,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.61 vs. limit=5.6
2023-06-17 16:45:05,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=7.995
2023-06-17 16:45:10,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.91 vs. limit=8.49
2023-06-17 16:45:36,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1380.0, ans=0.4353125
2023-06-17 16:45:52,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1380.0, ans=0.3275
2023-06-17 16:45:59,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=39.33 vs. limit=8.04
2023-06-17 16:46:08,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.08 vs. limit=8.58
2023-06-17 16:46:16,414 INFO [train.py:996] (3/4) Epoch 1, batch 250, loss[loss=1.009, simple_loss=0.8607, pruned_loss=0.9498, over 21609.00 frames. ], tot_loss[loss=1.485, simple_loss=1.311, pruned_loss=1.364, over 3041837.37 frames. ], batch size: 263, lr: 3.38e-02, grad_scale: 2.0
2023-06-17 16:46:27,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1500.0, ans=0.4296875
2023-06-17 16:46:37,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=40.06 vs. limit=8.085
2023-06-17 16:46:38,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1560.0, ans=0.426875
2023-06-17 16:46:54,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620.0, ans=0.2838
2023-06-17 16:46:57,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=28.33 vs. limit=8.1075
2023-06-17 16:47:22,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=8.76
2023-06-17 16:47:28,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=8.76
2023-06-17 16:47:40,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=8.13
2023-06-17 16:47:43,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1740.0, ans=0.4184375
2023-06-17 16:47:47,493 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=1.951e-02
2023-06-17 16:48:03,327 INFO [train.py:996] (3/4) Epoch 1, batch 300, loss[loss=0.8074, simple_loss=0.6839, pruned_loss=0.7431, over 21723.00 frames. ], tot_loss[loss=1.334, simple_loss=1.17, pruned_loss=1.225, over 3313218.17 frames. ], batch size: 124, lr: 3.60e-02, grad_scale: 4.0
2023-06-17 16:48:04,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=8.175
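The [scaling.py:182] ScheduledFloat lines track module hyperparameters (balancer probs, skip rates, whitening limits) that are themselves functions of batch_count. The balancer prob values logged so far are consistent with a piecewise-linear schedule from 0.5 at batch 0 down to 0.125 at batch 8000 (e.g. 0.5 - 720 * 0.375/8000 = 0.46625, matching the batch_count=720.0 entries, and later entries flatten at ans=0.125). A minimal sketch of such a schedule; the breakpoints here are inferred from this log, not read out of scaling.py:

```python
import bisect

class ScheduledFloatSketch:
    """Piecewise-linear float schedule over batch_count (illustrative)."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.5), (8000.0, 0.125)
        self.x, self.y = zip(*sorted(points))

    def value(self, batch_count: float) -> float:
        if batch_count <= self.x[0]:
            return self.y[0]
        if batch_count >= self.x[-1]:
            return self.y[-1]
        i = bisect.bisect_right(self.x, batch_count)
        x0, x1 = self.x[i - 1], self.x[i]
        y0, y1 = self.y[i - 1], self.y[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

prob = ScheduledFloatSketch((0.0, 0.5), (8000.0, 0.125))
assert abs(prob.value(720.0) - 0.46625) < 1e-9  # matches the logged ans at 720.0
assert prob.value(8160.0) == 0.125              # later log entries show ans=0.125
```

Logging these scheduled values alongside the loss makes it possible to correlate instabilities with where each regularizer happened to be on its schedule.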
2023-06-17 16:48:07,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.171e+01 1.173e+02 1.354e+02 1.820e+02 4.361e+02, threshold=2.708e+02, percent-clipped=2.0
2023-06-17 16:48:22,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.89 vs. limit=5.93
2023-06-17 16:48:56,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=8.94
2023-06-17 16:48:57,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=18.26 vs. limit=5.0
2023-06-17 16:49:05,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.20 vs. limit=8.22
2023-06-17 16:49:10,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=8.2425
2023-06-17 16:49:48,612 INFO [train.py:996] (3/4) Epoch 1, batch 350, loss[loss=0.8705, simple_loss=0.7376, pruned_loss=0.7661, over 21812.00 frames. ], tot_loss[loss=1.22, simple_loss=1.063, pruned_loss=1.115, over 3527100.02 frames. ], batch size: 352, lr: 3.83e-02, grad_scale: 4.0
2023-06-17 16:49:48,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2100.0, ans=0.05275
2023-06-17 16:49:50,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2100.0, ans=0.27899999999999997
2023-06-17 16:50:15,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=9.120000000000001
2023-06-17 16:50:19,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2160.0, ans=0.39875
2023-06-17 16:50:25,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2220.0, ans=0.050050000000000004
2023-06-17 16:50:42,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.36 vs. limit=9.165
2023-06-17 16:50:56,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=4.912
2023-06-17 16:51:19,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2340.0, ans=6.4625
2023-06-17 16:51:26,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=22.33 vs. limit=8.3775
2023-06-17 16:51:27,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2340.0, ans=0.0426875
2023-06-17 16:51:36,710 INFO [train.py:996] (3/4) Epoch 1, batch 400, loss[loss=0.8498, simple_loss=0.7142, pruned_loss=0.7387, over 21428.00 frames. ], tot_loss[loss=1.135, simple_loss=0.9823, pruned_loss=1.029, over 3691888.38 frames. ], batch size: 389, lr: 4.05e-02, grad_scale: 8.0
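The lr column is worth reading against the config above ('base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5): it starts at exactly base_lr/2 = 2.25e-02, climbs to ~4.49e-02 by batch 500, and then decays slowly (4.48e-02, 4.47e-02, ... later in the log). That trajectory is consistent with icefall's Eden scheduler; the formula below is a sketch, and the 500-batch warm-up is inferred from the logged values rather than taken from the recipe:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5,
            warmup_batches: float = 500.0) -> float:
    # Decay factors are ~1.0 early on and fall off as batch/epoch grow.
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    # Linear warm-up of the factor from 0.5 to 1.0 (inferred: lr starts at
    # base_lr/2 and plateaus near base_lr around batch 500 in this log).
    warmup = min(1.0, 0.5 + 0.5 * batch / warmup_batches)
    return base_lr * batch_factor * epoch_factor * warmup

# Epoch fraction is ~0 this early, so the epoch factor barely matters yet.
print(round(eden_lr(0.045, 0, 0.0), 4))     # 0.0225 -- logged lr: 2.25e-02
print(round(eden_lr(0.045, 250, 0.0), 4))   # ~0.0337 -- logged lr: 3.38e-02
print(round(eden_lr(0.045, 2000, 0.0), 4))  # ~0.0442 -- logged lr: 4.42e-02
```

The epoch-dependent factor takes over later in training, which is why lr_epochs is expressed in epochs rather than batches.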
2023-06-17 16:51:40,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 8.615e+01 1.452e+02 1.814e+02 2.451e+02 4.544e+02, threshold=3.628e+02, percent-clipped=11.0
2023-06-17 16:51:45,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2400.0, ans=0.3875
2023-06-17 16:51:48,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=8.4
2023-06-17 16:51:57,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.12 vs. limit=6.23
2023-06-17 16:52:22,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2520.0, ans=0.185
2023-06-17 16:52:24,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.06 vs. limit=8.445
2023-06-17 16:52:25,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2520.0, ans=0.381875
2023-06-17 16:52:38,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=3.387
2023-06-17 16:52:51,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.42 vs. limit=8.4675
2023-06-17 16:53:03,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=8.4675
2023-06-17 16:53:13,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2640.0, ans=0.2736
2023-06-17 16:53:26,150 INFO [train.py:996] (3/4) Epoch 1, batch 450, loss[loss=1.216, simple_loss=1.017, pruned_loss=1.035, over 21757.00 frames. ], tot_loss[loss=1.075, simple_loss=0.9239, pruned_loss=0.964, over 3821739.15 frames. ], batch size: 351, lr: 4.28e-02, grad_scale: 8.0
2023-06-17 16:53:28,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.34 vs. limit=8.5125
2023-06-17 16:53:29,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.91 vs. limit=8.5125
2023-06-17 16:53:32,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=5.08
2023-06-17 16:53:39,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=9.525
2023-06-17 16:53:47,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2760.0, ans=0.2724
2023-06-17 16:54:11,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=8.557500000000001
2023-06-17 16:54:12,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=45.52 vs. limit=8.557500000000001
2023-06-17 16:54:21,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2820.0, ans=0.14750000000000002
2023-06-17 16:54:44,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2880.0, ans=0.365
2023-06-17 16:54:47,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=9.66
2023-06-17 16:55:04,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=8.6025
2023-06-17 16:55:06,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=8.6025
2023-06-17 16:55:15,294 INFO [train.py:996] (3/4) Epoch 1, batch 500, loss[loss=0.773, simple_loss=0.6499, pruned_loss=0.6285, over 21849.00 frames. ], tot_loss[loss=1.041, simple_loss=0.8898, pruned_loss=0.9199, over 3920434.13 frames. ], batch size: 107, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 16:55:19,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 9.969e+01 1.768e+02 2.484e+02 3.323e+02 7.392e+02, threshold=4.968e+02, percent-clipped=16.0
2023-06-17 16:55:47,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.64 vs. limit=6.53
2023-06-17 16:55:48,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3060.0, ans=0.2694
2023-06-17 16:55:48,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=3060.0, ans=0.3565625
2023-06-17 16:56:14,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=3120.0, ans=0.35375
2023-06-17 16:56:36,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=3180.0, ans=0.7887000000000001
2023-06-17 16:56:39,409 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=1.270e+01
2023-06-17 16:56:46,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=5.8100000000000005
2023-06-17 16:57:02,932 INFO [train.py:996] (3/4) Epoch 1, batch 550, loss[loss=0.8786, simple_loss=0.7389, pruned_loss=0.6938, over 21887.00 frames. ], tot_loss[loss=1.014, simple_loss=0.8636, pruned_loss=0.8779, over 4006739.15 frames. ], batch size: 332, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 16:57:20,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.17 vs. limit=6.68
2023-06-17 16:57:36,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=8.76
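The [scaling.py:962] Whitening lines report how far each module's activations are from having an isotropic (white) channel covariance; when metric exceeds the scheduled limit, the module applies a corrective penalty, and the limits themselves relax over training (7.545 early, 15.375 and beyond later in this log). One plausible whiteness score of this shape is the ratio of the mean squared covariance eigenvalue to the squared mean eigenvalue, which equals 1.0 for perfectly white features; this is an illustration only, as icefall's scaling.py defines its own variant:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Illustrative whiteness score for activations x of shape (N, C).

    Returns 1.0 when the per-group channel covariance is isotropic and
    grows when a few directions dominate. Not icefall's exact formula."""
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n        # (num_groups, C/g, C/g)
    eigs = torch.linalg.eigvalsh(cov)      # symmetric -> real eigenvalues
    return (eigs ** 2).mean() / eigs.mean() ** 2

x = torch.randn(512, 256) @ torch.randn(256, 256)  # deliberately correlated channels
print(whitening_metric(x).item())  # well above 1, like the metric=45.52 entry above
```

Read this way, the early metric=146.64 vs. limit=7.545 entries say the untrained attention features were highly anisotropic, and the shrinking gap later in the log shows the whitening pressure taking effect.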
2023-06-17 16:57:59,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3420.0, ans=0.2658
2023-06-17 16:57:59,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3420.0, ans=0.2658
2023-06-17 16:58:15,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=3480.0, ans=0.174
2023-06-17 16:58:17,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=8.805
2023-06-17 16:58:50,973 INFO [train.py:996] (3/4) Epoch 1, batch 600, loss[loss=1.104, simple_loss=0.9389, pruned_loss=0.8271, over 21215.00 frames. ], tot_loss[loss=0.9818, simple_loss=0.8353, pruned_loss=0.8289, over 4068106.12 frames. ], batch size: 548, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 16:58:51,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=10.2
2023-06-17 16:58:54,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.961e+02 3.893e+02 6.488e+02 1.570e+03, threshold=7.787e+02, percent-clipped=36.0
2023-06-17 16:59:01,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=3600.0, ans=0.04949747468305833
2023-06-17 16:59:06,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=3660.0, ans=7.2875
2023-06-17 16:59:35,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=5.4879999999999995
2023-06-17 16:59:54,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=10.29
2023-06-17 17:00:07,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=5.945
2023-06-17 17:00:10,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=3780.0, ans=0.058249999999999996
2023-06-17 17:00:27,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. limit=5.96
2023-06-17 17:00:30,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=3840.0, ans=0.055999999999999994
2023-06-17 17:00:36,744 INFO [train.py:996] (3/4) Epoch 1, batch 650, loss[loss=0.7553, simple_loss=0.6458, pruned_loss=0.5463, over 21704.00 frames. ], tot_loss[loss=0.9471, simple_loss=0.8063, pruned_loss=0.7785, over 4121635.07 frames. ], batch size: 298, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:01:48,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=4080.0, ans=0.7572
2023-06-17 17:01:58,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=6.02
2023-06-17 17:02:00,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4080.0, ans=0.2592
2023-06-17 17:02:03,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4140.0, ans=0.2586
2023-06-17 17:02:09,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=4140.0, ans=0.7551
2023-06-17 17:02:21,257 INFO [train.py:996] (3/4) Epoch 1, batch 700, loss[loss=0.7592, simple_loss=0.6412, pruned_loss=0.5545, over 21371.00 frames. ], tot_loss[loss=0.9078, simple_loss=0.7738, pruned_loss=0.7267, over 4162392.43 frames. ], batch size: 471, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:02:24,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.078e+02 5.855e+02 9.456e+02 2.667e+03, threshold=1.171e+03, percent-clipped=39.0
2023-06-17 17:02:27,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=9.075
2023-06-17 17:02:28,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=4200.0, ans=0.303125
2023-06-17 17:02:30,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=4200.0, ans=0.303125
2023-06-17 17:02:40,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=6.0649999999999995
2023-06-17 17:02:51,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=4260.0, ans=10.695
2023-06-17 17:03:29,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=4320.0, ans=0.07
2023-06-17 17:03:39,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4380.0, ans=0.2562
2023-06-17 17:03:50,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=9.165
2023-06-17 17:04:05,960 INFO [train.py:996] (3/4) Epoch 1, batch 750, loss[loss=0.6729, simple_loss=0.5714, pruned_loss=0.4768, over 15325.00 frames. ], tot_loss[loss=0.8725, simple_loss=0.7452, pruned_loss=0.6799, over 4180662.06 frames. ], batch size: 63, lr: 4.49e-02, grad_scale: 8.0
2023-06-17 17:04:08,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=4500.0, ans=0.07
2023-06-17 17:04:48,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=9.21
2023-06-17 17:05:38,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=4740.0, ans=0.2778125
2023-06-17 17:05:39,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=9.2775
2023-06-17 17:05:51,447 INFO [train.py:996] (3/4) Epoch 1, batch 800, loss[loss=0.6306, simple_loss=0.5522, pruned_loss=0.4104, over 21755.00 frames. ], tot_loss[loss=0.8326, simple_loss=0.7132, pruned_loss=0.6318, over 4200026.64 frames. ], batch size: 112, lr: 4.49e-02, grad_scale: 16.0
2023-06-17 17:05:53,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=4800.0, ans=0.275
2023-06-17 17:05:54,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 4.402e+02 7.390e+02 1.255e+03 3.583e+03, threshold=1.478e+03, percent-clipped=27.0
2023-06-17 17:05:57,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=9.3
2023-06-17 17:06:32,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=4860.0, ans=0.00981304347826087
2023-06-17 17:07:05,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=4980.0, ans=0.26656250000000004
2023-06-17 17:07:35,938 INFO [train.py:996] (3/4) Epoch 1, batch 850, loss[loss=0.7174, simple_loss=0.6084, pruned_loss=0.4937, over 21647.00 frames. ], tot_loss[loss=0.7947, simple_loss=0.6826, pruned_loss=0.5878, over 4219690.15 frames. ], batch size: 508, lr: 4.49e-02, grad_scale: 16.0
2023-06-17 17:07:50,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5100.0, ans=0.2609375
2023-06-17 17:08:50,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=7.640000000000001
2023-06-17 17:09:19,814 INFO [train.py:996] (3/4) Epoch 1, batch 900, loss[loss=0.6409, simple_loss=0.5561, pruned_loss=0.4155, over 21752.00 frames. ], tot_loss[loss=0.7612, simple_loss=0.6566, pruned_loss=0.5481, over 4238365.75 frames. ], batch size: 247, lr: 4.48e-02, grad_scale: 16.0
2023-06-17 17:09:23,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 4.491e+02 8.246e+02 1.178e+03 2.944e+03, threshold=1.649e+03, percent-clipped=18.0
2023-06-17 17:09:39,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=11.55
2023-06-17 17:11:05,163 INFO [train.py:996] (3/4) Epoch 1, batch 950, loss[loss=0.6135, simple_loss=0.5424, pruned_loss=0.3791, over 21290.00 frames. ], tot_loss[loss=0.7302, simple_loss=0.6325, pruned_loss=0.5122, over 4256711.95 frames. ], batch size: 176, lr: 4.48e-02, grad_scale: 16.0
2023-06-17 17:11:24,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=9.6375
2023-06-17 17:11:29,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=5760.0, ans=0.04266666666666667
2023-06-17 17:11:58,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5820.0, ans=0.2418
2023-06-17 17:12:16,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=5880.0, ans=0.224375
2023-06-17 17:12:35,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=9.7275
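With 'use_fp16': True, the trailing grad_scale field is the dynamic loss scale for mixed-precision training: it doubles while steps stay finite (0.5 -> 1.0 -> 2.0 -> ... -> 16.0 by batch 800) and halves after an overflow (back to 8.0 by batch 1000). A minimal, runnable sketch of the same behaviour using PyTorch's stock GradScaler (the recipe logs its own scaler's state; model, optimizer, and the growth cadence below are illustrative stand-ins):

```python
import torch

model = torch.nn.Linear(80, 5537)  # stand-in; dims echo feature_dim/vocab_size
optimizer = torch.optim.SGD(model.parameters(), lr=0.045)
amp_on = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0,       # this log starts at grad_scale: 1.0
    growth_factor=2.0,    # doubling pattern seen in the logged grad_scale
    backoff_factor=0.5,   # halving on overflow (16.0 -> 8.0 near batch 1000)
    growth_interval=200,  # cadence guessed from this log, not from the recipe
    enabled=amp_on,
)

for _ in range(3):  # placeholder batches of fbank-like features
    x = torch.randn(8, 80)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=amp_on):
        loss = model(x).logsumexp(dim=-1).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # internally skipped if the grads overflowed
    scaler.update()         # grows or backs off the scale
```

Watching grad_scale in the log is a quick health check: repeated halving without recovery usually signals fp16 instability rather than a data problem.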
2023-06-17 17:12:38,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=5940.0, ans=0.2215625
2023-06-17 17:12:44,943 INFO [train.py:996] (3/4) Epoch 1, batch 1000, loss[loss=0.576, simple_loss=0.5005, pruned_loss=0.3647, over 21464.00 frames. ], tot_loss[loss=0.7043, simple_loss=0.6122, pruned_loss=0.4825, over 4270305.19 frames. ], batch size: 212, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:12:50,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 4.573e+02 9.444e+02 1.523e+03 4.461e+03, threshold=1.889e+03, percent-clipped=19.0
2023-06-17 17:13:45,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=6120.0, ans=0.04116666666666667
2023-06-17 17:14:15,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=12.18
2023-06-17 17:14:22,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6240.0, ans=0.23759999999999998
2023-06-17 17:14:23,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=6240.0, ans=0.20750000000000002
2023-06-17 17:14:30,016 INFO [train.py:996] (3/4) Epoch 1, batch 1050, loss[loss=0.5568, simple_loss=0.5046, pruned_loss=0.3226, over 21265.00 frames. ], tot_loss[loss=0.6821, simple_loss=0.5956, pruned_loss=0.4565, over 4274115.67 frames. ], batch size: 159, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:14:30,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=6300.0, ans=0.6795
2023-06-17 17:15:14,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=6360.0, ans=0.030125000000000002
2023-06-17 17:16:19,593 INFO [train.py:996] (3/4) Epoch 1, batch 1100, loss[loss=0.5317, simple_loss=0.4859, pruned_loss=0.3015, over 21871.00 frames. ], tot_loss[loss=0.666, simple_loss=0.5849, pruned_loss=0.4346, over 4286646.14 frames. ], batch size: 118, lr: 4.48e-02, grad_scale: 8.0
2023-06-17 17:16:24,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.653e+02 4.618e+02 6.760e+02 9.652e+02 3.048e+03, threshold=1.352e+03, percent-clipped=4.0
2023-06-17 17:16:34,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=12.45
2023-06-17 17:16:43,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=6600.0, ans=0.29900000000000004
2023-06-17 17:16:56,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=6660.0, ans=0.1878125
2023-06-17 17:17:13,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=6720.0, ans=0.6648000000000001
2023-06-17 17:17:18,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=6720.0, ans=0.185
2023-06-17 17:17:55,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=10.065
2023-06-17 17:18:16,846 INFO [train.py:996] (3/4) Epoch 1, batch 1150, loss[loss=0.5986, simple_loss=0.5482, pruned_loss=0.3368, over 21569.00 frames. ], tot_loss[loss=0.6515, simple_loss=0.5744, pruned_loss=0.4165, over 4288717.40 frames. ], batch size: 230, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:18:33,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6900.0, ans=0.1765625
2023-06-17 17:18:33,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=6900.0, ans=0.04949747468305833
2023-06-17 17:19:03,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=7020.0, ans=0.17093750000000002
2023-06-17 17:20:04,095 INFO [train.py:996] (3/4) Epoch 1, batch 1200, loss[loss=0.5309, simple_loss=0.4609, pruned_loss=0.3253, over 20269.00 frames. ], tot_loss[loss=0.6412, simple_loss=0.5678, pruned_loss=0.4019, over 4287082.28 frames. ], batch size: 703, lr: 4.47e-02, grad_scale: 16.0
2023-06-17 17:20:09,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.949e+02 7.827e+02 1.470e+03 3.073e+03, threshold=1.565e+03, percent-clipped=26.0
2023-06-17 17:20:09,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=7200.0, ans=0.16249999999999998
2023-06-17 17:20:17,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=6.88
2023-06-17 17:20:46,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=7320.0, ans=0.156875
2023-06-17 17:21:08,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=10.2675
2023-06-17 17:21:44,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=7440.0, ans=0.15125
2023-06-17 17:21:46,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=10.29
2023-06-17 17:21:50,134 INFO [train.py:996] (3/4) Epoch 1, batch 1250, loss[loss=0.7731, simple_loss=0.6719, pruned_loss=0.4697, over 21532.00 frames. ], tot_loss[loss=0.6359, simple_loss=0.5646, pruned_loss=0.3921, over 4292748.11 frames. ], batch size: 509, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:22:49,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=7680.0, ans=0.0092
2023-06-17 17:23:10,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7680.0, ans=0.14
2023-06-17 17:23:12,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=10.4025
2023-06-17 17:23:34,681 INFO [train.py:996] (3/4) Epoch 1, batch 1300, loss[loss=0.5049, simple_loss=0.4648, pruned_loss=0.279, over 21405.00 frames. ], tot_loss[loss=0.6216, simple_loss=0.5551, pruned_loss=0.3763, over 4287857.57 frames. ], batch size: 131, lr: 4.47e-02, grad_scale: 8.0
2023-06-17 17:23:46,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 6.355e+02 9.383e+02 1.437e+03 4.251e+03, threshold=1.877e+03, percent-clipped=19.0
2023-06-17 17:24:09,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7860.0, ans=0.13156250000000003
2023-06-17 17:24:26,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=13.440000000000001
2023-06-17 17:25:18,167 INFO [train.py:996] (3/4) Epoch 1, batch 1350, loss[loss=0.5412, simple_loss=0.4996, pruned_loss=0.2972, over 21229.00 frames. ], tot_loss[loss=0.6165, simple_loss=0.5521, pruned_loss=0.3681, over 4293083.69 frames. ], batch size: 176, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:25:39,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=8160.0, ans=0.125
2023-06-17 17:26:13,361 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-17 17:27:04,242 INFO [train.py:996] (3/4) Epoch 1, batch 1400, loss[loss=0.6739, simple_loss=0.5991, pruned_loss=0.3897, over 21705.00 frames. ], tot_loss[loss=0.603, simple_loss=0.5418, pruned_loss=0.3555, over 4286308.27 frames. ], batch size: 441, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:27:16,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.966e+02 8.167e+02 1.163e+03 2.690e+03, threshold=1.633e+03, percent-clipped=5.0
2023-06-17 17:27:36,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=8460.0, ans=0.125
2023-06-17 17:27:41,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=8520.0, ans=0.125
2023-06-17 17:27:55,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=10.695
2023-06-17 17:27:57,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=8580.0, ans=0.025
2023-06-17 17:28:04,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=8580.0, ans=0.07
2023-06-17 17:28:14,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=8580.0, ans=0.125
2023-06-17 17:28:39,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=8640.0, ans=0.008991304347826088
2023-06-17 17:28:47,813 INFO [train.py:996] (3/4) Epoch 1, batch 1450, loss[loss=0.579, simple_loss=0.5341, pruned_loss=0.3169, over 21246.00 frames. ], tot_loss[loss=0.5978, simple_loss=0.5381, pruned_loss=0.3489, over 4293180.22 frames. ], batch size: 143, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:29:11,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=8760.0, ans=0.125
2023-06-17 17:29:22,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=8760.0, ans=0.125
2023-06-17 17:29:25,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=8820.0, ans=0.125
2023-06-17 17:30:17,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=8940.0, ans=0.5871
2023-06-17 17:30:22,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=8940.0, ans=0.5871
2023-06-17 17:30:31,052 INFO [train.py:996] (3/4) Epoch 1, batch 1500, loss[loss=0.5598, simple_loss=0.5071, pruned_loss=0.3134, over 21480.00 frames. ], tot_loss[loss=0.5909, simple_loss=0.5338, pruned_loss=0.3409, over 4294663.93 frames. ], batch size: 131, lr: 4.46e-02, grad_scale: 8.0
2023-06-17 17:30:42,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.683e+02 9.412e+02 1.321e+03 2.952e+03, threshold=1.882e+03, percent-clipped=11.0
2023-06-17 17:31:45,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9180.0, ans=0.125
2023-06-17 17:32:08,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9240.0, ans=0.125
2023-06-17 17:32:19,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=9240.0, ans=0.125
2023-06-17 17:32:22,521 INFO [train.py:996] (3/4) Epoch 1, batch 1550, loss[loss=0.5872, simple_loss=0.5359, pruned_loss=0.3248, over 21541.00 frames. ], tot_loss[loss=0.5786, simple_loss=0.5261, pruned_loss=0.3293, over 4294372.59 frames. ], batch size: 414, lr: 4.45e-02, grad_scale: 8.0
2023-06-17 17:32:38,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=9360.0, ans=0.5724
2023-06-17 17:32:46,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=9360.0, ans=0.02766666666666667
2023-06-17 17:33:11,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=9420.0, ans=0.025
2023-06-17 17:33:12,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=9420.0, ans=0.125
2023-06-17 17:33:46,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=11.0775
2023-06-17 17:33:57,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9540.0, ans=0.125
2023-06-17 17:34:05,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.76 vs. limit=7.385
2023-06-17 17:34:09,485 INFO [train.py:996] (3/4) Epoch 1, batch 1600, loss[loss=0.442, simple_loss=0.4291, pruned_loss=0.2251, over 21266.00 frames. ], tot_loss[loss=0.5694, simple_loss=0.5203, pruned_loss=0.3204, over 4288703.39 frames. ], batch size: 176, lr: 4.45e-02, grad_scale: 16.0
2023-06-17 17:34:15,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.713e+02 5.768e+02 7.778e+02 1.283e+03 4.290e+03, threshold=1.556e+03, percent-clipped=12.0
2023-06-17 17:34:16,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=9600.0, ans=0.02666666666666667
2023-06-17 17:34:32,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=9660.0, ans=0.125
2023-06-17 17:34:59,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=9720.0, ans=0.026166666666666668
2023-06-17 17:35:20,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9780.0, ans=0.125
2023-06-17 17:35:49,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.95 vs. limit=4.476
2023-06-17 17:35:53,557 INFO [train.py:996] (3/4) Epoch 1, batch 1650, loss[loss=0.6081, simple_loss=0.5635, pruned_loss=0.3288, over 21698.00 frames. ], tot_loss[loss=0.5632, simple_loss=0.5173, pruned_loss=0.3135, over 4282994.27 frames. ], batch size: 414, lr: 4.45e-02, grad_scale: 8.0
2023-06-17 17:35:55,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9900.0, ans=0.201
2023-06-17 17:36:54,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=11.28
2023-06-17 17:37:01,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10080.0, ans=0.1992
2023-06-17 17:37:33,250 INFO [train.py:996] (3/4) Epoch 1, batch 1700, loss[loss=0.5021, simple_loss=0.4749, pruned_loss=0.2648, over 21063.00 frames. ], tot_loss[loss=0.5625, simple_loss=0.5185, pruned_loss=0.3105, over 4282614.43 frames. ], batch size: 608, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:37:41,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.849e+02 8.667e+02 1.230e+03 2.717e+03, threshold=1.733e+03, percent-clipped=16.0
2023-06-17 17:38:49,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=10380.0, ans=0.125
2023-06-17 17:39:04,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10440.0, ans=0.0
2023-06-17 17:39:19,718 INFO [train.py:996] (3/4) Epoch 1, batch 1750, loss[loss=0.5413, simple_loss=0.5129, pruned_loss=0.2849, over 21458.00 frames. ], tot_loss[loss=0.5521, simple_loss=0.5138, pruned_loss=0.3006, over 4275231.89 frames. ], batch size: 548, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:39:29,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=11.4375
2023-06-17 17:39:45,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.375
2023-06-17 17:40:20,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=15.465
2023-06-17 17:40:26,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=10680.0, ans=0.008547826086956522
2023-06-17 17:40:29,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10680.0, ans=0.125
2023-06-17 17:40:43,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=10680.0, ans=0.125
2023-06-17 17:41:18,134 INFO [train.py:996] (3/4) Epoch 1, batch 1800, loss[loss=0.4678, simple_loss=0.4433, pruned_loss=0.2461, over 21541.00 frames. ], tot_loss[loss=0.5349, simple_loss=0.5019, pruned_loss=0.2879, over 4278252.40 frames. ], batch size: 263, lr: 4.44e-02, grad_scale: 8.0
2023-06-17 17:41:26,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 4.676e+02 7.112e+02 1.184e+03 2.740e+03, threshold=1.422e+03, percent-clipped=6.0
2023-06-17 17:41:42,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10860.0, ans=0.1914
2023-06-17 17:41:49,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10860.0, ans=0.125
2023-06-17 17:42:02,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=10920.0, ans=0.008495652173913043
2023-06-17 17:42:04,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=8.368
2023-06-17 17:42:05,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=10920.0, ans=0.125
2023-06-17 17:42:25,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=11.6175
2023-06-17 17:42:36,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896
2023-06-17 17:43:02,028 INFO [train.py:996] (3/4) Epoch 1, batch 1850, loss[loss=0.5662, simple_loss=0.5157, pruned_loss=0.3099, over 21662.00 frames. ], tot_loss[loss=0.5262, simple_loss=0.4979, pruned_loss=0.28, over 4271918.25 frames. ], batch size: 263, lr: 4.43e-02, grad_scale: 8.0
2023-06-17 17:43:04,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.87 vs. limit=10.55
2023-06-17 17:43:56,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=11280.0, ans=0.125
2023-06-17 17:44:37,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=11340.0, ans=0.008404347826086957
2023-06-17 17:44:39,729 INFO [train.py:996] (3/4) Epoch 1, batch 1900, loss[loss=0.5515, simple_loss=0.5051, pruned_loss=0.2998, over 21730.00 frames. ], tot_loss[loss=0.5188, simple_loss=0.4925, pruned_loss=0.2746, over 4274568.75 frames. ], batch size: 473, lr: 4.43e-02, grad_scale: 8.0
], batch size: 473, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:44:44,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11400.0, ans=0.125 2023-06-17 17:44:47,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.687e+02 6.940e+02 1.118e+03 3.518e+03, threshold=1.388e+03, percent-clipped=15.0 2023-06-17 17:44:52,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=11400.0, ans=0.008391304347826088 2023-06-17 17:44:53,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=11.775 2023-06-17 17:45:02,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=11460.0, ans=0.125 2023-06-17 17:45:35,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=11580.0, ans=0.37370000000000003 2023-06-17 17:45:40,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=11.842500000000001 2023-06-17 17:46:22,592 INFO [train.py:996] (3/4) Epoch 1, batch 1950, loss[loss=0.6288, simple_loss=0.574, pruned_loss=0.3423, over 21765.00 frames. ], tot_loss[loss=0.5105, simple_loss=0.485, pruned_loss=0.2696, over 4276743.65 frames. ], batch size: 441, lr: 4.43e-02, grad_scale: 4.0 2023-06-17 17:46:51,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=11.91 2023-06-17 17:47:21,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.56 vs. limit=16.41 2023-06-17 17:47:49,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11940.0, ans=0.18059999999999998 2023-06-17 17:48:00,874 INFO [train.py:996] (3/4) Epoch 1, batch 2000, loss[loss=0.3814, simple_loss=0.4006, pruned_loss=0.1811, over 21799.00 frames. ], tot_loss[loss=0.4998, simple_loss=0.4778, pruned_loss=0.2621, over 4283342.23 frames. ], batch size: 282, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:48:15,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.181e+02 5.337e+02 7.170e+02 1.145e+03 2.393e+03, threshold=1.434e+03, percent-clipped=15.0 2023-06-17 17:48:20,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=12060.0, ans=0.125 2023-06-17 17:48:25,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=12060.0, ans=10.0 2023-06-17 17:48:31,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=8.015 2023-06-17 17:48:46,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12120.0, ans=0.125 2023-06-17 17:48:58,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. 
limit=5.0 2023-06-17 17:49:37,866 INFO [train.py:996] (3/4) Epoch 1, batch 2050, loss[loss=0.4395, simple_loss=0.4564, pruned_loss=0.2113, over 21605.00 frames. ], tot_loss[loss=0.5003, simple_loss=0.4803, pruned_loss=0.261, over 4290937.23 frames. ], batch size: 263, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:49:38,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=12300.0, ans=0.125 2023-06-17 17:50:15,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=11.21 2023-06-17 17:50:22,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=12420.0, ans=0.125 2023-06-17 17:50:30,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12480.0, ans=0.1752 2023-06-17 17:50:30,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=12480.0, ans=0.125 2023-06-17 17:50:50,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=8.120000000000001 2023-06-17 17:51:00,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=12540.0, ans=0.014416666666666668 2023-06-17 17:51:08,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=12.2025 2023-06-17 17:51:15,787 INFO [train.py:996] (3/4) Epoch 1, batch 2100, loss[loss=0.4983, simple_loss=0.4792, pruned_loss=0.2587, over 21758.00 frames. ], tot_loss[loss=0.5055, simple_loss=0.4854, pruned_loss=0.2635, over 4279728.03 frames. ], batch size: 124, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:51:29,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=12600.0, ans=0.014166666666666668 2023-06-17 17:51:31,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.912e+02 5.111e+02 7.540e+02 1.226e+03 2.396e+03, threshold=1.508e+03, percent-clipped=15.0 2023-06-17 17:51:54,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=12720.0, ans=0.125 2023-06-17 17:52:40,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=12.2925 2023-06-17 17:53:00,921 INFO [train.py:996] (3/4) Epoch 1, batch 2150, loss[loss=0.429, simple_loss=0.4144, pruned_loss=0.2218, over 21630.00 frames. ], tot_loss[loss=0.5026, simple_loss=0.4815, pruned_loss=0.2624, over 4285508.73 frames. 
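
Note on ScheduledFloat: these lines track parameters that are deterministic functions of batch_count. The feed_forward dropout_p entries decay linearly (0.201 at batch_count=9900, 0.1752 at 12480, 0.1254 at 17460, flat at 0.1 from 20400 on), i.e. a piecewise-linear schedule from 0.3 at batch 0 to 0.1 at batch 20000. A sketch of that interpolation; the breakpoints are inferred from this log, and each named parameter has its own:

    def scheduled_float(batch_count: float,
                        points=((0.0, 0.3), (20000.0, 0.1))) -> float:
        # Linear interpolation between two (batch_count, value) breakpoints,
        # clamped outside the interval. E.g. scheduled_float(9900) == 0.201.
        (x0, y0), (x1, y1) = points
        if batch_count <= x0:
            return y0
        if batch_count >= x1:
            return y1
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)
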
], batch size: 247, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:53:13,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=12900.0, ans=0.008065217391304348 2023-06-17 17:53:33,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=12960.0, ans=0.05 2023-06-17 17:54:08,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13080.0, ans=0.1692 2023-06-17 17:54:44,029 INFO [train.py:996] (3/4) Epoch 1, batch 2200, loss[loss=0.3562, simple_loss=0.386, pruned_loss=0.1632, over 21245.00 frames. ], tot_loss[loss=0.4992, simple_loss=0.4831, pruned_loss=0.2581, over 4286303.98 frames. ], batch size: 159, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:54:44,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=13200.0, ans=0.125 2023-06-17 17:54:54,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=13200.0, ans=0.011666666666666672 2023-06-17 17:54:59,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 5.225e+02 6.882e+02 1.154e+03 2.681e+03, threshold=1.376e+03, percent-clipped=19.0 2023-06-17 17:55:09,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=13260.0, ans=0.125 2023-06-17 17:55:38,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=12.495000000000001 2023-06-17 17:56:13,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=13440.0, ans=0.125 2023-06-17 17:56:34,497 INFO [train.py:996] (3/4) Epoch 1, batch 2250, loss[loss=0.4122, simple_loss=0.3981, pruned_loss=0.2132, over 21841.00 frames. ], tot_loss[loss=0.4851, simple_loss=0.4733, pruned_loss=0.2488, over 4288714.99 frames. ], batch size: 98, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:56:45,148 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.575e-02 2023-06-17 17:56:48,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13500.0, ans=0.125 2023-06-17 17:56:49,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=13560.0, ans=0.007921739130434782 2023-06-17 17:57:59,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=13740.0, ans=0.125 2023-06-17 17:57:59,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.6525 2023-06-17 17:58:02,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=13740.0, ans=0.125 2023-06-17 17:58:11,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=17.805 2023-06-17 17:58:19,042 INFO [train.py:996] (3/4) Epoch 1, batch 2300, loss[loss=0.4247, simple_loss=0.4188, pruned_loss=0.2153, over 21655.00 frames. 
], tot_loss[loss=0.4768, simple_loss=0.4645, pruned_loss=0.2448, over 4276781.08 frames. ], batch size: 282, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:58:21,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13800.0, ans=0.162 2023-06-17 17:58:29,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 5.278e+02 8.077e+02 1.161e+03 3.244e+03, threshold=1.615e+03, percent-clipped=15.0 2023-06-17 17:59:04,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=13920.0, ans=0.4128 2023-06-17 17:59:12,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=13980.0, ans=0.125 2023-06-17 17:59:52,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=12.765 2023-06-17 18:00:03,804 INFO [train.py:996] (3/4) Epoch 1, batch 2350, loss[loss=0.4119, simple_loss=0.4006, pruned_loss=0.2115, over 21485.00 frames. ], tot_loss[loss=0.4758, simple_loss=0.4626, pruned_loss=0.2447, over 4284931.45 frames. ], batch size: 195, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:00:14,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=8.525 2023-06-17 18:01:20,369 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:01:38,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.877500000000001 2023-06-17 18:01:49,087 INFO [train.py:996] (3/4) Epoch 1, batch 2400, loss[loss=0.6116, simple_loss=0.5604, pruned_loss=0.3314, over 21513.00 frames. ], tot_loss[loss=0.4849, simple_loss=0.4708, pruned_loss=0.2497, over 4278902.07 frames. ], batch size: 414, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:01:54,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=14400.0, ans=0.125 2023-06-17 18:01:59,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 4.626e+02 8.072e+02 1.275e+03 2.674e+03, threshold=1.614e+03, percent-clipped=13.0 2023-06-17 18:02:30,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14520.0, ans=0.15480000000000002 2023-06-17 18:02:48,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=14520.0, ans=0.125 2023-06-17 18:03:02,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:03:23,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=14640.0, ans=0.005666666666666667 2023-06-17 18:03:34,290 INFO [train.py:996] (3/4) Epoch 1, batch 2450, loss[loss=0.4453, simple_loss=0.4485, pruned_loss=0.221, over 21770.00 frames. ], tot_loss[loss=0.4887, simple_loss=0.4754, pruned_loss=0.2512, over 4275603.53 frames. 
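
Note on the optim.py clipping lines: at regular intervals they print the quartiles (min/25%/50%/75%/max) of recent per-batch gradient norms, a clipping threshold, and the share of batches clipped. In these records the threshold equals Clipping_scale times the median quartile, e.g. 2.0 x 7.778e+02 = 1.556e+03 and 2.0 x 8.667e+02 = 1.733e+03. A sketch of such a scheme; the window size and in-place clipping step are assumptions, not icefall's actual optim.py logic:

    from collections import deque
    import torch

    class QuartileClipper:
        def __init__(self, scale: float = 2.0, window: int = 1000):
            self.scale = scale                 # Clipping_scale in the log
            self.norms = deque(maxlen=window)  # recent per-batch grad norms

        def clip_(self, parameters) -> float:
            params = [p for p in parameters if p.grad is not None]
            total_norm = torch.norm(
                torch.stack([p.grad.detach().norm() for p in params]))
            self.norms.append(total_norm.item())
            hist = torch.tensor(list(self.norms))
            q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.scale * q[2].item()  # scale x median, as logged
            if total_norm > threshold:            # counted into percent-clipped
                for p in params:
                    p.grad.mul_(threshold / total_norm)
            return threshold
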
], batch size: 124, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:03:44,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=14700.0, ans=0.007673913043478261 2023-06-17 18:03:45,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:03:48,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=14760.0, ans=0.125 2023-06-17 18:04:06,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=14820.0, ans=0.125 2023-06-17 18:04:27,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=14820.0, ans=0.125 2023-06-17 18:05:16,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.08 vs. limit=18.75 2023-06-17 18:05:16,571 INFO [train.py:996] (3/4) Epoch 1, batch 2500, loss[loss=0.4056, simple_loss=0.4499, pruned_loss=0.1807, over 21336.00 frames. ], tot_loss[loss=0.4782, simple_loss=0.4677, pruned_loss=0.2445, over 4280128.31 frames. ], batch size: 176, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:05:16,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=15000.0, ans=0.125 2023-06-17 18:05:27,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=15000.0, ans=0.007608695652173913 2023-06-17 18:05:28,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.577e+02 4.881e+02 6.609e+02 9.679e+02 1.963e+03, threshold=1.322e+03, percent-clipped=4.0 2023-06-17 18:06:33,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=15180.0, ans=13.192499999999999 2023-06-17 18:07:00,727 INFO [train.py:996] (3/4) Epoch 1, batch 2550, loss[loss=0.3953, simple_loss=0.3961, pruned_loss=0.1973, over 21669.00 frames. ], tot_loss[loss=0.4718, simple_loss=0.4644, pruned_loss=0.2396, over 4275223.69 frames. ], batch size: 283, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:07:07,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=15300.0, ans=0.0029166666666666716 2023-06-17 18:07:19,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=13.26 2023-06-17 18:08:15,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=15540.0, ans=0.125 2023-06-17 18:08:18,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=15540.0, ans=0.35609999999999997 2023-06-17 18:08:38,594 INFO [train.py:996] (3/4) Epoch 1, batch 2600, loss[loss=0.4654, simple_loss=0.4578, pruned_loss=0.2365, over 21774.00 frames. ], tot_loss[loss=0.476, simple_loss=0.4676, pruned_loss=0.2423, over 4281414.12 frames. 
], batch size: 247, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:08:39,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15600.0, ans=0.14400000000000002 2023-06-17 18:08:50,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.934e+02 4.629e+02 7.030e+02 1.078e+03 2.784e+03, threshold=1.406e+03, percent-clipped=16.0 2023-06-17 18:08:53,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=15660.0, ans=0.125 2023-06-17 18:10:07,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=15840.0, ans=0.05348 2023-06-17 18:10:24,600 INFO [train.py:996] (3/4) Epoch 1, batch 2650, loss[loss=0.4793, simple_loss=0.471, pruned_loss=0.2438, over 21965.00 frames. ], tot_loss[loss=0.4773, simple_loss=0.4684, pruned_loss=0.2432, over 4287861.16 frames. ], batch size: 316, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:11:56,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=16140.0, ans=0.0 2023-06-17 18:12:07,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16140.0, ans=0.1386 2023-06-17 18:12:10,379 INFO [train.py:996] (3/4) Epoch 1, batch 2700, loss[loss=0.3249, simple_loss=0.3375, pruned_loss=0.1561, over 21347.00 frames. ], tot_loss[loss=0.4684, simple_loss=0.4634, pruned_loss=0.2368, over 4281627.44 frames. ], batch size: 131, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:12:18,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=16200.0, ans=0.07 2023-06-17 18:12:21,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.712e+02 4.286e+02 6.579e+02 1.091e+03 3.152e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-17 18:12:40,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=13.5975 2023-06-17 18:13:54,403 INFO [train.py:996] (3/4) Epoch 1, batch 2750, loss[loss=0.4034, simple_loss=0.4442, pruned_loss=0.1813, over 21450.00 frames. ], tot_loss[loss=0.4672, simple_loss=0.4626, pruned_loss=0.2359, over 4290356.53 frames. ], batch size: 194, lr: 4.36e-02, grad_scale: 4.0 2023-06-17 18:13:59,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=16500.0, ans=0.007282608695652174 2023-06-17 18:14:28,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=16560.0, ans=0.125 2023-06-17 18:14:38,225 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:15:32,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.01 vs. limit=9.184999999999999 2023-06-17 18:15:35,674 INFO [train.py:996] (3/4) Epoch 1, batch 2800, loss[loss=0.4145, simple_loss=0.4033, pruned_loss=0.2128, over 21295.00 frames. ], tot_loss[loss=0.4692, simple_loss=0.4656, pruned_loss=0.2364, over 4285901.46 frames. 
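
Note on the Whitening lines: scaling.py emits one whenever a module's output covariance is judged too far from white, reporting a metric against a (scheduled) whitening_limit. The metric's formula is not shown in this log; one hypothetical reconstruction that matches its behaviour (at least 1, equal to 1 for perfectly whitened features, very large early in training) is the ratio of the largest covariance eigenvalue to the mean eigenvalue per channel group:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels). Hypothetical sketch, not scaling.py's
        # actual formula: per group, the largest eigenvalue of the feature
        # covariance divided by the mean eigenvalue; 1.0 means fully white.
        n, c = x.shape
        g = c // num_groups
        metrics = []
        for i in range(num_groups):
            xg = x[:, i * g:(i + 1) * g]
            xg = xg - xg.mean(dim=0, keepdim=True)
            cov = (xg.t() @ xg) / n
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append((eigs.max() / eigs.mean().clamp(min=1e-20)).item())
        return max(metrics)
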
], batch size: 551, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:15:36,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=20.1 2023-06-17 18:15:40,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=16800.0, ans=0.04949747468305833 2023-06-17 18:15:52,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.29 vs. limit=20.1 2023-06-17 18:15:59,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.894e+02 6.832e+02 1.003e+03 4.773e+03, threshold=1.366e+03, percent-clipped=15.0 2023-06-17 18:16:01,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16860.0, ans=0.13140000000000002 2023-06-17 18:16:29,262 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:16:54,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=16980.0, ans=0.07 2023-06-17 18:17:01,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=16980.0, ans=0.0 2023-06-17 18:17:01,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=13.8675 2023-06-17 18:17:20,328 INFO [train.py:996] (3/4) Epoch 1, batch 2850, loss[loss=0.4402, simple_loss=0.4541, pruned_loss=0.2131, over 21657.00 frames. ], tot_loss[loss=0.4661, simple_loss=0.4639, pruned_loss=0.2342, over 4284326.23 frames. ], batch size: 414, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:17:50,224 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:18:10,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17220.0, ans=0.12780000000000002 2023-06-17 18:18:55,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.97 vs. limit=13.67 2023-06-17 18:19:03,954 INFO [train.py:996] (3/4) Epoch 1, batch 2900, loss[loss=0.4402, simple_loss=0.4333, pruned_loss=0.2236, over 21892.00 frames. ], tot_loss[loss=0.4605, simple_loss=0.4593, pruned_loss=0.2309, over 4286891.69 frames. 
], batch size: 107, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:19:13,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=17400.0, ans=0.125 2023-06-17 18:19:28,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 4.517e+02 6.306e+02 8.812e+02 1.788e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-17 18:19:39,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17460.0, ans=0.0 2023-06-17 18:19:46,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17460.0, ans=0.1254 2023-06-17 18:19:59,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17520.0, ans=0.0 2023-06-17 18:20:16,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=17580.0, ans=0.04949747468305833 2023-06-17 18:20:21,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17580.0, ans=0.1242 2023-06-17 18:20:36,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17640.0, ans=0.0 2023-06-17 18:20:54,491 INFO [train.py:996] (3/4) Epoch 1, batch 2950, loss[loss=0.4224, simple_loss=0.4204, pruned_loss=0.2122, over 21396.00 frames. ], tot_loss[loss=0.4573, simple_loss=0.459, pruned_loss=0.2278, over 4293432.10 frames. ], batch size: 144, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:44,537 INFO [train.py:996] (3/4) Epoch 1, batch 3000, loss[loss=0.4774, simple_loss=0.4747, pruned_loss=0.2401, over 21534.00 frames. ], tot_loss[loss=0.4611, simple_loss=0.4638, pruned_loss=0.2292, over 4283172.56 frames. ], batch size: 194, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:44,537 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 18:23:01,481 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3658, simple_loss=0.4363, pruned_loss=0.1476, over 1796401.00 frames. 2023-06-17 18:23:01,482 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-17 18:23:20,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 5.025e+02 6.573e+02 9.808e+02 2.550e+03, threshold=1.315e+03, percent-clipped=11.0 2023-06-17 18:23:53,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=18120.0, ans=0.125 2023-06-17 18:24:08,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18180.0, ans=0.1182 2023-06-17 18:24:13,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=18180.0, ans=0.26370000000000005 2023-06-17 18:24:45,857 INFO [train.py:996] (3/4) Epoch 1, batch 3050, loss[loss=0.4935, simple_loss=0.4885, pruned_loss=0.2493, over 21880.00 frames. ], tot_loss[loss=0.4597, simple_loss=0.4644, pruned_loss=0.2275, over 4285537.41 frames. ], batch size: 371, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:24:57,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.57 vs. 
limit=14.3625 2023-06-17 18:25:08,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18360.0, ans=0.1164 2023-06-17 18:25:19,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=18360.0, ans=0.2574000000000001 2023-06-17 18:25:19,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18360.0, ans=0.1164 2023-06-17 18:25:30,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18420.0, ans=0.11580000000000001 2023-06-17 18:25:40,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=18420.0, ans=0.25529999999999997 2023-06-17 18:26:35,502 INFO [train.py:996] (3/4) Epoch 1, batch 3100, loss[loss=0.4667, simple_loss=0.4569, pruned_loss=0.2383, over 21579.00 frames. ], tot_loss[loss=0.4533, simple_loss=0.4598, pruned_loss=0.2235, over 4283397.29 frames. ], batch size: 548, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:26:53,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.845e+02 4.821e+02 6.301e+02 1.043e+03 2.318e+03, threshold=1.260e+03, percent-clipped=14.0 2023-06-17 18:27:09,543 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:27:15,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=18720.0, ans=0.125 2023-06-17 18:27:57,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=18840.0, ans=0.05 2023-06-17 18:28:01,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=18840.0, ans=0.1116 2023-06-17 18:28:20,377 INFO [train.py:996] (3/4) Epoch 1, batch 3150, loss[loss=0.5393, simple_loss=0.5237, pruned_loss=0.2775, over 21484.00 frames. ], tot_loss[loss=0.4585, simple_loss=0.4643, pruned_loss=0.2263, over 4288414.26 frames. ], batch size: 131, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 18:28:54,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=18960.0, ans=0.125 2023-06-17 18:29:28,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=5.862 2023-06-17 18:29:41,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19080.0, ans=0.10920000000000002 2023-06-17 18:30:11,510 INFO [train.py:996] (3/4) Epoch 1, batch 3200, loss[loss=0.4649, simple_loss=0.4857, pruned_loss=0.2221, over 21676.00 frames. ], tot_loss[loss=0.4544, simple_loss=0.4637, pruned_loss=0.2226, over 4290584.08 frames. ], batch size: 414, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 18:30:24,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 4.999e+02 6.065e+02 1.040e+03 2.031e+03, threshold=1.213e+03, percent-clipped=14.0 2023-06-17 18:31:55,093 INFO [train.py:996] (3/4) Epoch 1, batch 3250, loss[loss=0.4004, simple_loss=0.4074, pruned_loss=0.1967, over 21381.00 frames. ], tot_loss[loss=0.4546, simple_loss=0.462, pruned_loss=0.2236, over 4289420.10 frames. 
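
Note on the validation records (Epoch 1, batch 3000 above): alongside the dev-set loss over a fixed ~1.8M-frame set, the trainer reports the peak CUDA memory allocated so far, here 24414MB. That figure presumably comes from PyTorch's caching-allocator statistics; a minimal equivalent, with the device index assumed to be this worker's rank:

    import torch

    def max_memory_mb(device: int = 3) -> int:
        # Peak bytes ever allocated on the device since process start,
        # matching the "Maximum memory allocated so far is ...MB" records.
        return torch.cuda.max_memory_allocated(device) // (1024 * 1024)
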
], batch size: 211, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:32:10,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=19560.0, ans=0.125 2023-06-17 18:33:22,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=14.9025 2023-06-17 18:33:24,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=14.9025 2023-06-17 18:33:40,004 INFO [train.py:996] (3/4) Epoch 1, batch 3300, loss[loss=0.3597, simple_loss=0.408, pruned_loss=0.1556, over 21533.00 frames. ], tot_loss[loss=0.4477, simple_loss=0.4548, pruned_loss=0.2203, over 4285533.99 frames. ], batch size: 230, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:34:06,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 4.541e+02 6.764e+02 1.015e+03 2.529e+03, threshold=1.353e+03, percent-clipped=14.0 2023-06-17 18:34:41,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19920.0, ans=0.125 2023-06-17 18:34:43,269 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:35:18,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=20040.0, ans=0.04949747468305833 2023-06-17 18:35:24,203 INFO [train.py:996] (3/4) Epoch 1, batch 3350, loss[loss=0.4188, simple_loss=0.4446, pruned_loss=0.1965, over 21407.00 frames. ], tot_loss[loss=0.4493, simple_loss=0.4578, pruned_loss=0.2204, over 4285201.49 frames. ], batch size: 548, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 18:35:26,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=20100.0, ans=0.2 2023-06-17 18:35:53,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=20160.0, ans=0.125 2023-06-17 18:35:57,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-17 18:36:13,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20220.0, ans=0.125 2023-06-17 18:36:18,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=20220.0, ans=0.125 2023-06-17 18:36:26,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=20220.0, ans=0.025 2023-06-17 18:36:41,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-17 18:37:12,829 INFO [train.py:996] (3/4) Epoch 1, batch 3400, loss[loss=0.4529, simple_loss=0.4505, pruned_loss=0.2277, over 21859.00 frames. ], tot_loss[loss=0.4489, simple_loss=0.4569, pruned_loss=0.2205, over 4287055.09 frames. 
], batch size: 124, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:37:20,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20400.0, ans=0.1 2023-06-17 18:37:31,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=20400.0, ans=0.125 2023-06-17 18:37:34,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.338e+02 6.007e+02 8.675e+02 3.027e+03, threshold=1.201e+03, percent-clipped=6.0 2023-06-17 18:37:53,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=20460.0, ans=0.2 2023-06-17 18:38:00,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0 2023-06-17 18:38:17,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=20580.0, ans=0.006395652173913044 2023-06-17 18:38:23,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=20580.0, ans=0.125 2023-06-17 18:38:43,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=20640.0, ans=0.0063826086956521744 2023-06-17 18:39:03,057 INFO [train.py:996] (3/4) Epoch 1, batch 3450, loss[loss=0.6433, simple_loss=0.5926, pruned_loss=0.347, over 21386.00 frames. ], tot_loss[loss=0.444, simple_loss=0.4501, pruned_loss=0.219, over 4282446.39 frames. ], batch size: 507, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:39:12,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-17 18:39:18,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=20760.0, ans=0.025 2023-06-17 18:40:05,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=20880.0, ans=0.125 2023-06-17 18:40:47,103 INFO [train.py:996] (3/4) Epoch 1, batch 3500, loss[loss=0.4861, simple_loss=0.4899, pruned_loss=0.2411, over 21805.00 frames. ], tot_loss[loss=0.4559, simple_loss=0.4616, pruned_loss=0.2251, over 4282053.38 frames. ], batch size: 247, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 18:41:09,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 4.958e+02 6.770e+02 9.160e+02 2.307e+03, threshold=1.354e+03, percent-clipped=16.0 2023-06-17 18:41:52,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=21180.0, ans=0.006265217391304348 2023-06-17 18:41:54,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21180.0, ans=0.125 2023-06-17 18:42:23,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=21240.0, ans=0.0 2023-06-17 18:42:32,160 INFO [train.py:996] (3/4) Epoch 1, batch 3550, loss[loss=0.4442, simple_loss=0.4468, pruned_loss=0.2208, over 21209.00 frames. ], tot_loss[loss=0.4591, simple_loss=0.4648, pruned_loss=0.2267, over 4278488.94 frames. 
], batch size: 159, lr: 4.28e-02, grad_scale: 4.0 2023-06-17 18:42:49,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21300.0, ans=0.1 2023-06-17 18:43:11,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=21360.0, ans=0.125 2023-06-17 18:44:21,625 INFO [train.py:996] (3/4) Epoch 1, batch 3600, loss[loss=0.5015, simple_loss=0.4887, pruned_loss=0.2572, over 21560.00 frames. ], tot_loss[loss=0.454, simple_loss=0.4585, pruned_loss=0.2247, over 4280994.50 frames. ], batch size: 389, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:44:22,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-17 18:44:28,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21600.0, ans=0.1 2023-06-17 18:44:37,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=21660.0, ans=15.0 2023-06-17 18:44:39,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 4.436e+02 5.716e+02 8.040e+02 1.927e+03, threshold=1.143e+03, percent-clipped=4.0 2023-06-17 18:46:01,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21840.0, ans=0.1 2023-06-17 18:46:05,192 INFO [train.py:996] (3/4) Epoch 1, batch 3650, loss[loss=0.5628, simple_loss=0.5627, pruned_loss=0.2815, over 19800.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4601, pruned_loss=0.2242, over 4282370.79 frames. ], batch size: 702, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:46:36,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=21960.0, ans=10.0 2023-06-17 18:46:54,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=22020.0, ans=0.125 2023-06-17 18:46:56,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=22020.0, ans=0.006082608695652174 2023-06-17 18:47:34,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22140.0, ans=0.125 2023-06-17 18:47:48,603 INFO [train.py:996] (3/4) Epoch 1, batch 3700, loss[loss=0.4485, simple_loss=0.4628, pruned_loss=0.2171, over 21841.00 frames. ], tot_loss[loss=0.4524, simple_loss=0.459, pruned_loss=0.2229, over 4276496.25 frames. ], batch size: 351, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:48:00,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=22200.0, ans=0.125 2023-06-17 18:48:06,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 4.996e+02 7.328e+02 1.013e+03 2.628e+03, threshold=1.466e+03, percent-clipped=16.0 2023-06-17 18:48:38,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. 
limit=6.0 2023-06-17 18:48:44,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=22380.0, ans=0.2 2023-06-17 18:49:32,173 INFO [train.py:996] (3/4) Epoch 1, batch 3750, loss[loss=0.3816, simple_loss=0.3757, pruned_loss=0.1937, over 20235.00 frames. ], tot_loss[loss=0.4461, simple_loss=0.4529, pruned_loss=0.2196, over 4285587.97 frames. ], batch size: 703, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:50:04,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=22560.0, ans=0.125 2023-06-17 18:50:07,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-17 18:50:13,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22620.0, ans=0.125 2023-06-17 18:50:33,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-17 18:51:16,277 INFO [train.py:996] (3/4) Epoch 1, batch 3800, loss[loss=0.4608, simple_loss=0.4618, pruned_loss=0.2299, over 19987.00 frames. ], tot_loss[loss=0.4424, simple_loss=0.4511, pruned_loss=0.2169, over 4287321.30 frames. ], batch size: 703, lr: 4.25e-02, grad_scale: 8.0 2023-06-17 18:51:23,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=22800.0, ans=0.125 2023-06-17 18:51:31,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=22800.0, ans=0.125 2023-06-17 18:51:39,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.895e+02 5.418e+02 7.571e+02 2.562e+03, threshold=1.084e+03, percent-clipped=5.0 2023-06-17 18:52:38,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=22980.0, ans=0.125 2023-06-17 18:52:58,984 INFO [train.py:996] (3/4) Epoch 1, batch 3850, loss[loss=0.4184, simple_loss=0.4154, pruned_loss=0.2108, over 21874.00 frames. ], tot_loss[loss=0.4405, simple_loss=0.4482, pruned_loss=0.2164, over 4282522.72 frames. 
], batch size: 373, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:53:33,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=23160.0, ans=0.035 2023-06-17 18:53:38,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=23220.0, ans=0.125 2023-06-17 18:53:58,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=23280.0, ans=0.125 2023-06-17 18:53:59,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23280.0, ans=0.1 2023-06-17 18:54:17,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23280.0, ans=0.0 2023-06-17 18:54:21,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23280.0, ans=0.125 2023-06-17 18:54:40,119 INFO [train.py:996] (3/4) Epoch 1, batch 3900, loss[loss=0.4328, simple_loss=0.4218, pruned_loss=0.2219, over 21233.00 frames. ], tot_loss[loss=0.4361, simple_loss=0.4432, pruned_loss=0.2145, over 4284944.90 frames. ], batch size: 548, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:54:59,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.722e+02 6.490e+02 9.055e+02 2.329e+03, threshold=1.298e+03, percent-clipped=15.0 2023-06-17 18:55:25,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23520.0, ans=0.1 2023-06-17 18:55:40,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=23580.0, ans=0.125 2023-06-17 18:56:25,079 INFO [train.py:996] (3/4) Epoch 1, batch 3950, loss[loss=0.2732, simple_loss=0.3166, pruned_loss=0.1149, over 21317.00 frames. ], tot_loss[loss=0.433, simple_loss=0.4429, pruned_loss=0.2116, over 4291922.11 frames. ], batch size: 176, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 18:56:28,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=23700.0, ans=0.125 2023-06-17 18:56:34,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23700.0, ans=0.1 2023-06-17 18:56:51,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23760.0, ans=0.125 2023-06-17 18:56:56,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23760.0, ans=0.0 2023-06-17 18:57:20,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23820.0, ans=0.1 2023-06-17 18:58:09,749 INFO [train.py:996] (3/4) Epoch 1, batch 4000, loss[loss=0.3767, simple_loss=0.3834, pruned_loss=0.185, over 21787.00 frames. ], tot_loss[loss=0.4233, simple_loss=0.4357, pruned_loss=0.2055, over 4287658.38 frames. 
], batch size: 124, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 18:58:15,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=24000.0, ans=0.0 2023-06-17 18:58:21,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=24000.0, ans=0.2 2023-06-17 18:58:33,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.808e+02 4.109e+02 5.052e+02 7.332e+02 1.857e+03, threshold=1.010e+03, percent-clipped=6.0 2023-06-17 18:58:34,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-17 18:58:43,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24060.0, ans=0.1 2023-06-17 18:58:55,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=24120.0, ans=0.04949747468305833 2023-06-17 18:59:45,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=24240.0, ans=0.2 2023-06-17 18:59:46,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=24240.0, ans=0.2 2023-06-17 18:59:49,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=24240.0, ans=0.2 2023-06-17 18:59:52,451 INFO [train.py:996] (3/4) Epoch 1, batch 4050, loss[loss=0.3766, simple_loss=0.3739, pruned_loss=0.1897, over 20840.00 frames. ], tot_loss[loss=0.4175, simple_loss=0.4327, pruned_loss=0.2012, over 4277846.27 frames. ], batch size: 613, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:00:00,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=24300.0, ans=0.125 2023-06-17 19:01:14,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=24480.0, ans=0.125 2023-06-17 19:01:35,938 INFO [train.py:996] (3/4) Epoch 1, batch 4100, loss[loss=0.4092, simple_loss=0.4198, pruned_loss=0.1993, over 21933.00 frames. ], tot_loss[loss=0.4193, simple_loss=0.4338, pruned_loss=0.2024, over 4285798.92 frames. ], batch size: 316, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:01:59,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=24660.0, ans=0.005508695652173913 2023-06-17 19:02:00,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 4.142e+02 6.350e+02 1.020e+03 2.376e+03, threshold=1.270e+03, percent-clipped=25.0 2023-06-17 19:02:38,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=24720.0, ans=0.125 2023-06-17 19:02:53,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24780.0, ans=0.1 2023-06-17 19:03:00,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24780.0, ans=0.1 2023-06-17 19:03:08,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-17 19:03:19,135 INFO [train.py:996] (3/4) Epoch 1, batch 4150, loss[loss=0.4565, simple_loss=0.4587, pruned_loss=0.2271, over 21068.00 frames. ], tot_loss[loss=0.4119, simple_loss=0.4319, pruned_loss=0.196, over 4271578.83 frames. ], batch size: 608, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:03:19,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=15.0 2023-06-17 19:03:42,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=24960.0, ans=0.125 2023-06-17 19:04:05,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=24960.0, ans=0.0 2023-06-17 19:04:22,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=25020.0, ans=0.125 2023-06-17 19:04:43,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:04:48,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-17 19:05:05,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25140.0, ans=0.1 2023-06-17 19:05:09,985 INFO [train.py:996] (3/4) Epoch 1, batch 4200, loss[loss=0.53, simple_loss=0.5098, pruned_loss=0.2752, over 21355.00 frames. ], tot_loss[loss=0.4108, simple_loss=0.4312, pruned_loss=0.1951, over 4275542.51 frames. ], batch size: 548, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:05:10,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=25200.0, ans=0.04949747468305833 2023-06-17 19:05:19,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-17 19:05:46,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 4.243e+02 5.422e+02 7.726e+02 1.559e+03, threshold=1.084e+03, percent-clipped=3.0 2023-06-17 19:06:20,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=25380.0, ans=0.125 2023-06-17 19:06:20,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25380.0, ans=0.1 2023-06-17 19:06:21,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=25380.0, ans=0.0053521739130434785 2023-06-17 19:06:26,850 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:06:44,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=25440.0, ans=0.0053391304347826084 2023-06-17 19:06:49,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25440.0, ans=0.1 2023-06-17 19:07:07,383 INFO [train.py:996] (3/4) Epoch 1, batch 4250, loss[loss=0.4476, simple_loss=0.4321, pruned_loss=0.2315, over 20215.00 frames. 
], tot_loss[loss=0.4191, simple_loss=0.4396, pruned_loss=0.1993, over 4272573.45 frames. ], batch size: 702, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:07:14,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=25500.0, ans=10.0 2023-06-17 19:07:22,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=25500.0, ans=0.2 2023-06-17 19:07:49,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25620.0, ans=0.125 2023-06-17 19:08:58,882 INFO [train.py:996] (3/4) Epoch 1, batch 4300, loss[loss=0.3734, simple_loss=0.4059, pruned_loss=0.1705, over 21301.00 frames. ], tot_loss[loss=0.4261, simple_loss=0.4477, pruned_loss=0.2023, over 4277333.72 frames. ], batch size: 176, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:09:12,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=25800.0, ans=0.0 2023-06-17 19:09:18,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 4.313e+02 6.396e+02 8.892e+02 2.391e+03, threshold=1.279e+03, percent-clipped=16.0 2023-06-17 19:10:42,356 INFO [train.py:996] (3/4) Epoch 1, batch 4350, loss[loss=0.3316, simple_loss=0.3458, pruned_loss=0.1587, over 21220.00 frames. ], tot_loss[loss=0.4243, simple_loss=0.4445, pruned_loss=0.2021, over 4266436.42 frames. ], batch size: 548, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:10:54,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=26100.0, ans=0.125 2023-06-17 19:11:28,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-17 19:12:16,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-17 19:12:27,476 INFO [train.py:996] (3/4) Epoch 1, batch 4400, loss[loss=0.4834, simple_loss=0.5052, pruned_loss=0.2308, over 21656.00 frames. ], tot_loss[loss=0.4199, simple_loss=0.4394, pruned_loss=0.2002, over 4268760.25 frames. ], batch size: 414, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 19:12:48,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.456e+02 3.803e+02 5.319e+02 7.173e+02 2.856e+03, threshold=1.064e+03, percent-clipped=8.0 2023-06-17 19:12:55,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=26460.0, ans=0.125 2023-06-17 19:14:13,424 INFO [train.py:996] (3/4) Epoch 1, batch 4450, loss[loss=0.5359, simple_loss=0.5359, pruned_loss=0.2679, over 21573.00 frames. ], tot_loss[loss=0.4234, simple_loss=0.4456, pruned_loss=0.2007, over 4272299.33 frames. 
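
Note on grad_scale: the grad_scale field in the batch records is the dynamic fp16 loss scale. It is halved when a scaled gradient overflows (hence the drops through 16.0, 8.0, and 4.0 across these records) and grows back after a long enough run of finite-gradient steps (returning to 16.0 by batch 4400). This is standard mixed-precision bookkeeping; a sketch with PyTorch's GradScaler, whose init_scale and growth_interval here are assumptions:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=1.0,        # assumed starting scale
        growth_factor=2.0,     # e.g. 8.0 -> 16.0 after a clean run
        backoff_factor=0.5,    # e.g. 16.0 -> 8.0 on inf/NaN gradients
        growth_interval=2000,  # assumed number of clean steps before growing
    )

    def fp16_step(optimizer, loss):
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips step on overflow
        scaler.update()                # adjusts grad_scale as seen in the log
        optimizer.zero_grad()
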
], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:14:48,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=26760.0, ans=0.125 2023-06-17 19:15:21,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=26880.0, ans=0.09899494936611666 2023-06-17 19:15:34,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=26940.0, ans=0.05 2023-06-17 19:15:37,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.93 vs. limit=15.0 2023-06-17 19:15:38,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26940.0, ans=0.1 2023-06-17 19:15:51,997 INFO [train.py:996] (3/4) Epoch 1, batch 4500, loss[loss=0.4593, simple_loss=0.444, pruned_loss=0.2373, over 20182.00 frames. ], tot_loss[loss=0.4282, simple_loss=0.4483, pruned_loss=0.204, over 4279282.49 frames. ], batch size: 707, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:15:54,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=27000.0, ans=0.2 2023-06-17 19:16:17,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=27060.0, ans=0.125 2023-06-17 19:16:19,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.691e+02 6.117e+02 8.779e+02 1.856e+03, threshold=1.223e+03, percent-clipped=14.0 2023-06-17 19:16:43,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=27120.0, ans=0.125 2023-06-17 19:17:36,015 INFO [train.py:996] (3/4) Epoch 1, batch 4550, loss[loss=0.499, simple_loss=0.5069, pruned_loss=0.2456, over 21347.00 frames. ], tot_loss[loss=0.4294, simple_loss=0.4512, pruned_loss=0.2038, over 4282318.13 frames. ], batch size: 549, lr: 4.16e-02, grad_scale: 8.0 2023-06-17 19:17:59,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=27360.0, ans=0.125 2023-06-17 19:19:19,233 INFO [train.py:996] (3/4) Epoch 1, batch 4600, loss[loss=0.3855, simple_loss=0.4181, pruned_loss=0.1765, over 21832.00 frames. ], tot_loss[loss=0.4338, simple_loss=0.4535, pruned_loss=0.207, over 4282206.01 frames. 
], batch size: 351, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:19:27,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=27600.0, ans=0.125 2023-06-17 19:19:32,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=27600.0, ans=0.07 2023-06-17 19:19:46,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.894e+02 4.493e+02 6.587e+02 9.549e+02 1.987e+03, threshold=1.317e+03, percent-clipped=15.0 2023-06-17 19:19:58,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27660.0, ans=0.125 2023-06-17 19:20:13,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27720.0, ans=0.125 2023-06-17 19:20:23,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27780.0, ans=0.125 2023-06-17 19:20:44,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=27840.0, ans=0.125 2023-06-17 19:21:02,837 INFO [train.py:996] (3/4) Epoch 1, batch 4650, loss[loss=0.3301, simple_loss=0.3682, pruned_loss=0.146, over 21774.00 frames. ], tot_loss[loss=0.422, simple_loss=0.4419, pruned_loss=0.2011, over 4286791.99 frames. ], batch size: 391, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:21:41,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27960.0, ans=0.1 2023-06-17 19:21:43,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-17 19:21:56,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28020.0, ans=0.1 2023-06-17 19:22:07,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-17 19:22:40,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.41 vs. limit=15.0 2023-06-17 19:22:40,783 INFO [train.py:996] (3/4) Epoch 1, batch 4700, loss[loss=0.3472, simple_loss=0.3567, pruned_loss=0.1689, over 21238.00 frames. ], tot_loss[loss=0.4114, simple_loss=0.4308, pruned_loss=0.196, over 4285498.00 frames. 
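The slow lr decay across these entries (4.20e-02 near batch 4250 down to 4.14e-02 by batch 4700) is consistent with icefall's Eden schedule. A sketch follows; base_lr and lr_batches are assumptions back-fitted to the logged values, and the epoch factor is 1.0 here because the epoch index is still 0 during epoch 1:

```python
def eden_lr(batch: int, epoch: int, base_lr: float = 0.045,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    # Eden-style schedule; the constants are back-fitted to this log.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"{eden_lr(4650, 0):.2e}")  # 4.15e-02, matching the batch 4650 entry above
```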
], batch size: 159, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:23:12,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.969e+02 4.560e+02 5.738e+02 6.731e+02 1.328e+03, threshold=1.148e+03, percent-clipped=1.0 2023-06-17 19:23:25,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=28260.0, ans=0.004726086956521739 2023-06-17 19:23:33,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28320.0, ans=0.2 2023-06-17 19:23:39,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28320.0, ans=0.1 2023-06-17 19:23:46,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28380.0, ans=0.125 2023-06-17 19:23:46,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-17 19:23:55,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=28380.0, ans=0.0047 2023-06-17 19:24:01,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=28440.0, ans=0.125 2023-06-17 19:24:02,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=15.0 2023-06-17 19:24:22,985 INFO [train.py:996] (3/4) Epoch 1, batch 4750, loss[loss=0.4234, simple_loss=0.4221, pruned_loss=0.2124, over 21323.00 frames. ], tot_loss[loss=0.4094, simple_loss=0.4259, pruned_loss=0.1965, over 4285220.29 frames. ], batch size: 159, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:24:30,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-17 19:24:33,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-17 19:25:03,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=28560.0, ans=0.05 2023-06-17 19:25:16,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-17 19:25:22,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=28680.0, ans=0.2 2023-06-17 19:26:08,413 INFO [train.py:996] (3/4) Epoch 1, batch 4800, loss[loss=0.3855, simple_loss=0.4045, pruned_loss=0.1833, over 21305.00 frames. ], tot_loss[loss=0.4129, simple_loss=0.4289, pruned_loss=0.1984, over 4284047.95 frames. 
], batch size: 143, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 19:26:40,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.777e+02 4.396e+02 5.630e+02 9.544e+02 1.768e+03, threshold=1.126e+03, percent-clipped=14.0 2023-06-17 19:27:01,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28920.0, ans=0.125 2023-06-17 19:27:01,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28920.0, ans=0.125 2023-06-17 19:27:44,666 INFO [train.py:996] (3/4) Epoch 1, batch 4850, loss[loss=0.4116, simple_loss=0.4319, pruned_loss=0.1957, over 21823.00 frames. ], tot_loss[loss=0.4109, simple_loss=0.4275, pruned_loss=0.1971, over 4282366.71 frames. ], batch size: 332, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:28:09,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=29100.0, ans=0.125 2023-06-17 19:28:15,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-17 19:28:33,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.22 vs. limit=6.0 2023-06-17 19:28:36,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29220.0, ans=0.0 2023-06-17 19:28:39,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=29220.0, ans=0.04949747468305833 2023-06-17 19:28:56,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=29280.0, ans=0.2 2023-06-17 19:29:28,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=29400.0, ans=0.0 2023-06-17 19:29:29,276 INFO [train.py:996] (3/4) Epoch 1, batch 4900, loss[loss=0.5376, simple_loss=0.5294, pruned_loss=0.2729, over 21516.00 frames. ], tot_loss[loss=0.4151, simple_loss=0.4316, pruned_loss=0.1993, over 4279997.21 frames. ], batch size: 508, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:30:02,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.351e+02 5.424e+02 7.801e+02 1.566e+03, threshold=1.085e+03, percent-clipped=9.0 2023-06-17 19:30:30,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=29520.0, ans=0.2 2023-06-17 19:30:33,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=29520.0, ans=10.0 2023-06-17 19:30:58,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29640.0, ans=0.1 2023-06-17 19:31:00,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=29640.0, ans=0.125 2023-06-17 19:31:25,672 INFO [train.py:996] (3/4) Epoch 1, batch 4950, loss[loss=0.4454, simple_loss=0.4806, pruned_loss=0.2051, over 21649.00 frames. ], tot_loss[loss=0.4139, simple_loss=0.4359, pruned_loss=0.1959, over 4275096.95 frames. 
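Each optim.py:471 line reports five grad-norm order statistics (min, 25%, median, 75%, max) over recent batches, and in every entry the logged threshold equals Clipping_scale (2.0) times the median, e.g. 2.0 × 5.630e+02 = 1.126e+03 just above. A hypothetical re-implementation of that bookkeeping; the history length and the cumulative percent-clipped counter are assumptions:

```python
from collections import deque
import torch

class QuartileGradClipper:
    """Sketch: clip gradients to clipping_scale * median of recent grad norms."""

    def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # total grad norms of recent batches
        self.num_clipped = 0
        self.num_batches = 0

    def __call__(self, model: torch.nn.Module):
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        s = sorted(self.norms)
        # min / 25% / median / 75% / max, as printed in the log
        quartiles = [s[int(q * (len(s) - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]  # 2.0 x median
        self.num_batches += 1
        if norm > threshold:
            self.num_clipped += 1
            for g in grads:
                g.mul_(threshold / norm)  # rescale so total norm == threshold
        pct = 100.0 * self.num_clipped / self.num_batches  # fraction clipped so far
        return quartiles, threshold, pct
```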
], batch size: 441, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 19:32:01,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=29820.0, ans=0.2 2023-06-17 19:32:52,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=29940.0, ans=0.0 2023-06-17 19:33:07,723 INFO [train.py:996] (3/4) Epoch 1, batch 5000, loss[loss=0.3188, simple_loss=0.382, pruned_loss=0.1278, over 21421.00 frames. ], tot_loss[loss=0.4062, simple_loss=0.432, pruned_loss=0.1902, over 4273338.40 frames. ], batch size: 194, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:33:34,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 4.453e+02 5.189e+02 7.873e+02 1.529e+03, threshold=1.038e+03, percent-clipped=6.0 2023-06-17 19:33:41,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=22.5 2023-06-17 19:33:50,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=30120.0, ans=0.004321739130434783 2023-06-17 19:34:26,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30240.0, ans=0.1 2023-06-17 19:34:43,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=30300.0, ans=0.125 2023-06-17 19:34:45,159 INFO [train.py:996] (3/4) Epoch 1, batch 5050, loss[loss=0.3952, simple_loss=0.4135, pruned_loss=0.1884, over 21581.00 frames. ], tot_loss[loss=0.4082, simple_loss=0.4325, pruned_loss=0.192, over 4277644.90 frames. ], batch size: 195, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:35:26,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=30420.0, ans=0.2 2023-06-17 19:35:41,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=30420.0, ans=0.125 2023-06-17 19:35:57,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=22.5 2023-06-17 19:36:00,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.86 vs. limit=15.0 2023-06-17 19:36:27,901 INFO [train.py:996] (3/4) Epoch 1, batch 5100, loss[loss=0.4066, simple_loss=0.4169, pruned_loss=0.1982, over 21389.00 frames. ], tot_loss[loss=0.408, simple_loss=0.4321, pruned_loss=0.192, over 4278123.85 frames. ], batch size: 159, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:36:42,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. 
limit=15.0 2023-06-17 19:36:59,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 4.521e+02 5.607e+02 7.657e+02 1.284e+03, threshold=1.121e+03, percent-clipped=8.0 2023-06-17 19:37:05,147 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:37:21,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30720.0, ans=0.1 2023-06-17 19:38:11,321 INFO [train.py:996] (3/4) Epoch 1, batch 5150, loss[loss=0.4281, simple_loss=0.4466, pruned_loss=0.2048, over 21802.00 frames. ], tot_loss[loss=0.4104, simple_loss=0.4326, pruned_loss=0.1941, over 4281491.69 frames. ], batch size: 332, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:38:21,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=30900.0, ans=0.125 2023-06-17 19:38:26,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=30900.0, ans=0.0 2023-06-17 19:38:41,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=30960.0, ans=0.125 2023-06-17 19:38:46,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=30960.0, ans=0.2 2023-06-17 19:39:11,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=31020.0, ans=0.125 2023-06-17 19:39:13,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=31020.0, ans=0.0 2023-06-17 19:40:01,173 INFO [train.py:996] (3/4) Epoch 1, batch 5200, loss[loss=0.3545, simple_loss=0.3989, pruned_loss=0.155, over 21249.00 frames. ], tot_loss[loss=0.4112, simple_loss=0.4343, pruned_loss=0.1941, over 4283868.81 frames. ], batch size: 159, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 19:40:13,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=31200.0, ans=0.2 2023-06-17 19:40:27,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.519e+02 4.450e+02 5.949e+02 9.427e+02 1.654e+03, threshold=1.190e+03, percent-clipped=14.0 2023-06-17 19:41:05,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=31380.0, ans=0.05 2023-06-17 19:41:44,199 INFO [train.py:996] (3/4) Epoch 1, batch 5250, loss[loss=0.522, simple_loss=0.5066, pruned_loss=0.2687, over 21770.00 frames. ], tot_loss[loss=0.4078, simple_loss=0.4354, pruned_loss=0.1901, over 4281964.70 frames. 
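The scaling.py:182 lines each report the current value (ans=...) of a ScheduledFloat: a scalar hyper-parameter (dropout rate, skip rate, balancer probability, whitening limit, ...) interpolated piecewise-linearly in batch_count. A simplified reconstruction of that behavior, not icefall's exact class:

```python
class ScheduledFloat:
    """Simplified sketch: a float whose value is piecewise-linear in batch_count."""

    def __init__(self, *points, default: float = 0.0):
        self.points = sorted(points)  # (batch_count, value) breakpoints
        self.default = default
        self.batch_count = None       # set externally as training progresses

    def __float__(self) -> float:
        p = self.points
        if self.batch_count is None or not p:
            return float(self.default)
        b = self.batch_count
        if b <= p[0][0]:
            return float(p[0][1])
        if b >= p[-1][0]:
            return float(p[-1][1])
        for (x0, y0), (x1, y1) in zip(p[:-1], p[1:]):
            if x0 <= b <= x1:
                t = (b - x0) / (x1 - x0)
                return float(y0 + t * (y1 - y0))

# e.g. a hypothetical skip-rate decaying 0.5 -> 0.0 over the first 20k batches:
skip = ScheduledFloat((0.0, 0.5), (20000.0, 0.0))
skip.batch_count = 25800.0
assert float(skip) == 0.0  # clamped past the last breakpoint, as in ans=0.0 above
```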
], batch size: 441, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:41:56,647 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:42:21,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=31620.0, ans=0.2 2023-06-17 19:42:34,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=31620.0, ans=0.0 2023-06-17 19:42:41,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=31680.0, ans=0.125 2023-06-17 19:42:43,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=31680.0, ans=0.125 2023-06-17 19:43:06,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=31740.0, ans=0.07 2023-06-17 19:43:07,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=31740.0, ans=0.1 2023-06-17 19:43:25,578 INFO [train.py:996] (3/4) Epoch 1, batch 5300, loss[loss=0.439, simple_loss=0.4415, pruned_loss=0.2183, over 21919.00 frames. ], tot_loss[loss=0.4105, simple_loss=0.4364, pruned_loss=0.1923, over 4291294.90 frames. ], batch size: 414, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:43:53,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.200e+02 5.076e+02 7.002e+02 1.420e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-17 19:44:19,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-17 19:45:01,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=32040.0, ans=0.0039043478260869556 2023-06-17 19:45:07,629 INFO [train.py:996] (3/4) Epoch 1, batch 5350, loss[loss=0.4214, simple_loss=0.4223, pruned_loss=0.2102, over 21467.00 frames. ], tot_loss[loss=0.4127, simple_loss=0.4365, pruned_loss=0.1944, over 4290268.78 frames. ], batch size: 159, lr: 4.06e-02, grad_scale: 16.0 2023-06-17 19:45:26,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-17 19:45:47,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32160.0, ans=0.125 2023-06-17 19:45:58,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32220.0, ans=0.125 2023-06-17 19:46:36,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=32340.0, ans=0.125 2023-06-17 19:46:42,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=32340.0, ans=0.0 2023-06-17 19:46:50,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=32340.0, ans=0.04949747468305833 2023-06-17 19:46:54,973 INFO [train.py:996] (3/4) Epoch 1, batch 5400, loss[loss=0.3457, simple_loss=0.3949, pruned_loss=0.1483, over 21754.00 frames. ], tot_loss[loss=0.4135, simple_loss=0.4353, pruned_loss=0.1958, over 4294147.70 frames. 
], batch size: 391, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:47:23,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 4.680e+02 5.760e+02 7.952e+02 1.690e+03, threshold=1.152e+03, percent-clipped=11.0 2023-06-17 19:47:25,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=32460.0, ans=0.2 2023-06-17 19:47:30,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-17 19:48:05,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=32580.0, ans=0.2 2023-06-17 19:48:12,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=32580.0, ans=0.0 2023-06-17 19:48:18,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32640.0, ans=0.1 2023-06-17 19:48:28,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32640.0, ans=0.1 2023-06-17 19:48:38,263 INFO [train.py:996] (3/4) Epoch 1, batch 5450, loss[loss=0.3989, simple_loss=0.4687, pruned_loss=0.1646, over 21688.00 frames. ], tot_loss[loss=0.4103, simple_loss=0.4349, pruned_loss=0.1928, over 4297400.08 frames. ], batch size: 247, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:48:55,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=32700.0, ans=0.2 2023-06-17 19:49:23,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32820.0, ans=0.125 2023-06-17 19:49:36,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-17 19:50:00,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=32880.0, ans=0.125 2023-06-17 19:50:04,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=32940.0, ans=0.04949747468305833 2023-06-17 19:50:26,861 INFO [train.py:996] (3/4) Epoch 1, batch 5500, loss[loss=0.5203, simple_loss=0.5861, pruned_loss=0.2272, over 19799.00 frames. ], tot_loss[loss=0.4067, simple_loss=0.4369, pruned_loss=0.1882, over 4291966.07 frames. ], batch size: 702, lr: 4.04e-02, grad_scale: 16.0 2023-06-17 19:50:49,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.509e+02 4.085e+02 5.638e+02 7.299e+02 1.416e+03, threshold=1.128e+03, percent-clipped=6.0 2023-06-17 19:50:52,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=15.0 2023-06-17 19:51:33,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=33180.0, ans=0.125 2023-06-17 19:51:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=33180.0, ans=0.003656521739130435 2023-06-17 19:52:13,360 INFO [train.py:996] (3/4) Epoch 1, batch 5550, loss[loss=0.3097, simple_loss=0.3667, pruned_loss=0.1264, over 21673.00 frames. 
], tot_loss[loss=0.3987, simple_loss=0.4321, pruned_loss=0.1826, over 4287641.58 frames. ], batch size: 247, lr: 4.03e-02, grad_scale: 16.0 2023-06-17 19:53:14,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=33420.0, ans=0.2 2023-06-17 19:53:56,944 INFO [train.py:996] (3/4) Epoch 1, batch 5600, loss[loss=0.4337, simple_loss=0.4511, pruned_loss=0.2082, over 20013.00 frames. ], tot_loss[loss=0.392, simple_loss=0.4286, pruned_loss=0.1777, over 4285120.15 frames. ], batch size: 702, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 19:53:57,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=33600.0, ans=0.2 2023-06-17 19:54:17,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=33660.0, ans=0.125 2023-06-17 19:54:29,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 4.088e+02 5.346e+02 7.510e+02 1.919e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-17 19:54:30,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=33660.0, ans=0.0 2023-06-17 19:54:50,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-17 19:54:54,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=33720.0, ans=0.0 2023-06-17 19:55:38,020 INFO [train.py:996] (3/4) Epoch 1, batch 5650, loss[loss=0.4484, simple_loss=0.5003, pruned_loss=0.1982, over 21213.00 frames. ], tot_loss[loss=0.3989, simple_loss=0.4336, pruned_loss=0.1821, over 4282189.85 frames. ], batch size: 548, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:56:36,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34020.0, ans=0.125 2023-06-17 19:56:44,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.79 vs. limit=22.5 2023-06-17 19:57:27,458 INFO [train.py:996] (3/4) Epoch 1, batch 5700, loss[loss=0.3521, simple_loss=0.403, pruned_loss=0.1506, over 21749.00 frames. ], tot_loss[loss=0.4007, simple_loss=0.4323, pruned_loss=0.1846, over 4282173.60 frames. ], batch size: 282, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:57:59,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=34260.0, ans=0.125 2023-06-17 19:58:00,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.144e+02 5.223e+02 7.602e+02 1.708e+03, threshold=1.045e+03, percent-clipped=9.0 2023-06-17 19:58:04,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=34260.0, ans=0.125 2023-06-17 19:58:19,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. 
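The scaling.py:962 lines compare a whitening metric against a (scheduled) limit; when the metric exceeds the limit, the module penalizes activations in the backward pass to push their covariance toward a multiple of the identity. A sketch of one plausible such metric, normalized so that perfectly white features score about 1.0; the exact formula in scaling.py may differ:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels). Returns ~1.0 when the channel covariance
    # is proportional to the identity, growing as it becomes less isotropic.
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups                              # channels per group
    xg = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)   # (g, frames, cpg)
    covar = torch.matmul(xg.transpose(1, 2), xg) / num_frames     # (g, cpg, cpg)
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    mean_sq = (covar ** 2).sum() / (num_groups * cpg)
    return mean_sq / (mean_diag ** 2 + 1e-20)

# Near-white input sits far below the limits logged here (e.g. 15.0):
print(whitening_metric(torch.randn(10000, 256)))  # ~1.0
```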
limit=15.0 2023-06-17 19:58:27,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=34320.0, ans=0.125 2023-06-17 19:58:50,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=34440.0, ans=0.2 2023-06-17 19:59:11,800 INFO [train.py:996] (3/4) Epoch 1, batch 5750, loss[loss=0.3773, simple_loss=0.4558, pruned_loss=0.1494, over 20808.00 frames. ], tot_loss[loss=0.3928, simple_loss=0.4262, pruned_loss=0.1798, over 4271743.19 frames. ], batch size: 608, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 19:59:30,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=34500.0, ans=0.05 2023-06-17 19:59:32,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=34560.0, ans=0.0 2023-06-17 20:00:01,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34620.0, ans=0.1 2023-06-17 20:00:08,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=34620.0, ans=0.125 2023-06-17 20:00:35,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=34740.0, ans=0.003317391304347826 2023-06-17 20:00:59,960 INFO [train.py:996] (3/4) Epoch 1, batch 5800, loss[loss=0.277, simple_loss=0.3137, pruned_loss=0.1202, over 21935.00 frames. ], tot_loss[loss=0.3878, simple_loss=0.4234, pruned_loss=0.1761, over 4266759.72 frames. ], batch size: 98, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:01:28,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.873e+02 4.586e+02 6.036e+02 1.114e+03, threshold=9.172e+02, percent-clipped=1.0 2023-06-17 20:01:39,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34920.0, ans=0.1 2023-06-17 20:01:53,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-17 20:02:00,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-17 20:02:03,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=34980.0, ans=0.125 2023-06-17 20:02:12,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=34980.0, ans=0.5 2023-06-17 20:02:16,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-17 20:02:43,506 INFO [train.py:996] (3/4) Epoch 1, batch 5850, loss[loss=0.3893, simple_loss=0.4693, pruned_loss=0.1547, over 19845.00 frames. ], tot_loss[loss=0.3737, simple_loss=0.4161, pruned_loss=0.1656, over 4271882.77 frames. 
], batch size: 702, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:02:46,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=35100.0, ans=0.015 2023-06-17 20:02:52,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=35100.0, ans=0.2 2023-06-17 20:03:04,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35160.0, ans=0.1 2023-06-17 20:04:18,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-17 20:04:20,743 INFO [train.py:996] (3/4) Epoch 1, batch 5900, loss[loss=0.3583, simple_loss=0.393, pruned_loss=0.1618, over 21561.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.4053, pruned_loss=0.1561, over 4277606.01 frames. ], batch size: 212, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 20:04:24,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=35400.0, ans=0.0 2023-06-17 20:04:42,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35460.0, ans=0.0 2023-06-17 20:04:44,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=22.5 2023-06-17 20:04:48,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 3.363e+02 4.037e+02 5.226e+02 1.298e+03, threshold=8.074e+02, percent-clipped=7.0 2023-06-17 20:05:25,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=35580.0, ans=0.125 2023-06-17 20:05:52,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35640.0, ans=0.1 2023-06-17 20:05:53,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=35640.0, ans=0.1 2023-06-17 20:05:55,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=35640.0, ans=0.05 2023-06-17 20:06:08,245 INFO [train.py:996] (3/4) Epoch 1, batch 5950, loss[loss=0.4189, simple_loss=0.4211, pruned_loss=0.2083, over 21481.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4092, pruned_loss=0.1657, over 4275478.98 frames. ], batch size: 389, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:06:37,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-17 20:06:41,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35820.0, ans=0.125 2023-06-17 20:06:50,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=35820.0, ans=0.125 2023-06-17 20:07:12,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35880.0, ans=0.1 2023-06-17 20:07:37,780 INFO [train.py:996] (3/4) Epoch 1, batch 6000, loss[loss=0.4039, simple_loss=0.4208, pruned_loss=0.1935, over 21500.00 frames. 
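The grad_scale field tracks fp16 dynamic loss scaling, and its trajectory in this log (8.0 → 16.0 → 32.0, with occasional drops back) matches torch.cuda.amp.GradScaler, which grows the scale after a run of overflow-free steps and halves it when gradients overflow. A minimal training step using the standard API; init_scale here is arbitrary, and compute_loss is a stand-in for the training criterion:

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=8.0)

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # grows the scale after a clean run,
                                   # halves it after an overflow
    return loss.detach(), scaler.get_scale()
```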
], tot_loss[loss=0.3766, simple_loss=0.4092, pruned_loss=0.172, over 4275954.06 frames. ], batch size: 548, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:07:37,781 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 20:07:56,503 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3636, simple_loss=0.4388, pruned_loss=0.1442, over 1796401.00 frames. 2023-06-17 20:07:56,503 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-17 20:08:19,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.037e+02 4.782e+02 6.358e+02 7.928e+02 1.970e+03, threshold=1.272e+03, percent-clipped=23.0 2023-06-17 20:08:59,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36180.0, ans=0.1 2023-06-17 20:09:25,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-17 20:09:34,143 INFO [train.py:996] (3/4) Epoch 1, batch 6050, loss[loss=0.2918, simple_loss=0.3494, pruned_loss=0.1171, over 21554.00 frames. ], tot_loss[loss=0.3756, simple_loss=0.4057, pruned_loss=0.1728, over 4281809.14 frames. ], batch size: 230, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 20:09:44,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=36300.0, ans=0.0029782608695652175 2023-06-17 20:09:53,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36360.0, ans=0.125 2023-06-17 20:10:09,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.52 vs. limit=22.5 2023-06-17 20:11:07,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-17 20:11:15,934 INFO [train.py:996] (3/4) Epoch 1, batch 6100, loss[loss=0.384, simple_loss=0.4027, pruned_loss=0.1827, over 21616.00 frames. ], tot_loss[loss=0.3735, simple_loss=0.4034, pruned_loss=0.1718, over 4286903.32 frames. ], batch size: 230, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:11:25,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36600.0, ans=0.125 2023-06-17 20:11:25,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36600.0, ans=0.1 2023-06-17 20:11:28,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=36600.0, ans=0.0 2023-06-17 20:11:29,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.14 vs. 
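The "Computing validation loss" entry scores a fixed dev set (hence the 1796401.00-frame total) with gradients disabled, and the following line reads the CUDA allocator's high-water mark. A sketch of that loop, not icefall's actual code; compute_loss is again a hypothetical stand-in:

```python
import torch

def compute_validation_loss(model, valid_dl, compute_loss, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()
    # Peak-memory line, as logged after each validation pass:
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
    return tot_loss / tot_frames
```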
limit=12.0 2023-06-17 20:11:36,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=36660.0, ans=0.0 2023-06-17 20:11:38,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 4.105e+02 5.881e+02 8.261e+02 1.678e+03, threshold=1.176e+03, percent-clipped=6.0 2023-06-17 20:11:47,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=36660.0, ans=0.5 2023-06-17 20:12:08,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=36720.0, ans=0.0 2023-06-17 20:12:18,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=36780.0, ans=0.2 2023-06-17 20:12:39,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=36840.0, ans=0.0 2023-06-17 20:12:42,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36840.0, ans=0.1 2023-06-17 20:12:57,167 INFO [train.py:996] (3/4) Epoch 1, batch 6150, loss[loss=0.35, simple_loss=0.387, pruned_loss=0.1565, over 21600.00 frames. ], tot_loss[loss=0.3835, simple_loss=0.4099, pruned_loss=0.1785, over 4294346.96 frames. ], batch size: 263, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:13:18,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36960.0, ans=0.1 2023-06-17 20:14:24,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=37140.0, ans=0.125 2023-06-17 20:14:31,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=37140.0, ans=0.125 2023-06-17 20:14:33,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-17 20:14:38,997 INFO [train.py:996] (3/4) Epoch 1, batch 6200, loss[loss=0.3804, simple_loss=0.3917, pruned_loss=0.1845, over 21194.00 frames. ], tot_loss[loss=0.3851, simple_loss=0.4127, pruned_loss=0.1788, over 4297478.05 frames. ], batch size: 608, lr: 3.95e-02, grad_scale: 32.0 2023-06-17 20:15:07,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.629e+02 4.899e+02 6.626e+02 1.862e+03, threshold=9.798e+02, percent-clipped=4.0 2023-06-17 20:15:52,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=37380.0, ans=0.1 2023-06-17 20:15:54,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=37380.0, ans=10.0 2023-06-17 20:16:06,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-17 20:16:15,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=37440.0, ans=0.0 2023-06-17 20:16:21,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. 
limit=15.0 2023-06-17 20:16:22,538 INFO [train.py:996] (3/4) Epoch 1, batch 6250, loss[loss=0.4599, simple_loss=0.5038, pruned_loss=0.208, over 21399.00 frames. ], tot_loss[loss=0.3871, simple_loss=0.418, pruned_loss=0.1782, over 4297527.46 frames. ], batch size: 548, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:17:32,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=37680.0, ans=0.04949747468305833 2023-06-17 20:17:44,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=37680.0, ans=0.125 2023-06-17 20:18:03,395 INFO [train.py:996] (3/4) Epoch 1, batch 6300, loss[loss=0.4293, simple_loss=0.4441, pruned_loss=0.2073, over 21846.00 frames. ], tot_loss[loss=0.388, simple_loss=0.4225, pruned_loss=0.1768, over 4291545.93 frames. ], batch size: 332, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:18:41,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.261e+02 6.027e+02 8.452e+02 1.541e+03, threshold=1.205e+03, percent-clipped=13.0 2023-06-17 20:19:14,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=37980.0, ans=0.125 2023-06-17 20:19:19,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2023-06-17 20:19:46,069 INFO [train.py:996] (3/4) Epoch 1, batch 6350, loss[loss=0.4682, simple_loss=0.4757, pruned_loss=0.2304, over 21801.00 frames. ], tot_loss[loss=0.398, simple_loss=0.4288, pruned_loss=0.1836, over 4293789.90 frames. ], batch size: 282, lr: 3.93e-02, grad_scale: 32.0 2023-06-17 20:19:54,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=38100.0, ans=0.125 2023-06-17 20:19:56,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=38100.0, ans=0.2 2023-06-17 20:20:13,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38100.0, ans=0.1 2023-06-17 20:20:15,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=38100.0, ans=0.0025869565217391307 2023-06-17 20:20:20,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=38160.0, ans=0.0 2023-06-17 20:20:57,643 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:21:46,033 INFO [train.py:996] (3/4) Epoch 1, batch 6400, loss[loss=0.4349, simple_loss=0.4483, pruned_loss=0.2108, over 21819.00 frames. ], tot_loss[loss=0.4098, simple_loss=0.4375, pruned_loss=0.1911, over 4288114.44 frames. 
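The fractional frame counts in tot_loss[... over 4297527.46 frames ...] indicate the tracker is an exponentially decayed running sum rather than a plain epoch average: with per-batch totals around 21k frames, a decay of 1 - 1/200 gives a steady-state window of roughly 200 batches, i.e. about 4.3M frames, matching the values logged here. A sketch under that assumption:

```python
class RunningLoss:
    """Sketch of the tot_loss tracker: exponentially decayed sums of loss and frames."""

    def __init__(self, reset_interval: int = 200):  # decay constant is an assumption
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames  # drifts to ~200x batch

    @property
    def loss(self) -> float:
        # The value printed as tot_loss[loss=..., over ... frames]
        return self.loss_sum / max(self.frames, 1.0)
```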
], batch size: 247, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:22:15,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.706e+02 4.358e+02 5.224e+02 7.258e+02 1.926e+03, threshold=1.045e+03, percent-clipped=7.0 2023-06-17 20:22:27,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=38520.0, ans=0.2 2023-06-17 20:22:28,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38520.0, ans=0.125 2023-06-17 20:22:48,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38580.0, ans=0.1 2023-06-17 20:22:48,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=38580.0, ans=0.2 2023-06-17 20:23:30,201 INFO [train.py:996] (3/4) Epoch 1, batch 6450, loss[loss=0.5263, simple_loss=0.5757, pruned_loss=0.2384, over 20786.00 frames. ], tot_loss[loss=0.4067, simple_loss=0.437, pruned_loss=0.1882, over 4287972.62 frames. ], batch size: 607, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:23:52,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=38760.0, ans=0.125 2023-06-17 20:24:14,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-17 20:24:21,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38820.0, ans=0.125 2023-06-17 20:24:32,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=38880.0, ans=0.125 2023-06-17 20:24:33,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=38880.0, ans=0.0 2023-06-17 20:25:09,296 INFO [train.py:996] (3/4) Epoch 1, batch 6500, loss[loss=0.3376, simple_loss=0.3613, pruned_loss=0.157, over 21198.00 frames. ], tot_loss[loss=0.4014, simple_loss=0.4292, pruned_loss=0.1868, over 4284516.43 frames. ], batch size: 144, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:25:37,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.797e+02 4.922e+02 6.987e+02 1.536e+03, threshold=9.843e+02, percent-clipped=9.0 2023-06-17 20:26:46,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=39240.0, ans=0.2 2023-06-17 20:26:49,842 INFO [train.py:996] (3/4) Epoch 1, batch 6550, loss[loss=0.3794, simple_loss=0.4119, pruned_loss=0.1734, over 21871.00 frames. ], tot_loss[loss=0.3994, simple_loss=0.4287, pruned_loss=0.185, over 4283509.45 frames. ], batch size: 316, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:27:04,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=39300.0, ans=0.05 2023-06-17 20:27:04,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=39300.0, ans=0.2 2023-06-17 20:27:12,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=22.5 2023-06-17 20:27:46,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=39480.0, ans=0.0 2023-06-17 20:28:28,292 INFO [train.py:996] (3/4) Epoch 1, batch 6600, loss[loss=0.3345, simple_loss=0.3569, pruned_loss=0.1561, over 21277.00 frames. ], tot_loss[loss=0.3951, simple_loss=0.4246, pruned_loss=0.1828, over 4267319.69 frames. ], batch size: 144, lr: 3.90e-02, grad_scale: 16.0 2023-06-17 20:28:46,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-17 20:28:49,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-17 20:28:53,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-17 20:28:57,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 4.216e+02 5.009e+02 6.284e+02 1.954e+03, threshold=1.002e+03, percent-clipped=7.0 2023-06-17 20:29:57,695 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=5.226e-03 2023-06-17 20:30:17,376 INFO [train.py:996] (3/4) Epoch 1, batch 6650, loss[loss=0.3129, simple_loss=0.3429, pruned_loss=0.1414, over 21305.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4143, pruned_loss=0.1772, over 4260610.78 frames. ], batch size: 159, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:30:18,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=15.0 2023-06-17 20:30:27,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39900.0, ans=0.125 2023-06-17 20:30:45,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2023-06-17 20:30:53,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-17 20:30:53,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.98 vs. limit=15.0 2023-06-17 20:31:54,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-17 20:31:54,705 INFO [train.py:996] (3/4) Epoch 1, batch 6700, loss[loss=0.3616, simple_loss=0.3876, pruned_loss=0.1678, over 21530.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.4116, pruned_loss=0.1766, over 4259504.10 frames. ], batch size: 230, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:32:24,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.374e+02 3.778e+02 4.910e+02 6.670e+02 1.888e+03, threshold=9.820e+02, percent-clipped=8.0 2023-06-17 20:33:01,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-17 20:33:38,582 INFO [train.py:996] (3/4) Epoch 1, batch 6750, loss[loss=0.4527, simple_loss=0.4489, pruned_loss=0.2283, over 21864.00 frames. 
], tot_loss[loss=0.3814, simple_loss=0.4082, pruned_loss=0.1773, over 4269666.82 frames. ], batch size: 351, lr: 3.88e-02, grad_scale: 16.0 2023-06-17 20:33:55,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=40500.0, ans=0.125 2023-06-17 20:34:04,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=40560.0, ans=0.0 2023-06-17 20:34:08,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=40560.0, ans=0.125 2023-06-17 20:34:11,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=40560.0, ans=0.125 2023-06-17 20:35:21,748 INFO [train.py:996] (3/4) Epoch 1, batch 6800, loss[loss=0.5249, simple_loss=0.586, pruned_loss=0.2319, over 19780.00 frames. ], tot_loss[loss=0.386, simple_loss=0.4104, pruned_loss=0.1807, over 4271515.71 frames. ], batch size: 702, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:35:50,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.135e+02 5.210e+02 7.018e+02 1.112e+03, threshold=1.042e+03, percent-clipped=5.0 2023-06-17 20:35:58,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40920.0, ans=0.125 2023-06-17 20:36:28,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=40980.0, ans=0.125 2023-06-17 20:36:35,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40980.0, ans=0.1 2023-06-17 20:37:02,293 INFO [train.py:996] (3/4) Epoch 1, batch 6850, loss[loss=0.3559, simple_loss=0.374, pruned_loss=0.1689, over 21566.00 frames. ], tot_loss[loss=0.3857, simple_loss=0.4069, pruned_loss=0.1823, over 4270719.81 frames. ], batch size: 263, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:37:05,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=41100.0, ans=0.0 2023-06-17 20:37:09,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-17 20:38:06,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=41280.0, ans=0.125 2023-06-17 20:38:08,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-17 20:38:21,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=41340.0, ans=0.125 2023-06-17 20:38:31,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41340.0, ans=0.125 2023-06-17 20:38:44,863 INFO [train.py:996] (3/4) Epoch 1, batch 6900, loss[loss=0.3729, simple_loss=0.4314, pruned_loss=0.1572, over 21708.00 frames. ], tot_loss[loss=0.3863, simple_loss=0.408, pruned_loss=0.1823, over 4275055.56 frames. 
], batch size: 414, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 20:39:00,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=41400.0, ans=0.07 2023-06-17 20:39:15,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 4.043e+02 5.136e+02 6.723e+02 1.147e+03, threshold=1.027e+03, percent-clipped=4.0 2023-06-17 20:39:23,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=41520.0, ans=0.125 2023-06-17 20:39:25,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=41520.0, ans=0.0 2023-06-17 20:40:05,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=41580.0, ans=0.07 2023-06-17 20:40:12,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=41640.0, ans=0.07 2023-06-17 20:40:33,691 INFO [train.py:996] (3/4) Epoch 1, batch 6950, loss[loss=0.4395, simple_loss=0.4671, pruned_loss=0.206, over 21846.00 frames. ], tot_loss[loss=0.3812, simple_loss=0.4083, pruned_loss=0.1771, over 4281081.34 frames. ], batch size: 118, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:41:06,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41820.0, ans=0.125 2023-06-17 20:41:37,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-17 20:42:09,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=41940.0, ans=0.0017521739130434769 2023-06-17 20:42:15,393 INFO [train.py:996] (3/4) Epoch 1, batch 7000, loss[loss=0.4616, simple_loss=0.4349, pruned_loss=0.2441, over 21340.00 frames. ], tot_loss[loss=0.3884, simple_loss=0.4127, pruned_loss=0.182, over 4279999.84 frames. ], batch size: 508, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:42:40,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.856e+02 5.678e+02 7.793e+02 1.284e+03, threshold=1.136e+03, percent-clipped=9.0 2023-06-17 20:42:49,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-17 20:43:58,195 INFO [train.py:996] (3/4) Epoch 1, batch 7050, loss[loss=0.3349, simple_loss=0.3882, pruned_loss=0.1408, over 21586.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.4082, pruned_loss=0.1784, over 4276216.41 frames. ], batch size: 263, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 20:44:31,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=42360.0, ans=10.0 2023-06-17 20:44:32,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.48 vs. limit=22.5 2023-06-17 20:44:53,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.36 vs. 
limit=22.5 2023-06-17 20:45:16,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=42480.0, ans=0.0016347826086956525 2023-06-17 20:45:19,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-06-17 20:45:27,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-17 20:45:33,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=42540.0, ans=0.5 2023-06-17 20:45:41,222 INFO [train.py:996] (3/4) Epoch 1, batch 7100, loss[loss=0.3348, simple_loss=0.3827, pruned_loss=0.1435, over 21725.00 frames. ], tot_loss[loss=0.3901, simple_loss=0.4159, pruned_loss=0.1822, over 4278927.59 frames. ], batch size: 332, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:46:23,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.499e+02 4.765e+02 6.343e+02 1.936e+03, threshold=9.530e+02, percent-clipped=5.0 2023-06-17 20:46:43,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=42720.0, ans=0.0 2023-06-17 20:46:54,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=42780.0, ans=0.0 2023-06-17 20:47:00,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=42780.0, ans=0.0015695652173913048 2023-06-17 20:47:23,717 INFO [train.py:996] (3/4) Epoch 1, batch 7150, loss[loss=0.4473, simple_loss=0.4645, pruned_loss=0.215, over 21611.00 frames. ], tot_loss[loss=0.3826, simple_loss=0.4109, pruned_loss=0.1772, over 4283604.10 frames. ], batch size: 389, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:47:40,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=42900.0, ans=0.05 2023-06-17 20:48:23,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=15.0 2023-06-17 20:48:47,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43080.0, ans=0.0 2023-06-17 20:49:07,060 INFO [train.py:996] (3/4) Epoch 1, batch 7200, loss[loss=0.3822, simple_loss=0.394, pruned_loss=0.1852, over 21182.00 frames. ], tot_loss[loss=0.39, simple_loss=0.4149, pruned_loss=0.1825, over 4280189.27 frames. ], batch size: 159, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:49:48,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.300e+02 5.251e+02 6.410e+02 9.416e+02, threshold=1.050e+03, percent-clipped=0.0 2023-06-17 20:49:49,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43260.0, ans=0.1 2023-06-17 20:50:05,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=12.0 2023-06-17 20:50:25,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43380.0, ans=0.125 2023-06-17 20:50:37,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=43440.0, ans=0.125 2023-06-17 20:50:48,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=43500.0, ans=0.0014130434782608694 2023-06-17 20:50:49,418 INFO [train.py:996] (3/4) Epoch 1, batch 7250, loss[loss=0.3477, simple_loss=0.3561, pruned_loss=0.1697, over 21237.00 frames. ], tot_loss[loss=0.386, simple_loss=0.4088, pruned_loss=0.1815, over 4280276.62 frames. ], batch size: 549, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:50:56,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43500.0, ans=0.1 2023-06-17 20:52:05,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43680.0, ans=0.0 2023-06-17 20:52:10,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=43740.0, ans=0.125 2023-06-17 20:52:13,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43740.0, ans=0.1 2023-06-17 20:52:31,510 INFO [train.py:996] (3/4) Epoch 1, batch 7300, loss[loss=0.3653, simple_loss=0.3845, pruned_loss=0.1731, over 21364.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.4001, pruned_loss=0.1776, over 4270413.33 frames. ], batch size: 131, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 20:52:50,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=43800.0, ans=0.125 2023-06-17 20:53:07,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.808e+02 5.144e+02 6.713e+02 1.157e+03, threshold=1.029e+03, percent-clipped=4.0 2023-06-17 20:53:21,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43920.0, ans=0.1 2023-06-17 20:53:38,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-17 20:53:56,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=44040.0, ans=15.0 2023-06-17 20:54:25,387 INFO [train.py:996] (3/4) Epoch 1, batch 7350, loss[loss=0.406, simple_loss=0.4175, pruned_loss=0.1973, over 21704.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.3976, pruned_loss=0.1789, over 4272386.24 frames. 
], batch size: 298, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:54:29,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=44100.0, ans=0.125 2023-06-17 20:54:54,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=44160.0, ans=0.2 2023-06-17 20:55:01,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44160.0, ans=0.125 2023-06-17 20:55:11,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-17 20:55:55,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=44340.0, ans=0.125 2023-06-17 20:56:05,309 INFO [train.py:996] (3/4) Epoch 1, batch 7400, loss[loss=0.3699, simple_loss=0.4214, pruned_loss=0.1592, over 21828.00 frames. ], tot_loss[loss=0.3876, simple_loss=0.408, pruned_loss=0.1836, over 4276026.47 frames. ], batch size: 317, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:56:36,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 4.273e+02 5.813e+02 7.639e+02 1.411e+03, threshold=1.163e+03, percent-clipped=7.0 2023-06-17 20:56:52,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.20 vs. limit=22.5 2023-06-17 20:57:42,577 INFO [train.py:996] (3/4) Epoch 1, batch 7450, loss[loss=0.322, simple_loss=0.3512, pruned_loss=0.1464, over 21415.00 frames. ], tot_loss[loss=0.3826, simple_loss=0.4049, pruned_loss=0.1802, over 4280580.33 frames. ], batch size: 195, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 20:57:56,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=44700.0, ans=0.04949747468305833 2023-06-17 20:58:16,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=44760.0, ans=0.125 2023-06-17 20:59:34,479 INFO [train.py:996] (3/4) Epoch 1, batch 7500, loss[loss=0.4568, simple_loss=0.5025, pruned_loss=0.2056, over 21266.00 frames. ], tot_loss[loss=0.39, simple_loss=0.4126, pruned_loss=0.1837, over 4277500.17 frames. ], batch size: 549, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 20:59:40,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=45000.0, ans=0.001086956521739131 2023-06-17 20:59:56,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45060.0, ans=0.125 2023-06-17 21:00:01,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.419e+02 5.234e+02 7.057e+02 1.215e+03, threshold=1.047e+03, percent-clipped=2.0 2023-06-17 21:00:14,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45120.0, ans=0.125 2023-06-17 21:00:19,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=45120.0, ans=0.001060869565217391 2023-06-17 21:00:23,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.85 vs. 
limit=15.0 2023-06-17 21:01:13,068 INFO [train.py:996] (3/4) Epoch 1, batch 7550, loss[loss=0.3441, simple_loss=0.3946, pruned_loss=0.1468, over 21786.00 frames. ], tot_loss[loss=0.3919, simple_loss=0.4205, pruned_loss=0.1816, over 4276780.74 frames. ], batch size: 118, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 21:01:18,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=45300.0, ans=0.125 2023-06-17 21:01:26,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=45300.0, ans=0.125 2023-06-17 21:01:44,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=45420.0, ans=0.0 2023-06-17 21:02:37,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. limit=6.0 2023-06-17 21:02:56,468 INFO [train.py:996] (3/4) Epoch 1, batch 7600, loss[loss=0.3884, simple_loss=0.4125, pruned_loss=0.1821, over 21321.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.4178, pruned_loss=0.179, over 4276975.07 frames. ], batch size: 159, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:03:03,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=45600.0, ans=0.125 2023-06-17 21:03:22,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.546e+02 4.998e+02 6.623e+02 1.459e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-17 21:03:22,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=45660.0, ans=0.0 2023-06-17 21:03:38,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=45720.0, ans=0.0 2023-06-17 21:04:14,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=45780.0, ans=0.0009173913043478265 2023-06-17 21:04:38,379 INFO [train.py:996] (3/4) Epoch 1, batch 7650, loss[loss=0.4002, simple_loss=0.4158, pruned_loss=0.1923, over 21914.00 frames. ], tot_loss[loss=0.3912, simple_loss=0.4174, pruned_loss=0.1825, over 4283528.44 frames. ], batch size: 414, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:05:14,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=46020.0, ans=0.125 2023-06-17 21:06:24,110 INFO [train.py:996] (3/4) Epoch 1, batch 7700, loss[loss=0.3792, simple_loss=0.4007, pruned_loss=0.1788, over 19959.00 frames. ], tot_loss[loss=0.3974, simple_loss=0.4213, pruned_loss=0.1867, over 4290478.80 frames. ], batch size: 702, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 21:06:44,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=46260.0, ans=0.0008130434782608695 2023-06-17 21:06:48,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. 
limit=15.0 2023-06-17 21:06:50,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.909e+02 4.180e+02 5.539e+02 6.663e+02 1.200e+03, threshold=1.108e+03, percent-clipped=4.0 2023-06-17 21:06:51,454 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:07:29,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.10 vs. limit=22.5 2023-06-17 21:07:31,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=46380.0, ans=0.0007869565217391312 2023-06-17 21:07:57,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=46440.0, ans=0.2 2023-06-17 21:08:09,030 INFO [train.py:996] (3/4) Epoch 1, batch 7750, loss[loss=0.392, simple_loss=0.4455, pruned_loss=0.1693, over 21375.00 frames. ], tot_loss[loss=0.3981, simple_loss=0.4253, pruned_loss=0.1854, over 4287009.09 frames. ], batch size: 194, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:08:14,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-17 21:08:15,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=46500.0, ans=0.0 2023-06-17 21:09:53,149 INFO [train.py:996] (3/4) Epoch 1, batch 7800, loss[loss=0.2877, simple_loss=0.3118, pruned_loss=0.1318, over 21238.00 frames. ], tot_loss[loss=0.3997, simple_loss=0.4275, pruned_loss=0.1859, over 4277646.46 frames. ], batch size: 143, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:10:14,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=46860.0, ans=0.0 2023-06-17 21:10:28,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=46860.0, ans=0.0 2023-06-17 21:10:29,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 4.420e+02 5.608e+02 7.244e+02 1.529e+03, threshold=1.122e+03, percent-clipped=4.0 2023-06-17 21:10:59,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46920.0, ans=0.1 2023-06-17 21:11:32,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=47040.0, ans=0.0006434782608695649 2023-06-17 21:11:35,298 INFO [train.py:996] (3/4) Epoch 1, batch 7850, loss[loss=0.35, simple_loss=0.3711, pruned_loss=0.1645, over 21691.00 frames. ], tot_loss[loss=0.3926, simple_loss=0.4189, pruned_loss=0.1831, over 4276942.63 frames. ], batch size: 333, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 21:11:58,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=47160.0, ans=0.125 2023-06-17 21:12:11,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=15.0 2023-06-17 21:12:47,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=47280.0, ans=0.0005913043478260865 2023-06-17 21:13:01,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.27 vs. limit=15.0 2023-06-17 21:13:07,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-17 21:13:19,393 INFO [train.py:996] (3/4) Epoch 1, batch 7900, loss[loss=0.3301, simple_loss=0.3457, pruned_loss=0.1572, over 21879.00 frames. ], tot_loss[loss=0.3889, simple_loss=0.4139, pruned_loss=0.182, over 4266608.69 frames. ], batch size: 98, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:13:30,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.98 vs. limit=15.0 2023-06-17 21:13:56,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 5.315e+02 6.476e+02 7.892e+02 1.492e+03, threshold=1.295e+03, percent-clipped=7.0 2023-06-17 21:14:28,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=47580.0, ans=0.2 2023-06-17 21:14:57,990 INFO [train.py:996] (3/4) Epoch 1, batch 7950, loss[loss=0.4465, simple_loss=0.471, pruned_loss=0.211, over 21695.00 frames. ], tot_loss[loss=0.3924, simple_loss=0.4211, pruned_loss=0.1819, over 4266010.01 frames. ], batch size: 414, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:15:08,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=47700.0, ans=0.125 2023-06-17 21:15:18,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=47700.0, ans=0.035 2023-06-17 21:16:03,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.11 vs. limit=15.0 2023-06-17 21:16:16,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=47880.0, ans=0.125 2023-06-17 21:16:50,544 INFO [train.py:996] (3/4) Epoch 1, batch 8000, loss[loss=0.3656, simple_loss=0.3643, pruned_loss=0.1835, over 20297.00 frames. ], tot_loss[loss=0.3978, simple_loss=0.4244, pruned_loss=0.1856, over 4264572.69 frames. ], batch size: 703, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:17:28,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.757e+02 4.328e+02 5.465e+02 6.460e+02 1.072e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-17 21:18:02,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=48180.0, ans=0.0 2023-06-17 21:18:11,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=48180.0, ans=0.125 2023-06-17 21:18:32,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=48240.0, ans=0.125 2023-06-17 21:18:43,559 INFO [train.py:996] (3/4) Epoch 1, batch 8050, loss[loss=0.3044, simple_loss=0.33, pruned_loss=0.1394, over 21399.00 frames. ], tot_loss[loss=0.3953, simple_loss=0.4234, pruned_loss=0.1836, over 4266066.89 frames. 
], batch size: 131, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:18:47,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=48300.0, ans=0.125 2023-06-17 21:19:17,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=48420.0, ans=0.0 2023-06-17 21:19:18,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.70 vs. limit=22.5 2023-06-17 21:19:24,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48420.0, ans=0.1 2023-06-17 21:20:27,368 INFO [train.py:996] (3/4) Epoch 1, batch 8100, loss[loss=0.3932, simple_loss=0.4078, pruned_loss=0.1893, over 21796.00 frames. ], tot_loss[loss=0.3985, simple_loss=0.426, pruned_loss=0.1855, over 4261303.39 frames. ], batch size: 247, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 21:20:36,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=48600.0, ans=0.0 2023-06-17 21:20:54,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 4.589e+02 6.477e+02 8.462e+02 1.426e+03, threshold=1.295e+03, percent-clipped=5.0 2023-06-17 21:21:18,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=48720.0, ans=0.00027826086956521737 2023-06-17 21:22:10,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=48840.0, ans=0.125 2023-06-17 21:22:14,024 INFO [train.py:996] (3/4) Epoch 1, batch 8150, loss[loss=0.3943, simple_loss=0.4553, pruned_loss=0.1667, over 21818.00 frames. ], tot_loss[loss=0.4051, simple_loss=0.435, pruned_loss=0.1876, over 4260615.14 frames. ], batch size: 372, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:22:54,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=48960.0, ans=0.05 2023-06-17 21:22:58,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-17 21:23:58,526 INFO [train.py:996] (3/4) Epoch 1, batch 8200, loss[loss=0.4157, simple_loss=0.418, pruned_loss=0.2067, over 21619.00 frames. ], tot_loss[loss=0.3988, simple_loss=0.4283, pruned_loss=0.1847, over 4256834.91 frames. ], batch size: 415, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:24:22,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=49260.0, ans=0.2 2023-06-17 21:24:36,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.823e+02 4.911e+02 6.054e+02 7.943e+02 1.649e+03, threshold=1.211e+03, percent-clipped=3.0 2023-06-17 21:25:14,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=49380.0, ans=0.125 2023-06-17 21:25:42,188 INFO [train.py:996] (3/4) Epoch 1, batch 8250, loss[loss=0.4095, simple_loss=0.411, pruned_loss=0.204, over 21298.00 frames. ], tot_loss[loss=0.3985, simple_loss=0.4265, pruned_loss=0.1852, over 4255752.29 frames. 
], batch size: 608, lr: 3.69e-02, grad_scale: 16.0 2023-06-17 21:26:50,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0 2023-06-17 21:26:55,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=49680.0, ans=0.0 2023-06-17 21:27:25,179 INFO [train.py:996] (3/4) Epoch 1, batch 8300, loss[loss=0.3614, simple_loss=0.411, pruned_loss=0.1559, over 21699.00 frames. ], tot_loss[loss=0.3899, simple_loss=0.4209, pruned_loss=0.1795, over 4259524.75 frames. ], batch size: 351, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:27:50,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=49860.0, ans=0.2 2023-06-17 21:28:03,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.951e+02 4.948e+02 6.196e+02 1.080e+03, threshold=9.896e+02, percent-clipped=0.0 2023-06-17 21:28:11,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=49920.0, ans=1.7391304347826736e-05 2023-06-17 21:28:40,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=22.5 2023-06-17 21:28:41,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-17 21:29:09,825 INFO [train.py:996] (3/4) Epoch 1, batch 8350, loss[loss=0.3289, simple_loss=0.3687, pruned_loss=0.1445, over 21799.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.4161, pruned_loss=0.1739, over 4261756.98 frames. ], batch size: 118, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:30:14,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-17 21:30:15,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50280.0, ans=0.1 2023-06-17 21:30:20,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=50280.0, ans=0.125 2023-06-17 21:30:25,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.10 vs. limit=15.0 2023-06-17 21:30:53,940 INFO [train.py:996] (3/4) Epoch 1, batch 8400, loss[loss=0.2122, simple_loss=0.2548, pruned_loss=0.08479, over 16674.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.4113, pruned_loss=0.1692, over 4253150.06 frames. ], batch size: 62, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 21:30:59,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=50400.0, ans=0.125 2023-06-17 21:31:01,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=50400.0, ans=0.125 2023-06-17 21:31:32,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.584e+02 3.530e+02 4.836e+02 6.875e+02 1.901e+03, threshold=9.672e+02, percent-clipped=8.0 2023-06-17 21:32:41,499 INFO [train.py:996] (3/4) Epoch 1, batch 8450, loss[loss=0.431, simple_loss=0.4877, pruned_loss=0.1871, over 20798.00 frames. 
], tot_loss[loss=0.3762, simple_loss=0.4117, pruned_loss=0.1704, over 4262878.67 frames. ], batch size: 607, lr: 3.67e-02, grad_scale: 16.0 2023-06-17 21:32:55,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50700.0, ans=0.125 2023-06-17 21:33:21,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=50820.0, ans=0.125 2023-06-17 21:33:34,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50820.0, ans=0.1 2023-06-17 21:33:46,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-17 21:33:54,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50880.0, ans=0.1 2023-06-17 21:34:06,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.40 vs. limit=22.5 2023-06-17 21:34:13,630 INFO [train.py:996] (3/4) Epoch 1, batch 8500, loss[loss=0.4002, simple_loss=0.3915, pruned_loss=0.2045, over 21505.00 frames. ], tot_loss[loss=0.3758, simple_loss=0.4074, pruned_loss=0.1721, over 4268048.08 frames. ], batch size: 511, lr: 3.66e-02, grad_scale: 16.0 2023-06-17 21:35:00,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.023e+02 4.457e+02 5.550e+02 7.009e+02 1.801e+03, threshold=1.110e+03, percent-clipped=10.0 2023-06-17 21:35:27,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=51180.0, ans=0.0 2023-06-17 21:35:41,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51240.0, ans=0.125 2023-06-17 21:35:49,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=51240.0, ans=0.0 2023-06-17 21:35:51,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=51240.0, ans=0.2 2023-06-17 21:36:04,099 INFO [train.py:996] (3/4) Epoch 1, batch 8550, loss[loss=0.399, simple_loss=0.4521, pruned_loss=0.1729, over 21843.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.4119, pruned_loss=0.1754, over 4271380.11 frames. ], batch size: 371, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:36:44,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51360.0, ans=0.1 2023-06-17 21:37:04,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-17 21:37:14,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-17 21:37:56,909 INFO [train.py:996] (3/4) Epoch 1, batch 8600, loss[loss=0.4534, simple_loss=0.4613, pruned_loss=0.2228, over 21343.00 frames. ], tot_loss[loss=0.3888, simple_loss=0.4208, pruned_loss=0.1784, over 4275197.26 frames. 
], batch size: 548, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:38:13,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.89 vs. limit=15.0 2023-06-17 21:38:38,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.423e+02 4.881e+02 5.851e+02 7.697e+02 1.206e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-17 21:38:52,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=51720.0, ans=0.04949747468305833 2023-06-17 21:39:47,596 INFO [train.py:996] (3/4) Epoch 1, batch 8650, loss[loss=0.326, simple_loss=0.3965, pruned_loss=0.1277, over 21823.00 frames. ], tot_loss[loss=0.3961, simple_loss=0.4303, pruned_loss=0.181, over 4276204.42 frames. ], batch size: 316, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:40:19,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51960.0, ans=0.1 2023-06-17 21:41:01,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=52140.0, ans=0.0 2023-06-17 21:41:24,306 INFO [train.py:996] (3/4) Epoch 1, batch 8700, loss[loss=0.3306, simple_loss=0.3608, pruned_loss=0.1502, over 21389.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.4241, pruned_loss=0.1756, over 4267715.68 frames. ], batch size: 131, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:41:33,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=52200.0, ans=0.0 2023-06-17 21:42:04,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.951e+02 4.948e+02 6.720e+02 1.137e+03, threshold=9.897e+02, percent-clipped=0.0 2023-06-17 21:42:06,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=52320.0, ans=0.0 2023-06-17 21:42:15,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52320.0, ans=0.1 2023-06-17 21:42:29,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=52380.0, ans=0.0 2023-06-17 21:42:30,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=52380.0, ans=0.0 2023-06-17 21:43:07,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-17 21:43:13,259 INFO [train.py:996] (3/4) Epoch 1, batch 8750, loss[loss=0.398, simple_loss=0.4147, pruned_loss=0.1906, over 21832.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.4192, pruned_loss=0.1783, over 4279009.07 frames. ], batch size: 282, lr: 3.63e-02, grad_scale: 16.0 2023-06-17 21:43:53,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52620.0, ans=0.1 2023-06-17 21:44:04,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.70 vs. limit=22.5 2023-06-17 21:45:03,160 INFO [train.py:996] (3/4) Epoch 1, batch 8800, loss[loss=0.4528, simple_loss=0.4677, pruned_loss=0.2189, over 21191.00 frames. ], tot_loss[loss=0.3982, simple_loss=0.4285, pruned_loss=0.184, over 4277661.70 frames. 
], batch size: 143, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:45:29,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=52860.0, ans=0.05 2023-06-17 21:45:32,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 5.221e+02 6.385e+02 9.121e+02 2.025e+03, threshold=1.277e+03, percent-clipped=20.0 2023-06-17 21:45:59,658 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:46:06,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52980.0, ans=0.125 2023-06-17 21:46:49,125 INFO [train.py:996] (3/4) Epoch 1, batch 8850, loss[loss=0.3255, simple_loss=0.3737, pruned_loss=0.1387, over 21179.00 frames. ], tot_loss[loss=0.4035, simple_loss=0.4357, pruned_loss=0.1857, over 4268320.62 frames. ], batch size: 143, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:46:57,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=53100.0, ans=0.0 2023-06-17 21:47:09,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-17 21:47:20,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53220.0, ans=0.1 2023-06-17 21:48:13,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=53340.0, ans=0.05 2023-06-17 21:48:33,209 INFO [train.py:996] (3/4) Epoch 1, batch 8900, loss[loss=0.393, simple_loss=0.4068, pruned_loss=0.1896, over 21783.00 frames. ], tot_loss[loss=0.4007, simple_loss=0.4311, pruned_loss=0.1852, over 4265016.97 frames. ], batch size: 102, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:48:42,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=53400.0, ans=0.04949747468305833 2023-06-17 21:49:10,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.829e+02 5.153e+02 6.429e+02 1.062e+03, threshold=1.031e+03, percent-clipped=0.0 2023-06-17 21:50:04,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=53640.0, ans=0.125 2023-06-17 21:50:19,516 INFO [train.py:996] (3/4) Epoch 1, batch 8950, loss[loss=0.4282, simple_loss=0.4488, pruned_loss=0.2038, over 21635.00 frames. ], tot_loss[loss=0.3983, simple_loss=0.4321, pruned_loss=0.1822, over 4253402.12 frames. ], batch size: 414, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:50:49,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=53760.0, ans=0.05 2023-06-17 21:50:54,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.06 vs. 
limit=15.0 2023-06-17 21:51:01,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53820.0, ans=0.125 2023-06-17 21:51:38,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=53880.0, ans=0.125 2023-06-17 21:52:04,303 INFO [train.py:996] (3/4) Epoch 1, batch 9000, loss[loss=0.3514, simple_loss=0.381, pruned_loss=0.1608, over 21905.00 frames. ], tot_loss[loss=0.3928, simple_loss=0.4237, pruned_loss=0.1809, over 4255930.00 frames. ], batch size: 107, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 21:52:04,304 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 21:52:23,527 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3404, simple_loss=0.4251, pruned_loss=0.1278, over 1796401.00 frames. 2023-06-17 21:52:23,528 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-17 21:53:06,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.196e+02 5.638e+02 6.877e+02 1.385e+03, threshold=1.128e+03, percent-clipped=3.0 2023-06-17 21:53:15,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=54120.0, ans=0.0 2023-06-17 21:54:04,846 INFO [train.py:996] (3/4) Epoch 1, batch 9050, loss[loss=0.4808, simple_loss=0.4965, pruned_loss=0.2325, over 21807.00 frames. ], tot_loss[loss=0.3845, simple_loss=0.4183, pruned_loss=0.1753, over 4249458.54 frames. ], batch size: 118, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:55:00,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=54420.0, ans=0.125 2023-06-17 21:55:02,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=54420.0, ans=0.0 2023-06-17 21:55:11,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. limit=6.0 2023-06-17 21:55:22,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-17 21:55:32,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54540.0, ans=0.1 2023-06-17 21:55:45,377 INFO [train.py:996] (3/4) Epoch 1, batch 9100, loss[loss=0.362, simple_loss=0.4206, pruned_loss=0.1517, over 21696.00 frames. ], tot_loss[loss=0.3921, simple_loss=0.425, pruned_loss=0.1795, over 4250326.76 frames. ], batch size: 298, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:56:31,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.753e+02 4.940e+02 6.814e+02 2.174e+03, threshold=9.881e+02, percent-clipped=7.0 2023-06-17 21:56:33,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=54720.0, ans=0.0 2023-06-17 21:56:36,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=54720.0, ans=0.125 2023-06-17 21:57:17,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. 
limit=15.0 2023-06-17 21:57:35,303 INFO [train.py:996] (3/4) Epoch 1, batch 9150, loss[loss=0.3804, simple_loss=0.4358, pruned_loss=0.1625, over 21799.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4229, pruned_loss=0.1729, over 4259945.87 frames. ], batch size: 351, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:57:37,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54900.0, ans=0.1 2023-06-17 21:58:00,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=54900.0, ans=0.2 2023-06-17 21:58:00,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=54900.0, ans=0.125 2023-06-17 21:58:09,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-17 21:58:16,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=54960.0, ans=0.05 2023-06-17 21:59:29,967 INFO [train.py:996] (3/4) Epoch 1, batch 9200, loss[loss=0.4077, simple_loss=0.4384, pruned_loss=0.1885, over 21294.00 frames. ], tot_loss[loss=0.3856, simple_loss=0.4259, pruned_loss=0.1726, over 4266355.44 frames. ], batch size: 176, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:59:41,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=55200.0, ans=0.04949747468305833 2023-06-17 21:59:44,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=55260.0, ans=0.125 2023-06-17 21:59:59,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.127e+02 5.458e+02 7.694e+02 1.391e+03, threshold=1.092e+03, percent-clipped=9.0 2023-06-17 22:00:04,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55320.0, ans=0.125 2023-06-17 22:01:13,171 INFO [train.py:996] (3/4) Epoch 1, batch 9250, loss[loss=0.3886, simple_loss=0.4089, pruned_loss=0.1841, over 21853.00 frames. ], tot_loss[loss=0.3958, simple_loss=0.4314, pruned_loss=0.1801, over 4267389.77 frames. ], batch size: 118, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:01:20,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=55500.0, ans=0.2 2023-06-17 22:02:41,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=55740.0, ans=0.0 2023-06-17 22:02:58,736 INFO [train.py:996] (3/4) Epoch 1, batch 9300, loss[loss=0.4572, simple_loss=0.4693, pruned_loss=0.2225, over 20619.00 frames. ], tot_loss[loss=0.3922, simple_loss=0.4241, pruned_loss=0.1801, over 4259622.37 frames. ], batch size: 607, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:03:28,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 4.090e+02 5.143e+02 6.278e+02 1.452e+03, threshold=1.029e+03, percent-clipped=2.0 2023-06-17 22:03:39,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=55920.0, ans=0.125 2023-06-17 22:03:44,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.66 vs. 
limit=22.5 2023-06-17 22:04:44,705 INFO [train.py:996] (3/4) Epoch 1, batch 9350, loss[loss=0.4163, simple_loss=0.4499, pruned_loss=0.1913, over 21629.00 frames. ], tot_loss[loss=0.4, simple_loss=0.4328, pruned_loss=0.1836, over 4262771.00 frames. ], batch size: 230, lr: 3.56e-02, grad_scale: 32.0 2023-06-17 22:05:37,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=56220.0, ans=0.2 2023-06-17 22:05:37,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=56220.0, ans=0.125 2023-06-17 22:06:29,236 INFO [train.py:996] (3/4) Epoch 1, batch 9400, loss[loss=0.4082, simple_loss=0.4446, pruned_loss=0.1859, over 21716.00 frames. ], tot_loss[loss=0.4013, simple_loss=0.4341, pruned_loss=0.1843, over 4262657.16 frames. ], batch size: 332, lr: 3.55e-02, grad_scale: 32.0 2023-06-17 22:06:29,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=56400.0, ans=0.125 2023-06-17 22:07:03,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 4.762e+02 5.781e+02 7.006e+02 1.289e+03, threshold=1.156e+03, percent-clipped=1.0 2023-06-17 22:07:40,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=56580.0, ans=0.125 2023-06-17 22:08:04,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-17 22:08:11,245 INFO [train.py:996] (3/4) Epoch 1, batch 9450, loss[loss=0.3348, simple_loss=0.3634, pruned_loss=0.1531, over 21213.00 frames. ], tot_loss[loss=0.3923, simple_loss=0.4231, pruned_loss=0.1808, over 4257202.63 frames. ], batch size: 159, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 22:08:17,941 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:08:46,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=56820.0, ans=0.125 2023-06-17 22:09:39,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56940.0, ans=0.1 2023-06-17 22:09:52,849 INFO [train.py:996] (3/4) Epoch 1, batch 9500, loss[loss=0.3373, simple_loss=0.3698, pruned_loss=0.1524, over 21763.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.4133, pruned_loss=0.1758, over 4262985.35 frames. ], batch size: 112, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:09:54,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=57000.0, ans=0.2 2023-06-17 22:10:16,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57060.0, ans=0.1 2023-06-17 22:10:35,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.522e+02 3.981e+02 4.935e+02 6.509e+02 1.656e+03, threshold=9.871e+02, percent-clipped=4.0 2023-06-17 22:11:35,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=57300.0, ans=0.04949747468305833 2023-06-17 22:11:37,256 INFO [train.py:996] (3/4) Epoch 1, batch 9550, loss[loss=0.4846, simple_loss=0.4836, pruned_loss=0.2428, over 21584.00 frames. 
], tot_loss[loss=0.3897, simple_loss=0.4186, pruned_loss=0.1804, over 4269246.44 frames. ], batch size: 389, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:11:48,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.89 vs. limit=12.0 2023-06-17 22:12:04,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=57360.0, ans=0.125 2023-06-17 22:12:42,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=57480.0, ans=0.0 2023-06-17 22:13:08,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.38 vs. limit=6.0 2023-06-17 22:13:13,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-17 22:13:22,029 INFO [train.py:996] (3/4) Epoch 1, batch 9600, loss[loss=0.399, simple_loss=0.4187, pruned_loss=0.1896, over 17595.00 frames. ], tot_loss[loss=0.3962, simple_loss=0.424, pruned_loss=0.1843, over 4272120.33 frames. ], batch size: 60, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:13:28,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=57600.0, ans=15.0 2023-06-17 22:13:28,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=57600.0, ans=0.125 2023-06-17 22:14:08,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.156e+02 5.294e+02 7.045e+02 1.358e+03, threshold=1.059e+03, percent-clipped=6.0 2023-06-17 22:14:33,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=57780.0, ans=0.0 2023-06-17 22:14:33,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.73 vs. limit=22.5 2023-06-17 22:14:50,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-17 22:15:06,090 INFO [train.py:996] (3/4) Epoch 1, batch 9650, loss[loss=0.4215, simple_loss=0.4399, pruned_loss=0.2015, over 21333.00 frames. ], tot_loss[loss=0.3957, simple_loss=0.424, pruned_loss=0.1837, over 4277670.73 frames. ], batch size: 176, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:16:25,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=58080.0, ans=0.125 2023-06-17 22:16:43,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=58140.0, ans=0.0 2023-06-17 22:16:44,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=15.0 2023-06-17 22:16:49,787 INFO [train.py:996] (3/4) Epoch 1, batch 9700, loss[loss=0.3724, simple_loss=0.391, pruned_loss=0.177, over 21273.00 frames. ], tot_loss[loss=0.3959, simple_loss=0.4261, pruned_loss=0.1829, over 4279146.96 frames. 
], batch size: 159, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 22:17:37,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.787e+02 4.137e+02 5.402e+02 6.942e+02 1.239e+03, threshold=1.080e+03, percent-clipped=2.0 2023-06-17 22:18:06,172 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:18:19,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=58440.0, ans=0.2 2023-06-17 22:18:31,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=58500.0, ans=0.2 2023-06-17 22:18:33,463 INFO [train.py:996] (3/4) Epoch 1, batch 9750, loss[loss=0.3791, simple_loss=0.3836, pruned_loss=0.1873, over 21206.00 frames. ], tot_loss[loss=0.3891, simple_loss=0.4167, pruned_loss=0.1807, over 4275349.88 frames. ], batch size: 471, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 22:18:42,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=58500.0, ans=0.07 2023-06-17 22:18:46,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=58500.0, ans=0.125 2023-06-17 22:20:15,535 INFO [train.py:996] (3/4) Epoch 1, batch 9800, loss[loss=0.3561, simple_loss=0.3894, pruned_loss=0.1615, over 21657.00 frames. ], tot_loss[loss=0.3862, simple_loss=0.415, pruned_loss=0.1787, over 4269120.49 frames. ], batch size: 263, lr: 3.51e-02, grad_scale: 16.0 2023-06-17 22:20:34,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-17 22:21:03,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.740e+02 5.847e+02 8.148e+02 2.070e+03, threshold=1.169e+03, percent-clipped=10.0 2023-06-17 22:21:41,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=59040.0, ans=0.2 2023-06-17 22:21:44,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59040.0, ans=0.1 2023-06-17 22:21:56,342 INFO [train.py:996] (3/4) Epoch 1, batch 9850, loss[loss=0.4317, simple_loss=0.457, pruned_loss=0.2032, over 20742.00 frames. ], tot_loss[loss=0.384, simple_loss=0.4117, pruned_loss=0.1782, over 4271729.20 frames. ], batch size: 607, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:22:03,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=59100.0, ans=0.125 2023-06-17 22:23:09,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.58 vs. limit=5.0 2023-06-17 22:23:10,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=59280.0, ans=0.0 2023-06-17 22:23:12,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59280.0, ans=0.1 2023-06-17 22:23:15,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-17 22:23:39,140 INFO [train.py:996] (3/4) Epoch 1, batch 9900, loss[loss=0.3948, simple_loss=0.4223, pruned_loss=0.1836, over 21393.00 frames. 
], tot_loss[loss=0.3801, simple_loss=0.407, pruned_loss=0.1766, over 4268045.61 frames. ], batch size: 211, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:24:19,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-17 22:24:28,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.114e+02 4.325e+02 5.228e+02 6.725e+02 1.103e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-17 22:25:22,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=59700.0, ans=0.125 2023-06-17 22:25:23,896 INFO [train.py:996] (3/4) Epoch 1, batch 9950, loss[loss=0.383, simple_loss=0.3962, pruned_loss=0.1849, over 22017.00 frames. ], tot_loss[loss=0.3827, simple_loss=0.408, pruned_loss=0.1787, over 4277290.47 frames. ], batch size: 375, lr: 3.49e-02, grad_scale: 16.0 2023-06-17 22:25:24,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=59700.0, ans=0.125 2023-06-17 22:25:44,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=59700.0, ans=0.2 2023-06-17 22:25:46,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59700.0, ans=0.125 2023-06-17 22:26:08,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=59760.0, ans=0.125 2023-06-17 22:26:09,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=59760.0, ans=0.125 2023-06-17 22:26:43,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=59880.0, ans=0.125 2023-06-17 22:27:12,370 INFO [train.py:996] (3/4) Epoch 1, batch 10000, loss[loss=0.3971, simple_loss=0.4215, pruned_loss=0.1864, over 21636.00 frames. ], tot_loss[loss=0.3789, simple_loss=0.4051, pruned_loss=0.1763, over 4275606.22 frames. ], batch size: 351, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 22:27:21,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=60000.0, ans=0.07 2023-06-17 22:27:22,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-17 22:27:25,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=60000.0, ans=0.04949747468305833 2023-06-17 22:28:03,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.788e+02 4.452e+02 5.196e+02 6.727e+02 1.360e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-17 22:28:11,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=60120.0, ans=0.125 2023-06-17 22:28:13,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60120.0, ans=0.1 2023-06-17 22:28:33,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=60180.0, ans=0.0 2023-06-17 22:29:04,504 INFO [train.py:996] (3/4) Epoch 1, batch 10050, loss[loss=0.3276, simple_loss=0.3633, pruned_loss=0.146, over 21250.00 frames. 
], tot_loss[loss=0.3827, simple_loss=0.4083, pruned_loss=0.1785, over 4271336.70 frames. ], batch size: 159, lr: 3.48e-02, grad_scale: 32.0 2023-06-17 22:29:07,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=60300.0, ans=0.0 2023-06-17 22:29:08,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=60300.0, ans=0.125 2023-06-17 22:29:20,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-17 22:29:27,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60360.0, ans=0.1 2023-06-17 22:30:35,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=60540.0, ans=0.2 2023-06-17 22:30:54,300 INFO [train.py:996] (3/4) Epoch 1, batch 10100, loss[loss=0.388, simple_loss=0.428, pruned_loss=0.174, over 21707.00 frames. ], tot_loss[loss=0.3775, simple_loss=0.4056, pruned_loss=0.1747, over 4274758.34 frames. ], batch size: 351, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:31:25,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.20 vs. limit=6.0 2023-06-17 22:31:32,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.854e+02 4.188e+02 5.288e+02 6.297e+02 1.348e+03, threshold=1.058e+03, percent-clipped=5.0 2023-06-17 22:31:35,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-17 22:31:36,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=60720.0, ans=0.125 2023-06-17 22:31:44,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=60720.0, ans=0.0 2023-06-17 22:32:37,659 INFO [train.py:996] (3/4) Epoch 1, batch 10150, loss[loss=0.3833, simple_loss=0.4068, pruned_loss=0.1799, over 21682.00 frames. ], tot_loss[loss=0.3829, simple_loss=0.4105, pruned_loss=0.1777, over 4263606.00 frames. ], batch size: 247, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:33:11,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=60960.0, ans=0.2 2023-06-17 22:33:42,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.21 vs. limit=10.0 2023-06-17 22:34:22,084 INFO [train.py:996] (3/4) Epoch 1, batch 10200, loss[loss=0.3114, simple_loss=0.3776, pruned_loss=0.1226, over 21725.00 frames. ], tot_loss[loss=0.3745, simple_loss=0.4057, pruned_loss=0.1716, over 4270859.14 frames. ], batch size: 332, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:34:47,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. 
limit=6.0 2023-06-17 22:35:01,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.806e+02 4.726e+02 6.535e+02 1.145e+03, threshold=9.453e+02, percent-clipped=1.0 2023-06-17 22:36:03,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61440.0, ans=0.1 2023-06-17 22:36:11,353 INFO [train.py:996] (3/4) Epoch 1, batch 10250, loss[loss=0.4067, simple_loss=0.4387, pruned_loss=0.1873, over 21615.00 frames. ], tot_loss[loss=0.364, simple_loss=0.4005, pruned_loss=0.1637, over 4262234.33 frames. ], batch size: 389, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:36:18,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61500.0, ans=0.1 2023-06-17 22:36:25,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=23.05 vs. limit=15.0 2023-06-17 22:36:38,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=61560.0, ans=0.2 2023-06-17 22:37:18,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=61680.0, ans=0.125 2023-06-17 22:37:44,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=61740.0, ans=0.125 2023-06-17 22:37:58,244 INFO [train.py:996] (3/4) Epoch 1, batch 10300, loss[loss=0.4849, simple_loss=0.5169, pruned_loss=0.2264, over 21691.00 frames. ], tot_loss[loss=0.3717, simple_loss=0.408, pruned_loss=0.1677, over 4270556.31 frames. ], batch size: 441, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:38:05,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=61800.0, ans=0.125 2023-06-17 22:38:12,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=61800.0, ans=0.1 2023-06-17 22:38:29,364 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:38:32,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=61860.0, ans=0.125 2023-06-17 22:38:39,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 4.187e+02 5.738e+02 8.381e+02 2.086e+03, threshold=1.148e+03, percent-clipped=17.0 2023-06-17 22:39:13,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=61980.0, ans=0.125 2023-06-17 22:39:18,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61980.0, ans=0.125 2023-06-17 22:39:43,535 INFO [train.py:996] (3/4) Epoch 1, batch 10350, loss[loss=0.3113, simple_loss=0.345, pruned_loss=0.1388, over 21393.00 frames. ], tot_loss[loss=0.3711, simple_loss=0.4079, pruned_loss=0.1672, over 4267559.50 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:40:28,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.07 vs. 
limit=22.5 2023-06-17 22:41:08,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=39.83 vs. limit=15.0 2023-06-17 22:41:19,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=62340.0, ans=0.1 2023-06-17 22:41:29,001 INFO [train.py:996] (3/4) Epoch 1, batch 10400, loss[loss=0.3412, simple_loss=0.3987, pruned_loss=0.1419, over 21279.00 frames. ], tot_loss[loss=0.36, simple_loss=0.3966, pruned_loss=0.1617, over 4265004.19 frames. ], batch size: 551, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:41:50,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=62460.0, ans=0.2 2023-06-17 22:41:55,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=62460.0, ans=0.125 2023-06-17 22:42:19,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.646e+02 4.942e+02 6.227e+02 1.303e+03, threshold=9.884e+02, percent-clipped=2.0 2023-06-17 22:42:55,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=62640.0, ans=0.125 2023-06-17 22:43:18,604 INFO [train.py:996] (3/4) Epoch 1, batch 10450, loss[loss=0.4152, simple_loss=0.4446, pruned_loss=0.1929, over 21707.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.4033, pruned_loss=0.1677, over 4262900.37 frames. ], batch size: 247, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:44:00,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-17 22:44:24,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=62880.0, ans=0.2 2023-06-17 22:44:43,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-17 22:45:02,369 INFO [train.py:996] (3/4) Epoch 1, batch 10500, loss[loss=0.3505, simple_loss=0.3829, pruned_loss=0.1591, over 21436.00 frames. ], tot_loss[loss=0.3663, simple_loss=0.401, pruned_loss=0.1658, over 4261066.20 frames. ], batch size: 389, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:45:48,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.889e+02 5.007e+02 6.898e+02 1.631e+03, threshold=1.001e+03, percent-clipped=5.0 2023-06-17 22:46:45,930 INFO [train.py:996] (3/4) Epoch 1, batch 10550, loss[loss=0.3374, simple_loss=0.3606, pruned_loss=0.1572, over 21921.00 frames. ], tot_loss[loss=0.3647, simple_loss=0.3962, pruned_loss=0.1666, over 4257595.05 frames. ], batch size: 373, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:47:27,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=63420.0, ans=0.125 2023-06-17 22:47:53,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=63480.0, ans=0.125 2023-06-17 22:48:29,735 INFO [train.py:996] (3/4) Epoch 1, batch 10600, loss[loss=0.4057, simple_loss=0.4587, pruned_loss=0.1764, over 21408.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3914, pruned_loss=0.1635, over 4254328.87 frames. 
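The scaling.py:182 lines track ScheduledFloat values: module hyperparameters (dropout rates, balancer probabilities, skip rates, whitening limits) whose value is a function of batch_count rather than a constant, which is why the same name is re-logged with a new ans as training advances. Below is a minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the class name and the example breakpoints are illustrative only.

    class ScheduledFloatSketch:
        """A float that follows a piecewise-linear schedule over batch count."""

        def __init__(self, *points):          # points: (batch_count, value) pairs
            self.points = sorted(points)
            self.batch_count = 0.0

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            if self.batch_count >= pts[-1][0]:
                return float(pts[-1][1])
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))    # linear interpolation

    # e.g. a probability decaying from 0.5 at batch 0 to the 0.125 seen above
    prob = ScheduledFloatSketch((0.0, 0.5), (4000.0, 0.125))
    prob.batch_count = 63660.0
    assert float(prob) == 0.125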
], batch size: 507, lr: 3.42e-02, grad_scale: 32.0 2023-06-17 22:49:00,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=63660.0, ans=0.2 2023-06-17 22:49:22,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.854e+02 4.619e+02 6.310e+02 1.881e+03, threshold=9.238e+02, percent-clipped=9.0 2023-06-17 22:50:28,322 INFO [train.py:996] (3/4) Epoch 1, batch 10650, loss[loss=0.3421, simple_loss=0.4111, pruned_loss=0.1366, over 21191.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3926, pruned_loss=0.1603, over 4248758.10 frames. ], batch size: 549, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:50:35,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=22.5 2023-06-17 22:50:45,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-17 22:51:42,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64080.0, ans=0.125 2023-06-17 22:52:14,283 INFO [train.py:996] (3/4) Epoch 1, batch 10700, loss[loss=0.4447, simple_loss=0.4722, pruned_loss=0.2086, over 21460.00 frames. ], tot_loss[loss=0.3607, simple_loss=0.3949, pruned_loss=0.1632, over 4250686.94 frames. ], batch size: 131, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:52:52,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=64320.0, ans=0.125 2023-06-17 22:52:55,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.608e+02 4.131e+02 5.113e+02 6.555e+02 1.006e+03, threshold=1.023e+03, percent-clipped=2.0 2023-06-17 22:53:03,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=64320.0, ans=0.125 2023-06-17 22:53:36,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=64440.0, ans=0.04949747468305833 2023-06-17 22:53:55,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64440.0, ans=0.1 2023-06-17 22:53:59,504 INFO [train.py:996] (3/4) Epoch 1, batch 10750, loss[loss=0.3772, simple_loss=0.3824, pruned_loss=0.186, over 20025.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.4072, pruned_loss=0.1702, over 4253583.62 frames. ], batch size: 702, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:54:14,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64500.0, ans=0.1 2023-06-17 22:54:26,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.51 vs. limit=15.0 2023-06-17 22:54:45,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=64620.0, ans=0.0 2023-06-17 22:55:31,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64740.0, ans=0.1 2023-06-17 22:55:49,308 INFO [train.py:996] (3/4) Epoch 1, batch 10800, loss[loss=0.4208, simple_loss=0.4428, pruned_loss=0.1993, over 20653.00 frames. 
], tot_loss[loss=0.3773, simple_loss=0.4113, pruned_loss=0.1716, over 4253681.14 frames. ], batch size: 609, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:56:27,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=64920.0, ans=0.0 2023-06-17 22:56:30,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.758e+02 4.502e+02 5.308e+02 7.377e+02 1.430e+03, threshold=1.062e+03, percent-clipped=5.0 2023-06-17 22:56:34,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=64920.0, ans=0.125 2023-06-17 22:56:45,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-17 22:57:00,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-17 22:57:33,928 INFO [train.py:996] (3/4) Epoch 1, batch 10850, loss[loss=0.391, simple_loss=0.45, pruned_loss=0.166, over 20795.00 frames. ], tot_loss[loss=0.3778, simple_loss=0.4127, pruned_loss=0.1715, over 4251986.99 frames. ], batch size: 607, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:57:42,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=65100.0, ans=0.125 2023-06-17 22:57:47,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65100.0, ans=0.125 2023-06-17 22:58:16,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=65220.0, ans=0.125 2023-06-17 22:58:18,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65220.0, ans=0.125 2023-06-17 22:58:30,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=65220.0, ans=0.2 2023-06-17 22:58:34,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=65280.0, ans=15.0 2023-06-17 22:58:44,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=65280.0, ans=0.125 2023-06-17 22:58:45,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=65280.0, ans=0.0 2023-06-17 22:59:17,687 INFO [train.py:996] (3/4) Epoch 1, batch 10900, loss[loss=0.3152, simple_loss=0.3852, pruned_loss=0.1226, over 21745.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4069, pruned_loss=0.169, over 4248243.78 frames. ], batch size: 282, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:59:21,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=65400.0, ans=0.2 2023-06-17 22:59:59,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.489e+02 3.764e+02 4.430e+02 5.513e+02 1.224e+03, threshold=8.861e+02, percent-clipped=2.0 2023-06-17 23:00:38,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.09 vs. 
limit=22.5 2023-06-17 23:00:40,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-17 23:01:01,594 INFO [train.py:996] (3/4) Epoch 1, batch 10950, loss[loss=0.3817, simple_loss=0.3909, pruned_loss=0.1862, over 21534.00 frames. ], tot_loss[loss=0.3664, simple_loss=0.4014, pruned_loss=0.1657, over 4243693.62 frames. ], batch size: 414, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:01:38,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-17 23:01:55,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=65820.0, ans=10.0 2023-06-17 23:02:01,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=65880.0, ans=0.125 2023-06-17 23:02:44,521 INFO [train.py:996] (3/4) Epoch 1, batch 11000, loss[loss=0.4291, simple_loss=0.4437, pruned_loss=0.2072, over 21368.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.4007, pruned_loss=0.1669, over 4258110.92 frames. ], batch size: 159, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:03:20,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.42 vs. limit=10.0 2023-06-17 23:03:25,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 4.369e+02 5.427e+02 7.022e+02 1.248e+03, threshold=1.085e+03, percent-clipped=10.0 2023-06-17 23:03:45,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66120.0, ans=0.1 2023-06-17 23:04:10,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=66240.0, ans=0.125 2023-06-17 23:04:12,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-17 23:04:13,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.10 vs. limit=15.0 2023-06-17 23:04:15,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=66240.0, ans=0.0 2023-06-17 23:04:27,709 INFO [train.py:996] (3/4) Epoch 1, batch 11050, loss[loss=0.3045, simple_loss=0.3306, pruned_loss=0.1392, over 20763.00 frames. ], tot_loss[loss=0.3674, simple_loss=0.3981, pruned_loss=0.1684, over 4261995.49 frames. ], batch size: 608, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:04:32,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=66300.0, ans=0.125 2023-06-17 23:04:36,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=15.0 2023-06-17 23:04:49,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66360.0, ans=0.1 2023-06-17 23:05:04,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66420.0, ans=0.125 2023-06-17 23:05:08,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=12.0 2023-06-17 23:06:11,009 INFO [train.py:996] (3/4) Epoch 1, batch 11100, loss[loss=0.3335, simple_loss=0.3606, pruned_loss=0.1532, over 21378.00 frames. ], tot_loss[loss=0.3649, simple_loss=0.3947, pruned_loss=0.1676, over 4258095.83 frames. ], batch size: 131, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:06:15,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=66600.0, ans=0.125 2023-06-17 23:06:31,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=66660.0, ans=0.0 2023-06-17 23:06:58,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 3.924e+02 4.981e+02 6.262e+02 1.185e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-17 23:07:12,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=66720.0, ans=0.125 2023-06-17 23:07:19,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=66780.0, ans=0.5 2023-06-17 23:07:43,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66840.0, ans=0.125 2023-06-17 23:07:45,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-17 23:07:55,920 INFO [train.py:996] (3/4) Epoch 1, batch 11150, loss[loss=0.3138, simple_loss=0.3536, pruned_loss=0.137, over 15355.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.3922, pruned_loss=0.167, over 4258244.19 frames. ], batch size: 61, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:07:59,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=66900.0, ans=0.125 2023-06-17 23:08:49,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=67020.0, ans=0.2 2023-06-17 23:09:04,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=22.5 2023-06-17 23:09:38,646 INFO [train.py:996] (3/4) Epoch 1, batch 11200, loss[loss=0.4295, simple_loss=0.4329, pruned_loss=0.2131, over 21491.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.3909, pruned_loss=0.1649, over 4254591.80 frames. ], batch size: 441, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:09:54,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.80 vs. 
limit=22.5 2023-06-17 23:10:05,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=67260.0, ans=0.0 2023-06-17 23:10:25,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 3.969e+02 4.814e+02 6.139e+02 9.199e+02, threshold=9.628e+02, percent-clipped=0.0 2023-06-17 23:10:35,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-17 23:10:47,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=67380.0, ans=0.2 2023-06-17 23:11:21,138 INFO [train.py:996] (3/4) Epoch 1, batch 11250, loss[loss=0.3436, simple_loss=0.3933, pruned_loss=0.147, over 21332.00 frames. ], tot_loss[loss=0.3608, simple_loss=0.3908, pruned_loss=0.1655, over 4253388.03 frames. ], batch size: 131, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:11:23,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=67500.0, ans=0.2 2023-06-17 23:12:13,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67620.0, ans=0.125 2023-06-17 23:12:42,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67740.0, ans=0.1 2023-06-17 23:13:01,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67740.0, ans=0.125 2023-06-17 23:13:04,109 INFO [train.py:996] (3/4) Epoch 1, batch 11300, loss[loss=0.3845, simple_loss=0.4177, pruned_loss=0.1756, over 21505.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.3919, pruned_loss=0.166, over 4263897.48 frames. ], batch size: 471, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:13:17,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=67800.0, ans=0.125 2023-06-17 23:13:28,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=12.0 2023-06-17 23:13:32,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=67860.0, ans=0.125 2023-06-17 23:13:32,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=67860.0, ans=0.2 2023-06-17 23:13:51,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.668e+02 4.732e+02 6.264e+02 1.219e+03, threshold=9.465e+02, percent-clipped=6.0 2023-06-17 23:14:06,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=67920.0, ans=0.125 2023-06-17 23:14:18,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=67980.0, ans=0.0 2023-06-17 23:14:49,552 INFO [train.py:996] (3/4) Epoch 1, batch 11350, loss[loss=0.4361, simple_loss=0.4995, pruned_loss=0.1863, over 20886.00 frames. ], tot_loss[loss=0.3657, simple_loss=0.3968, pruned_loss=0.1673, over 4267169.58 frames. 
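The scaling.py:962 Whitening lines compare a per-module statistic against a limit; loosely, the metric asks how far the channel covariance is from a multiple of the identity. One formulation consistent with the logged values, shown below, is num_channels_per_group * trace(C @ C) / trace(C)**2, which is 1.0 for perfectly white features and approaches the group size as the covariance collapses onto a few directions. Treat this as a reconstruction inferred from the log, not a copy of scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        """x: (..., num_channels). Returns the mean over groups of
        d * trace(C @ C) / trace(C)**2 for each group's covariance C."""
        x = x.reshape(-1, x.shape[-1])               # (frames, channels)
        x = x.reshape(x.shape[0], num_groups, -1)    # split channels into groups
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]
            cov = xg.t() @ xg / xg.shape[0]          # (d, d) covariance
            d = cov.shape[0]
            tr = cov.diagonal().sum()
            metrics.append(d * (cov @ cov).diagonal().sum() / (tr * tr + 1e-20))
        return torch.stack(metrics).mean()

    x = torch.randn(4000, 256)        # near-white features: metric close to 1.0
    print(whitening_metric(x).item())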
], batch size: 607, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:15:13,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=68160.0, ans=0.125 2023-06-17 23:15:24,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0 2023-06-17 23:16:25,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-17 23:16:41,423 INFO [train.py:996] (3/4) Epoch 1, batch 11400, loss[loss=0.3034, simple_loss=0.3563, pruned_loss=0.1253, over 21271.00 frames. ], tot_loss[loss=0.3746, simple_loss=0.4065, pruned_loss=0.1714, over 4274086.10 frames. ], batch size: 176, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:17:28,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.138e+02 5.254e+02 6.973e+02 1.408e+03, threshold=1.051e+03, percent-clipped=10.0 2023-06-17 23:17:30,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=68520.0, ans=0.125 2023-06-17 23:17:42,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=68580.0, ans=0.0 2023-06-17 23:17:57,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=22.5 2023-06-17 23:18:17,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68640.0, ans=0.125 2023-06-17 23:18:25,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-17 23:18:26,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=68700.0, ans=0.125 2023-06-17 23:18:27,534 INFO [train.py:996] (3/4) Epoch 1, batch 11450, loss[loss=0.412, simple_loss=0.4368, pruned_loss=0.1935, over 21452.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.408, pruned_loss=0.1698, over 4272928.16 frames. ], batch size: 131, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:19:17,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=68820.0, ans=0.2 2023-06-17 23:19:31,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=68820.0, ans=10.0 2023-06-17 23:20:00,637 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:20:13,495 INFO [train.py:996] (3/4) Epoch 1, batch 11500, loss[loss=0.3579, simple_loss=0.4027, pruned_loss=0.1566, over 21595.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4111, pruned_loss=0.1713, over 4271342.23 frames. 
], batch size: 230, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:20:47,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=69060.0, ans=0.0 2023-06-17 23:21:00,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.282e+02 5.552e+02 6.865e+02 1.531e+03, threshold=1.110e+03, percent-clipped=3.0 2023-06-17 23:21:09,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69120.0, ans=0.125 2023-06-17 23:21:13,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=69120.0, ans=0.0 2023-06-17 23:22:09,453 INFO [train.py:996] (3/4) Epoch 1, batch 11550, loss[loss=0.4271, simple_loss=0.4759, pruned_loss=0.1891, over 21635.00 frames. ], tot_loss[loss=0.3782, simple_loss=0.4162, pruned_loss=0.1701, over 4274502.95 frames. ], batch size: 389, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:22:44,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=69360.0, ans=0.0 2023-06-17 23:22:55,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=15.0 2023-06-17 23:23:11,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:23:16,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=69480.0, ans=0.125 2023-06-17 23:23:38,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=69540.0, ans=0.125 2023-06-17 23:23:53,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=69600.0, ans=0.125 2023-06-17 23:23:54,930 INFO [train.py:996] (3/4) Epoch 1, batch 11600, loss[loss=0.3951, simple_loss=0.4354, pruned_loss=0.1774, over 21438.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.426, pruned_loss=0.1694, over 4269761.83 frames. ], batch size: 194, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:24:39,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.776e+02 4.538e+02 6.004e+02 8.984e+02 1.767e+03, threshold=1.201e+03, percent-clipped=15.0 2023-06-17 23:24:51,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=69720.0, ans=0.2 2023-06-17 23:24:59,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=69780.0, ans=0.125 2023-06-17 23:25:00,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-17 23:25:03,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=69780.0, ans=0.0 2023-06-17 23:25:21,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-17 23:25:32,106 INFO [train.py:996] (3/4) Epoch 1, batch 11650, loss[loss=0.3496, simple_loss=0.4158, pruned_loss=0.1417, over 21277.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.4311, pruned_loss=0.1692, over 4260887.21 frames. 
], batch size: 143, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:25:38,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=69900.0, ans=0.125 2023-06-17 23:26:08,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-17 23:27:01,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70140.0, ans=0.1 2023-06-17 23:27:01,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=70140.0, ans=0.0 2023-06-17 23:27:04,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=70140.0, ans=0.125 2023-06-17 23:27:15,723 INFO [train.py:996] (3/4) Epoch 1, batch 11700, loss[loss=0.3421, simple_loss=0.3598, pruned_loss=0.1622, over 21571.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.423, pruned_loss=0.1705, over 4253539.67 frames. ], batch size: 263, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:27:19,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-17 23:27:45,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=70260.0, ans=0.125 2023-06-17 23:27:55,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=70320.0, ans=0.0 2023-06-17 23:28:00,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 4.129e+02 5.507e+02 7.167e+02 1.590e+03, threshold=1.101e+03, percent-clipped=1.0 2023-06-17 23:28:05,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70320.0, ans=0.1 2023-06-17 23:28:49,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70440.0, ans=0.0 2023-06-17 23:28:52,869 INFO [train.py:996] (3/4) Epoch 1, batch 11750, loss[loss=0.3453, simple_loss=0.3621, pruned_loss=0.1643, over 21410.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4129, pruned_loss=0.1703, over 4260288.08 frames. ], batch size: 131, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:29:52,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=70680.0, ans=0.125 2023-06-17 23:30:16,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.29 vs. limit=15.0 2023-06-17 23:30:38,192 INFO [train.py:996] (3/4) Epoch 1, batch 11800, loss[loss=0.4081, simple_loss=0.4647, pruned_loss=0.1758, over 19839.00 frames. ], tot_loss[loss=0.382, simple_loss=0.4159, pruned_loss=0.174, over 4264055.80 frames. 
], batch size: 704, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:31:07,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=70860.0, ans=0.125 2023-06-17 23:31:16,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=70920.0, ans=0.2 2023-06-17 23:31:29,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.727e+02 4.871e+02 6.879e+02 1.447e+03, threshold=9.741e+02, percent-clipped=5.0 2023-06-17 23:31:59,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70980.0, ans=0.125 2023-06-17 23:32:22,100 INFO [train.py:996] (3/4) Epoch 1, batch 11850, loss[loss=0.3347, simple_loss=0.386, pruned_loss=0.1417, over 21761.00 frames. ], tot_loss[loss=0.3798, simple_loss=0.4159, pruned_loss=0.1718, over 4272422.14 frames. ], batch size: 247, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:32:51,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=71160.0, ans=0.02 2023-06-17 23:33:37,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=71280.0, ans=0.125 2023-06-17 23:33:46,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=71280.0, ans=0.95 2023-06-17 23:34:12,205 INFO [train.py:996] (3/4) Epoch 1, batch 11900, loss[loss=0.3177, simple_loss=0.3919, pruned_loss=0.1217, over 21784.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.4136, pruned_loss=0.167, over 4273384.19 frames. ], batch size: 282, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:34:14,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=71400.0, ans=0.125 2023-06-17 23:34:14,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=71400.0, ans=0.0 2023-06-17 23:34:46,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=71460.0, ans=0.125 2023-06-17 23:35:08,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.592e+02 4.787e+02 5.877e+02 1.275e+03, threshold=9.575e+02, percent-clipped=4.0 2023-06-17 23:35:26,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=71580.0, ans=0.2 2023-06-17 23:35:56,080 INFO [train.py:996] (3/4) Epoch 1, batch 11950, loss[loss=0.3275, simple_loss=0.3985, pruned_loss=0.1282, over 21623.00 frames. ], tot_loss[loss=0.3683, simple_loss=0.4127, pruned_loss=0.162, over 4264973.48 frames. 
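The grad_scale field in the train.py:996 records (moving between 16.0 and 32.0 in this stretch) behaves like the dynamic loss scale of fp16 training: halved when a step produces non-finite gradients, grown again after a run of clean steps. A sketch of that loop using torch.cuda.amp, whose GradScaler by default halves on overflow and doubles after 2000 consecutive finite steps; model, optimizer, and batch_to_loss are placeholders, and whether icefall wraps the scaler exactly this way is an assumption.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=1.0)

    def training_step(model, optimizer, batch_to_loss, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # forward pass in mixed precision
            loss = batch_to_loss(model, batch)
        scaler.scale(loss).backward()            # scale up to keep fp16 grads finite
        scaler.step(optimizer)                   # skipped internally on overflow
        scaler.update()                          # backoff or grow the scale
        return scaler.get_scale()                # the value logged as grad_scale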
], batch size: 247, lr: 3.28e-02, grad_scale: 16.0 2023-06-17 23:35:59,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=71700.0, ans=0.2 2023-06-17 23:36:01,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=71700.0, ans=0.125 2023-06-17 23:36:14,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=71700.0, ans=0.125 2023-06-17 23:36:57,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=71820.0, ans=0.0 2023-06-17 23:37:32,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-17 23:37:39,770 INFO [train.py:996] (3/4) Epoch 1, batch 12000, loss[loss=0.3913, simple_loss=0.4098, pruned_loss=0.1864, over 15346.00 frames. ], tot_loss[loss=0.3623, simple_loss=0.4045, pruned_loss=0.1601, over 4257837.15 frames. ], batch size: 60, lr: 3.28e-02, grad_scale: 32.0 2023-06-17 23:37:39,770 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-17 23:37:48,022 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6572, 2.1020, 2.3603, 2.1638], device='cuda:3') 2023-06-17 23:37:57,340 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3348, simple_loss=0.4196, pruned_loss=0.125, over 1796401.00 frames. 2023-06-17 23:37:57,341 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-17 23:38:52,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.693e+02 4.861e+02 6.052e+02 1.192e+03, threshold=9.721e+02, percent-clipped=3.0 2023-06-17 23:39:18,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=72180.0, ans=0.125 2023-06-17 23:39:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72180.0, ans=0.1 2023-06-17 23:39:41,327 INFO [train.py:996] (3/4) Epoch 1, batch 12050, loss[loss=0.3599, simple_loss=0.3973, pruned_loss=0.1613, over 21867.00 frames. ], tot_loss[loss=0.3645, simple_loss=0.4024, pruned_loss=0.1633, over 4269322.96 frames. ], batch size: 298, lr: 3.27e-02, grad_scale: 32.0 2023-06-17 23:40:10,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=72300.0, ans=0.2 2023-06-17 23:40:22,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=72360.0, ans=0.125 2023-06-17 23:40:52,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=72480.0, ans=0.02 2023-06-17 23:41:32,422 INFO [train.py:996] (3/4) Epoch 1, batch 12100, loss[loss=0.4081, simple_loss=0.4377, pruned_loss=0.1892, over 21379.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.4155, pruned_loss=0.171, over 4272472.63 frames. ], batch size: 548, lr: 3.27e-02, grad_scale: 16.0 2023-06-17 23:42:02,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=15.0 2023-06-17 23:42:05,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=72660.0, ans=0.05 2023-06-17 23:42:26,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.439e+02 6.434e+02 8.417e+02 1.460e+03, threshold=1.287e+03, percent-clipped=16.0 2023-06-17 23:42:38,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=72780.0, ans=0.125 2023-06-17 23:43:17,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=72840.0, ans=0.2 2023-06-17 23:43:23,506 INFO [train.py:996] (3/4) Epoch 1, batch 12150, loss[loss=0.4254, simple_loss=0.4873, pruned_loss=0.1818, over 21673.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.42, pruned_loss=0.1711, over 4275055.12 frames. ], batch size: 441, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:43:49,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=15.0 2023-06-17 23:43:50,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=72960.0, ans=0.0 2023-06-17 23:43:56,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=72960.0, ans=0.125 2023-06-17 23:43:58,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-17 23:44:31,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=73080.0, ans=0.125 2023-06-17 23:44:54,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=73140.0, ans=0.125 2023-06-17 23:45:00,496 INFO [train.py:996] (3/4) Epoch 1, batch 12200, loss[loss=0.3594, simple_loss=0.3883, pruned_loss=0.1653, over 21757.00 frames. ], tot_loss[loss=0.3779, simple_loss=0.4155, pruned_loss=0.1701, over 4274876.17 frames. ], batch size: 351, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:45:05,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=73200.0, ans=0.2 2023-06-17 23:45:05,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=73200.0, ans=0.0 2023-06-17 23:45:45,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=22.5 2023-06-17 23:45:45,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.853e+02 4.664e+02 5.869e+02 1.070e+03, threshold=9.327e+02, percent-clipped=0.0 2023-06-17 23:46:06,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=73380.0, ans=0.0 2023-06-17 23:46:28,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=73440.0, ans=0.0 2023-06-17 23:46:42,500 INFO [train.py:996] (3/4) Epoch 1, batch 12250, loss[loss=0.3109, simple_loss=0.381, pruned_loss=0.1204, over 21212.00 frames. 
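In the batch-12000 validation block above, zipformer.py:1728 prints attn_weights_entropy, an entropy of the self-attention weights with four values, presumably one per attention head of that layer. Below is a sketch of such a diagnostic; the (num_heads, num_queries, num_keys) layout is an assumption.

    import torch

    def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
        """attn_weights: (num_heads, num_queries, num_keys), rows summing to 1.
        Returns one entropy value per head; higher means flatter attention."""
        p = attn_weights.clamp(min=1e-20)
        ent = -(p * p.log()).sum(dim=-1)     # entropy over the key axis
        return ent.mean(dim=-1)              # average over queries

    # uniform attention over 16 keys gives log(16) = 2.77 per head
    w = torch.full((4, 10, 16), 1 / 16)
    print(attn_weights_entropy(w))           # tensor([2.7726, 2.7726, 2.7726, 2.7726])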
], tot_loss[loss=0.3665, simple_loss=0.4054, pruned_loss=0.1638, over 4271784.26 frames. ], batch size: 548, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:47:31,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=73620.0, ans=0.0 2023-06-17 23:47:32,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73620.0, ans=0.1 2023-06-17 23:48:13,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=73740.0, ans=0.0 2023-06-17 23:48:25,683 INFO [train.py:996] (3/4) Epoch 1, batch 12300, loss[loss=0.2691, simple_loss=0.3258, pruned_loss=0.1062, over 21375.00 frames. ], tot_loss[loss=0.3484, simple_loss=0.3919, pruned_loss=0.1524, over 4272941.25 frames. ], batch size: 131, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:48:28,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-17 23:48:33,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73800.0, ans=0.1 2023-06-17 23:48:54,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73860.0, ans=0.125 2023-06-17 23:49:12,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.740e+02 4.870e+02 6.587e+02 1.091e+03, threshold=9.740e+02, percent-clipped=4.0 2023-06-17 23:49:27,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73980.0, ans=0.1 2023-06-17 23:50:01,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=74040.0, ans=22.5 2023-06-17 23:50:08,209 INFO [train.py:996] (3/4) Epoch 1, batch 12350, loss[loss=0.453, simple_loss=0.4691, pruned_loss=0.2184, over 21852.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3949, pruned_loss=0.1519, over 4271059.30 frames. ], batch size: 414, lr: 3.24e-02, grad_scale: 16.0 2023-06-17 23:50:10,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74100.0, ans=0.1 2023-06-17 23:50:18,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-17 23:50:51,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=74220.0, ans=0.0 2023-06-17 23:51:20,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=74280.0, ans=0.2 2023-06-17 23:51:37,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=74340.0, ans=0.0 2023-06-17 23:51:43,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74340.0, ans=0.125 2023-06-17 23:51:49,163 INFO [train.py:996] (3/4) Epoch 1, batch 12400, loss[loss=0.4582, simple_loss=0.4542, pruned_loss=0.231, over 21758.00 frames. ], tot_loss[loss=0.3573, simple_loss=0.3987, pruned_loss=0.158, over 4283320.71 frames. 
], batch size: 508, lr: 3.24e-02, grad_scale: 32.0 2023-06-17 23:51:57,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-17 23:52:00,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=74400.0, ans=0.0 2023-06-17 23:52:34,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.026e+02 5.096e+02 6.661e+02 1.103e+03, threshold=1.019e+03, percent-clipped=2.0 2023-06-17 23:53:31,330 INFO [train.py:996] (3/4) Epoch 1, batch 12450, loss[loss=0.3359, simple_loss=0.357, pruned_loss=0.1574, over 20050.00 frames. ], tot_loss[loss=0.3671, simple_loss=0.405, pruned_loss=0.1646, over 4287923.02 frames. ], batch size: 702, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:54:01,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=74760.0, ans=0.125 2023-06-17 23:54:04,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=74760.0, ans=0.2 2023-06-17 23:54:08,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=74820.0, ans=0.125 2023-06-17 23:54:34,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=74820.0, ans=0.2 2023-06-17 23:55:16,042 INFO [train.py:996] (3/4) Epoch 1, batch 12500, loss[loss=0.4495, simple_loss=0.4991, pruned_loss=0.2, over 21616.00 frames. ], tot_loss[loss=0.3803, simple_loss=0.4183, pruned_loss=0.1711, over 4288426.46 frames. ], batch size: 389, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:55:31,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75060.0, ans=0.125 2023-06-17 23:56:10,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-17 23:56:14,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.070e+02 4.603e+02 5.505e+02 7.191e+02 1.270e+03, threshold=1.101e+03, percent-clipped=4.0 2023-06-17 23:56:14,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=75120.0, ans=0.125 2023-06-17 23:56:53,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75240.0, ans=0.1 2023-06-17 23:57:02,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-17 23:57:02,632 INFO [train.py:996] (3/4) Epoch 1, batch 12550, loss[loss=0.3798, simple_loss=0.4204, pruned_loss=0.1696, over 21655.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.4257, pruned_loss=0.1749, over 4285923.23 frames. 
], batch size: 263, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:57:06,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=75300.0, ans=0.0 2023-06-17 23:57:38,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=75360.0, ans=0.125 2023-06-17 23:57:39,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=75360.0, ans=0.125 2023-06-17 23:57:48,175 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:58:13,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2023-06-17 23:58:40,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75540.0, ans=0.125 2023-06-17 23:58:44,725 INFO [train.py:996] (3/4) Epoch 1, batch 12600, loss[loss=0.2668, simple_loss=0.3147, pruned_loss=0.1094, over 21870.00 frames. ], tot_loss[loss=0.3818, simple_loss=0.4222, pruned_loss=0.1707, over 4282470.91 frames. ], batch size: 98, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:58:57,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=75600.0, ans=0.0 2023-06-17 23:59:41,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 3.697e+02 4.571e+02 5.714e+02 1.241e+03, threshold=9.141e+02, percent-clipped=1.0 2023-06-17 23:59:49,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=75780.0, ans=0.125 2023-06-17 23:59:56,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-18 00:00:21,816 INFO [train.py:996] (3/4) Epoch 1, batch 12650, loss[loss=0.3382, simple_loss=0.3843, pruned_loss=0.146, over 21386.00 frames. ], tot_loss[loss=0.3677, simple_loss=0.4099, pruned_loss=0.1628, over 4273494.26 frames. ], batch size: 548, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:00:30,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=75900.0, ans=0.125 2023-06-18 00:00:49,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=75960.0, ans=0.125 2023-06-18 00:00:57,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=75960.0, ans=0.0 2023-06-18 00:01:22,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76020.0, ans=0.1 2023-06-18 00:01:22,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=76020.0, ans=0.125 2023-06-18 00:01:52,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:02:03,320 INFO [train.py:996] (3/4) Epoch 1, batch 12700, loss[loss=0.3799, simple_loss=0.4109, pruned_loss=0.1745, over 21339.00 frames. ], tot_loss[loss=0.3722, simple_loss=0.4101, pruned_loss=0.1671, over 4283179.13 frames. 
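Each train.py:996 record pairs a per-batch loss[...] with a running tot_loss[... over N frames], and N hovers near 4.27e6 rather than growing monotonically, which points at a decayed (or windowed) frame-weighted average of recent batches rather than an epoch-cumulative one. A minimal sketch of that aggregation; LossTracker is an illustrative stand-in, and decay=0.995 is chosen only because it reproduces the observed plateau (roughly 21000 frames per batch / 0.005 = 4.2e6 frames at steady state).

    from collections import defaultdict

    class LossTracker:
        """Accumulate losses weighted by frame count, with exponential decay,
        and report the running frame-weighted means."""

        def __init__(self, decay=0.995):
            self.decay = decay
            self.sums = defaultdict(float)
            self.frames = 0.0

        def update(self, frames, **losses):
            self.frames = self.decay * self.frames + frames
            for name, value in losses.items():
                self.sums[name] = self.decay * self.sums[name] + value * frames

        def averages(self):
            return {k: v / self.frames for k, v in self.sums.items()}

    tracker = LossTracker()
    tracker.update(21850, loss=0.37, simple_loss=0.41, pruned_loss=0.17)
    tracker.update(21200, loss=0.35, simple_loss=0.40, pruned_loss=0.16)
    print(tracker.averages(), "over", tracker.frames, "frames")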
], batch size: 176, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:02:03,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=76200.0, ans=0.0 2023-06-18 00:02:54,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.032e+02 5.068e+02 7.064e+02 1.461e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-18 00:03:00,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76320.0, ans=0.1 2023-06-18 00:03:40,085 INFO [train.py:996] (3/4) Epoch 1, batch 12750, loss[loss=0.4, simple_loss=0.4232, pruned_loss=0.1884, over 21934.00 frames. ], tot_loss[loss=0.3735, simple_loss=0.4113, pruned_loss=0.1678, over 4283603.62 frames. ], batch size: 113, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:03:56,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=76500.0, ans=0.125 2023-06-18 00:04:05,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76560.0, ans=0.1 2023-06-18 00:04:39,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-18 00:05:04,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76740.0, ans=0.125 2023-06-18 00:05:32,631 INFO [train.py:996] (3/4) Epoch 1, batch 12800, loss[loss=0.3599, simple_loss=0.3966, pruned_loss=0.1616, over 21824.00 frames. ], tot_loss[loss=0.3769, simple_loss=0.4124, pruned_loss=0.1707, over 4288153.95 frames. ], batch size: 107, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:05:36,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.20 vs. limit=15.0 2023-06-18 00:06:20,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.969e+02 4.961e+02 6.426e+02 1.503e+03, threshold=9.923e+02, percent-clipped=9.0 2023-06-18 00:06:29,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76980.0, ans=0.1 2023-06-18 00:07:06,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=77040.0, ans=0.95 2023-06-18 00:07:12,457 INFO [train.py:996] (3/4) Epoch 1, batch 12850, loss[loss=0.3679, simple_loss=0.4271, pruned_loss=0.1543, over 21755.00 frames. ], tot_loss[loss=0.382, simple_loss=0.4165, pruned_loss=0.1738, over 4284469.90 frames. ], batch size: 351, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:07:28,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=15.0 2023-06-18 00:07:31,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=77100.0, ans=0.125 2023-06-18 00:07:48,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=77220.0, ans=0.125 2023-06-18 00:08:31,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77280.0, ans=0.1 2023-06-18 00:09:00,984 INFO [train.py:996] (3/4) Epoch 1, batch 12900, loss[loss=0.3302, simple_loss=0.3842, pruned_loss=0.138, over 21681.00 frames. ], tot_loss[loss=0.374, simple_loss=0.4127, pruned_loss=0.1677, over 4275695.26 frames. ], batch size: 247, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:09:03,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=77400.0, ans=0.0 2023-06-18 00:09:08,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-18 00:09:11,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=77400.0, ans=0.125 2023-06-18 00:09:27,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=77460.0, ans=0.0 2023-06-18 00:09:46,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.853e+02 4.882e+02 6.013e+02 9.581e+02, threshold=9.764e+02, percent-clipped=0.0 2023-06-18 00:10:09,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-18 00:10:34,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77640.0, ans=0.1 2023-06-18 00:10:39,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=77640.0, ans=0.2 2023-06-18 00:10:43,801 INFO [train.py:996] (3/4) Epoch 1, batch 12950, loss[loss=0.3683, simple_loss=0.4027, pruned_loss=0.167, over 21736.00 frames. ], tot_loss[loss=0.368, simple_loss=0.4088, pruned_loss=0.1636, over 4279487.06 frames. ], batch size: 298, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:11:57,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=77880.0, ans=0.2 2023-06-18 00:12:23,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77940.0, ans=0.1 2023-06-18 00:12:28,706 INFO [train.py:996] (3/4) Epoch 1, batch 13000, loss[loss=0.2856, simple_loss=0.3532, pruned_loss=0.109, over 21739.00 frames. ], tot_loss[loss=0.3676, simple_loss=0.4085, pruned_loss=0.1634, over 4279790.02 frames. ], batch size: 332, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:12:46,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-06-18 00:13:10,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.47 vs. 
limit=22.5 2023-06-18 00:13:20,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.234e+02 5.570e+02 6.916e+02 1.204e+03, threshold=1.114e+03, percent-clipped=4.0 2023-06-18 00:14:09,919 INFO [train.py:996] (3/4) Epoch 1, batch 13050, loss[loss=0.3948, simple_loss=0.4172, pruned_loss=0.1862, over 21766.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.4011, pruned_loss=0.1589, over 4276031.41 frames. ], batch size: 247, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:14:24,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=78360.0, ans=0.125 2023-06-18 00:15:21,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-18 00:15:55,015 INFO [train.py:996] (3/4) Epoch 1, batch 13100, loss[loss=0.3903, simple_loss=0.4281, pruned_loss=0.1763, over 21358.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.4052, pruned_loss=0.1608, over 4283178.53 frames. ], batch size: 159, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:16:25,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.26 vs. limit=6.0 2023-06-18 00:16:54,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.696e+02 5.724e+02 7.991e+02 1.405e+03, threshold=1.145e+03, percent-clipped=4.0 2023-06-18 00:16:54,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78720.0, ans=0.125 2023-06-18 00:17:45,900 INFO [train.py:996] (3/4) Epoch 1, batch 13150, loss[loss=0.4238, simple_loss=0.4437, pruned_loss=0.2019, over 21374.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.4111, pruned_loss=0.1667, over 4280562.13 frames. ], batch size: 507, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:19:22,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79140.0, ans=0.125 2023-06-18 00:19:30,330 INFO [train.py:996] (3/4) Epoch 1, batch 13200, loss[loss=0.3966, simple_loss=0.4195, pruned_loss=0.1869, over 22015.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4087, pruned_loss=0.1661, over 4281437.54 frames. ], batch size: 317, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:20:18,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=79320.0, ans=0.0 2023-06-18 00:20:27,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.628e+02 3.872e+02 4.776e+02 6.394e+02 8.489e+02, threshold=9.552e+02, percent-clipped=0.0 2023-06-18 00:20:28,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-18 00:20:31,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=79320.0, ans=0.0 2023-06-18 00:20:40,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.59 vs. 
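Each optim.py:471 record summarizes recent gradient norms with five numbers (min, 25th percentile, median, 75th percentile, max), followed by the current clipping threshold and the share of recent steps that were clipped. The logged thresholds equal Clipping_scale times the median: in the record above, 2.0 * 5.570e+02 = 1.114e+03. A sketch under the assumption of a sliding window of recent norms; the class name and window size are illustrative:

```python
from collections import deque
import statistics
import torch

class AdaptiveGradClipper:
    """Clip to clipping_scale x (median of recent grad norms); the
    logged thresholds match this rule, e.g. 1.114e+03 = 2.0 x 5.570e+02."""

    def __init__(self, clipping_scale: float = 2.0, history: int = 400):
        self.clipping_scale = clipping_scale
        self.norms: deque = deque(maxlen=history)
        self.clipped = 0

    def clip_(self, params) -> float:
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * statistics.median(self.norms)
        if norm > threshold:
            self.clipped += 1  # feeds the percent-clipped figure
            for g in grads:
                g.mul_(threshold / norm)
        return norm

    def quartiles(self):
        # the five numbers printed in each 'grad-norm quartiles' record
        s = sorted(self.norms)
        return [s[int(q * (len(s) - 1))] for q in (0, 0.25, 0.5, 0.75, 1.0)]
```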
limit=15.0 2023-06-18 00:21:08,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=79440.0, ans=0.0 2023-06-18 00:21:18,125 INFO [train.py:996] (3/4) Epoch 1, batch 13250, loss[loss=0.3634, simple_loss=0.4116, pruned_loss=0.1576, over 21785.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.4094, pruned_loss=0.1685, over 4287008.71 frames. ], batch size: 247, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:21:33,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=79500.0, ans=0.95 2023-06-18 00:21:43,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79560.0, ans=0.1 2023-06-18 00:22:43,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=79740.0, ans=0.0 2023-06-18 00:23:07,128 INFO [train.py:996] (3/4) Epoch 1, batch 13300, loss[loss=0.3828, simple_loss=0.4202, pruned_loss=0.1727, over 21516.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4111, pruned_loss=0.167, over 4293012.20 frames. ], batch size: 131, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:23:27,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=79860.0, ans=0.125 2023-06-18 00:23:38,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=79860.0, ans=0.125 2023-06-18 00:23:40,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=79860.0, ans=0.0 2023-06-18 00:23:55,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.809e+02 3.934e+02 5.014e+02 6.811e+02 1.186e+03, threshold=1.003e+03, percent-clipped=5.0 2023-06-18 00:24:31,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=80040.0, ans=0.125 2023-06-18 00:24:42,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=80040.0, ans=0.125 2023-06-18 00:24:42,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-18 00:24:51,641 INFO [train.py:996] (3/4) Epoch 1, batch 13350, loss[loss=0.3357, simple_loss=0.3635, pruned_loss=0.154, over 16349.00 frames. ], tot_loss[loss=0.3778, simple_loss=0.4154, pruned_loss=0.1701, over 4287289.74 frames. 
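The scaling.py:962 lines fire when a Whiten module's covariance metric exceeds its scheduled limit, at which point the module pushes the activations back toward an isotropic (white) covariance. Below is one plausible form of such a metric, normalized so that an isotropic covariance scores exactly 1.0; the exact definition in scaling.py may differ. The whiten_keys entries apply the same idea per group of attention-key channels:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """How far the channel covariance of x is from a multiple of I.

    x: (num_frames, num_channels).  Channels are split into groups and
    the metric is averaged over groups.  Equals 1.0 for an isotropic
    covariance and grows as variance concentrates in few directions.
    """
    n, c = x.shape
    g = c // num_groups
    x = x.reshape(n, num_groups, g).transpose(0, 1)   # (groups, n, g)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n      # (groups, g, g)
    # ratio of mean squared eigenvalue to squared mean eigenvalue
    num = (cov ** 2).sum(dim=(1, 2)) / g
    den = (torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) / g) ** 2
    return (num / den).mean()

x = torch.randn(8000, 256)
print(whitening_metric(x))  # close to 1.0 for white noise
```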
], batch size: 60, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:24:53,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=80100.0, ans=0.125 2023-06-18 00:25:34,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80220.0, ans=0.1 2023-06-18 00:25:44,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=80220.0, ans=0.0 2023-06-18 00:25:54,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=80280.0, ans=0.2 2023-06-18 00:26:01,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=80280.0, ans=0.0 2023-06-18 00:26:20,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=80340.0, ans=0.0 2023-06-18 00:26:40,420 INFO [train.py:996] (3/4) Epoch 1, batch 13400, loss[loss=0.3651, simple_loss=0.4032, pruned_loss=0.1635, over 21804.00 frames. ], tot_loss[loss=0.3808, simple_loss=0.4172, pruned_loss=0.1722, over 4289159.30 frames. ], batch size: 247, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:27:23,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.62 vs. limit=22.5 2023-06-18 00:27:27,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.979e+02 4.393e+02 5.548e+02 7.060e+02 1.249e+03, threshold=1.110e+03, percent-clipped=4.0 2023-06-18 00:27:34,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=80580.0, ans=0.125 2023-06-18 00:27:55,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-18 00:27:56,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=80580.0, ans=10.0 2023-06-18 00:28:23,655 INFO [train.py:996] (3/4) Epoch 1, batch 13450, loss[loss=0.3539, simple_loss=0.3887, pruned_loss=0.1595, over 21696.00 frames. ], tot_loss[loss=0.3839, simple_loss=0.4176, pruned_loss=0.1751, over 4280692.00 frames. ], batch size: 298, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:28:40,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=80760.0, ans=0.0 2023-06-18 00:28:55,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-18 00:28:59,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=80820.0, ans=0.125 2023-06-18 00:29:06,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=80820.0, ans=0.5 2023-06-18 00:30:08,358 INFO [train.py:996] (3/4) Epoch 1, batch 13500, loss[loss=0.4549, simple_loss=0.4666, pruned_loss=0.2216, over 21448.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.4041, pruned_loss=0.1679, over 4279602.36 frames. 
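Two structural knobs recur in the scheduled names: *_skip_rate values (ans=0.0 here, i.e. disabled at this batch count) give the probability of skipping a sub-module entirely during training, and bypass.scale_min (ans=0.2) is the lower clamp on a learned per-channel interpolation between a layer's input and its output. A sketch of the bypass idea, with illustrative names:

```python
import torch
from torch import nn

class Bypass(nn.Module):
    """Learned per-channel interpolation between a module's input and
    output: out = x + s * (f(x) - x), with s clamped to [scale_min, 1].

    Sketch: scale_min itself is scheduled (the bypass.scale_min entries
    in the log), so early in training each layer is held close to the
    identity and is only gradually allowed to dominate.
    """

    def __init__(self, num_channels: int, scale_min: float = 0.2):
        super().__init__()
        self.scale = nn.Parameter(torch.full((num_channels,), 0.5))
        self.scale_min = scale_min

    def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
        s = self.scale.clamp(min=self.scale_min, max=1.0)
        return x + s * (fx - x)

x = torch.randn(10, 256)
layer = Bypass(256)
print(layer(x, torch.relu(x)).shape)  # torch.Size([10, 256])
```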
], batch size: 509, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:30:20,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=81000.0, ans=0.0 2023-06-18 00:30:22,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=81000.0, ans=0.0 2023-06-18 00:30:31,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-18 00:31:07,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.002e+02 4.680e+02 6.090e+02 1.151e+03, threshold=9.360e+02, percent-clipped=1.0 2023-06-18 00:31:18,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-18 00:31:52,219 INFO [train.py:996] (3/4) Epoch 1, batch 13550, loss[loss=0.369, simple_loss=0.4139, pruned_loss=0.1621, over 21785.00 frames. ], tot_loss[loss=0.375, simple_loss=0.4121, pruned_loss=0.1689, over 4284473.39 frames. ], batch size: 124, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:33:34,640 INFO [train.py:996] (3/4) Epoch 1, batch 13600, loss[loss=0.3825, simple_loss=0.4058, pruned_loss=0.1796, over 21341.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.4156, pruned_loss=0.171, over 4287180.48 frames. ], batch size: 176, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:33:49,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=81600.0, ans=0.125 2023-06-18 00:34:27,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 4.484e+02 6.125e+02 7.575e+02 1.688e+03, threshold=1.225e+03, percent-clipped=13.0 2023-06-18 00:34:51,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-18 00:34:52,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=81780.0, ans=0.2 2023-06-18 00:34:55,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=81840.0, ans=0.125 2023-06-18 00:35:11,054 INFO [train.py:996] (3/4) Epoch 1, batch 13650, loss[loss=0.3403, simple_loss=0.3714, pruned_loss=0.1547, over 21775.00 frames. ], tot_loss[loss=0.37, simple_loss=0.4082, pruned_loss=0.166, over 4292406.78 frames. ], batch size: 371, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:35:21,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=81900.0, ans=0.125 2023-06-18 00:36:38,705 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:36:42,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=12.0 2023-06-18 00:36:49,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=82140.0, ans=0.0 2023-06-18 00:36:59,454 INFO [train.py:996] (3/4) Epoch 1, batch 13700, loss[loss=0.3015, simple_loss=0.3311, pruned_loss=0.136, over 16449.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4002, pruned_loss=0.165, over 4278557.21 frames. 
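The lr field decays smoothly across these records, from 3.21e-02 near batch 12700, and it falls further to 2.93e-02 by batch 16000 later in this log. That is consistent with the Eden schedule used with ScaledAdam in icefall, lr = base_lr * ((batch/lr_batches)^2 + 1)^(-1/4) * ((epoch/lr_epochs)^2 + 1)^(-1/4): taking base_lr = 0.045, lr_batches = 7500, and the epoch term still at zero during epoch 1 reproduces the logged values. A worked check:

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Eden schedule: smooth power-law decay in both batches and epochs."""
    return (
        base_lr
        * ((batch / lr_batches) ** 2 + 1) ** -0.25
        * ((epoch / lr_epochs) ** 2 + 1) ** -0.25
    )

# Reproduces the logged lrs when the epoch term is still zero:
print(f"{eden_lr(0.045, 12750, 0):.2e}")  # 3.20e-02, as at batch 12750
print(f"{eden_lr(0.045, 16000, 0):.2e}")  # 2.93e-02, as at batch 16000
```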
], batch size: 64, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:37:25,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=82260.0, ans=0.125 2023-06-18 00:37:39,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-18 00:37:47,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-18 00:37:53,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.850e+02 5.196e+02 6.756e+02 1.127e+03, threshold=1.039e+03, percent-clipped=0.0 2023-06-18 00:38:07,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=82380.0, ans=0.2 2023-06-18 00:38:10,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=82380.0, ans=0.2 2023-06-18 00:38:34,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=82440.0, ans=0.125 2023-06-18 00:38:43,653 INFO [train.py:996] (3/4) Epoch 1, batch 13750, loss[loss=0.4354, simple_loss=0.4605, pruned_loss=0.2051, over 21593.00 frames. ], tot_loss[loss=0.3633, simple_loss=0.4, pruned_loss=0.1633, over 4271008.10 frames. ], batch size: 442, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:38:56,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=82500.0, ans=0.0 2023-06-18 00:38:58,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.19 vs. limit=15.0 2023-06-18 00:39:39,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=82620.0, ans=0.125 2023-06-18 00:40:32,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-18 00:40:40,335 INFO [train.py:996] (3/4) Epoch 1, batch 13800, loss[loss=0.3608, simple_loss=0.4345, pruned_loss=0.1435, over 21802.00 frames. ], tot_loss[loss=0.3651, simple_loss=0.4062, pruned_loss=0.1621, over 4264966.54 frames. ], batch size: 316, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:40:57,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=82860.0, ans=0.02 2023-06-18 00:40:57,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-18 00:41:28,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 3.900e+02 5.256e+02 6.721e+02 1.169e+03, threshold=1.051e+03, percent-clipped=1.0 2023-06-18 00:42:22,643 INFO [train.py:996] (3/4) Epoch 1, batch 13850, loss[loss=0.3836, simple_loss=0.4224, pruned_loss=0.1724, over 21432.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.4117, pruned_loss=0.163, over 4271347.45 frames. 
], batch size: 211, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:43:17,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=83220.0, ans=0.0 2023-06-18 00:43:18,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83220.0, ans=0.1 2023-06-18 00:43:24,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=83280.0, ans=0.125 2023-06-18 00:43:57,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=83340.0, ans=0.5 2023-06-18 00:44:06,033 INFO [train.py:996] (3/4) Epoch 1, batch 13900, loss[loss=0.4073, simple_loss=0.429, pruned_loss=0.1928, over 21855.00 frames. ], tot_loss[loss=0.3764, simple_loss=0.416, pruned_loss=0.1684, over 4277698.54 frames. ], batch size: 332, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:44:44,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83520.0, ans=0.125 2023-06-18 00:44:48,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=83520.0, ans=0.125 2023-06-18 00:44:58,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 4.127e+02 5.100e+02 6.768e+02 1.105e+03, threshold=1.020e+03, percent-clipped=2.0 2023-06-18 00:44:59,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.90 vs. limit=22.5 2023-06-18 00:45:22,326 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:45:29,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2023-06-18 00:45:37,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=83640.0, ans=0.0 2023-06-18 00:45:48,143 INFO [train.py:996] (3/4) Epoch 1, batch 13950, loss[loss=0.3661, simple_loss=0.4072, pruned_loss=0.1625, over 21896.00 frames. ], tot_loss[loss=0.381, simple_loss=0.419, pruned_loss=0.1715, over 4279213.14 frames. ], batch size: 118, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:46:00,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83700.0, ans=0.125 2023-06-18 00:47:30,506 INFO [train.py:996] (3/4) Epoch 1, batch 14000, loss[loss=0.3196, simple_loss=0.3728, pruned_loss=0.1332, over 21412.00 frames. ], tot_loss[loss=0.3745, simple_loss=0.4137, pruned_loss=0.1676, over 4277954.89 frames. ], batch size: 548, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:47:54,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=84060.0, ans=0.0 2023-06-18 00:47:56,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=84060.0, ans=10.0 2023-06-18 00:48:01,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. 
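The scaling.py:1052 WithLoss lines report the running sum of an auxiliary penalty attached to the self-attention weights; it reads 0.000e+00 whenever the constraint is currently satisfied. A speculative sketch of the kind of wrapper that could produce such lines: identity in the forward pass, the penalty's gradient injected in backward, and the penalty value accumulated for periodic logging. All names here are illustrative and the real WithLoss in scaling.py may work differently:

```python
import torch

class WithAuxLoss(torch.autograd.Function):
    """Identity in forward; backward adds the gradient of an auxiliary
    penalty on the activations.  A running total of the penalty feeds
    the periodic 'loss-sum=' log lines.  Sketch only."""

    loss_sum = 0.0

    @staticmethod
    def forward(ctx, x, penalty_fn, scale):
        ctx.penalty_fn, ctx.scale = penalty_fn, scale
        ctx.save_for_backward(x)
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        with torch.enable_grad():
            xd = x.detach().requires_grad_(True)
            penalty = ctx.penalty_fn(xd)
            WithAuxLoss.loss_sum += float(penalty)
            (aux_grad,) = torch.autograd.grad(penalty, xd)
        return grad_out + ctx.scale * aux_grad, None, None

x = torch.randn(4, 8, requires_grad=True)
y = WithAuxLoss.apply(x, lambda t: (t ** 2).mean(), 0.01)
y.sum().backward()
print(WithAuxLoss.loss_sum)
```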
limit=6.0 2023-06-18 00:48:12,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=84120.0, ans=0.125 2023-06-18 00:48:28,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.708e+02 4.933e+02 6.099e+02 9.890e+02, threshold=9.866e+02, percent-clipped=0.0 2023-06-18 00:48:35,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=84180.0, ans=0.0 2023-06-18 00:49:18,907 INFO [train.py:996] (3/4) Epoch 1, batch 14050, loss[loss=0.2855, simple_loss=0.3424, pruned_loss=0.1142, over 21758.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.4099, pruned_loss=0.1618, over 4263257.24 frames. ], batch size: 282, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:49:19,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=84300.0, ans=0.125 2023-06-18 00:49:21,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=84300.0, ans=0.125 2023-06-18 00:49:30,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=84300.0, ans=0.2 2023-06-18 00:49:37,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=84360.0, ans=0.07 2023-06-18 00:51:01,823 INFO [train.py:996] (3/4) Epoch 1, batch 14100, loss[loss=0.3391, simple_loss=0.3527, pruned_loss=0.1627, over 15008.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4027, pruned_loss=0.1618, over 4263118.36 frames. ], batch size: 61, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:51:07,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=84600.0, ans=0.2 2023-06-18 00:51:34,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=15.0 2023-06-18 00:51:54,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 4.143e+02 4.965e+02 6.574e+02 1.166e+03, threshold=9.930e+02, percent-clipped=2.0 2023-06-18 00:52:09,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=84780.0, ans=0.5 2023-06-18 00:52:29,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=84840.0, ans=0.125 2023-06-18 00:52:37,610 INFO [train.py:996] (3/4) Epoch 1, batch 14150, loss[loss=0.4192, simple_loss=0.4538, pruned_loss=0.1923, over 21495.00 frames. ], tot_loss[loss=0.3651, simple_loss=0.4053, pruned_loss=0.1625, over 4263786.54 frames. ], batch size: 509, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:53:05,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-18 00:53:07,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=84960.0, ans=0.0 2023-06-18 00:53:09,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.60 vs. 
limit=22.5 2023-06-18 00:53:35,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=85020.0, ans=0.0 2023-06-18 00:53:46,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85080.0, ans=0.125 2023-06-18 00:53:59,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=85080.0, ans=0.0 2023-06-18 00:54:18,328 INFO [train.py:996] (3/4) Epoch 1, batch 14200, loss[loss=0.2901, simple_loss=0.3459, pruned_loss=0.1171, over 21664.00 frames. ], tot_loss[loss=0.3596, simple_loss=0.4011, pruned_loss=0.1591, over 4269501.02 frames. ], batch size: 230, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:54:24,903 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.438e-02 2023-06-18 00:54:54,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.35 vs. limit=15.0 2023-06-18 00:54:55,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=85320.0, ans=0.2 2023-06-18 00:55:09,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 4.023e+02 4.862e+02 6.439e+02 1.166e+03, threshold=9.724e+02, percent-clipped=3.0 2023-06-18 00:55:28,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.18 vs. limit=10.0 2023-06-18 00:55:58,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=85500.0, ans=0.125 2023-06-18 00:55:59,264 INFO [train.py:996] (3/4) Epoch 1, batch 14250, loss[loss=0.3538, simple_loss=0.4096, pruned_loss=0.149, over 20929.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.3949, pruned_loss=0.1589, over 4264619.33 frames. ], batch size: 607, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:56:04,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85500.0, ans=0.125 2023-06-18 00:57:36,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85740.0, ans=0.0 2023-06-18 00:57:43,759 INFO [train.py:996] (3/4) Epoch 1, batch 14300, loss[loss=0.3485, simple_loss=0.4085, pruned_loss=0.1442, over 21587.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3977, pruned_loss=0.1583, over 4256054.98 frames. ], batch size: 230, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:58:38,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.854e+02 5.533e+02 8.207e+02 1.409e+03, threshold=1.107e+03, percent-clipped=13.0 2023-06-18 00:59:26,772 INFO [train.py:996] (3/4) Epoch 1, batch 14350, loss[loss=0.4446, simple_loss=0.4519, pruned_loss=0.2187, over 21643.00 frames. ], tot_loss[loss=0.3606, simple_loss=0.402, pruned_loss=0.1596, over 4245410.09 frames. 
], batch size: 471, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 00:59:47,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=86160.0, ans=0.0 2023-06-18 01:00:08,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=86220.0, ans=0.125 2023-06-18 01:00:58,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=6.0 2023-06-18 01:01:08,531 INFO [train.py:996] (3/4) Epoch 1, batch 14400, loss[loss=0.3926, simple_loss=0.4055, pruned_loss=0.1898, over 21782.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4016, pruned_loss=0.1622, over 4256633.13 frames. ], batch size: 351, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 01:01:15,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86400.0, ans=0.125 2023-06-18 01:01:39,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=86460.0, ans=0.125 2023-06-18 01:01:58,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=86520.0, ans=0.0 2023-06-18 01:02:08,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.871e+02 4.703e+02 5.738e+02 1.217e+03, threshold=9.407e+02, percent-clipped=2.0 2023-06-18 01:02:24,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=86580.0, ans=0.2 2023-06-18 01:02:50,413 INFO [train.py:996] (3/4) Epoch 1, batch 14450, loss[loss=0.3014, simple_loss=0.3373, pruned_loss=0.1328, over 21637.00 frames. ], tot_loss[loss=0.3597, simple_loss=0.3956, pruned_loss=0.1619, over 4265274.81 frames. ], batch size: 247, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:02:50,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=86700.0, ans=0.2 2023-06-18 01:02:52,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-18 01:02:53,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86700.0, ans=0.1 2023-06-18 01:04:10,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-18 01:04:33,706 INFO [train.py:996] (3/4) Epoch 1, batch 14500, loss[loss=0.355, simple_loss=0.3944, pruned_loss=0.1577, over 21882.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.3904, pruned_loss=0.1598, over 4259582.24 frames. 
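The grad_scale field (32.0 here, dipping to 16.0 around batch 14350 and recovering) is fp16 dynamic loss scaling in the style of torch.cuda.amp.GradScaler: the scale is halved whenever a step yields inf/nan gradients and grown back after enough clean steps. A generic sketch of that policy; the recipe's exact growth rule, including how often it doubles, may differ:

```python
class DynamicLossScaler:
    """Generic fp16 dynamic loss scaling, GradScaler-style (sketch)."""

    def __init__(self, scale: float = 32.0, growth_interval: int = 2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # overflow: skip the optimizer step and back off, 32 -> 16
            self.scale /= 2.0
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # a long run of finite gradients: try a larger scale
                self.scale *= 2.0
                self._good_steps = 0
```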
], batch size: 107, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:04:51,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=87000.0, ans=0.95 2023-06-18 01:05:31,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=87120.0, ans=0.125 2023-06-18 01:05:36,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.944e+02 5.039e+02 7.491e+02 1.788e+03, threshold=1.008e+03, percent-clipped=13.0 2023-06-18 01:05:41,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=87180.0, ans=0.125 2023-06-18 01:06:17,605 INFO [train.py:996] (3/4) Epoch 1, batch 14550, loss[loss=0.3883, simple_loss=0.4272, pruned_loss=0.1747, over 21202.00 frames. ], tot_loss[loss=0.3595, simple_loss=0.3964, pruned_loss=0.1613, over 4258752.87 frames. ], batch size: 143, lr: 3.05e-02, grad_scale: 16.0 2023-06-18 01:06:40,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=87360.0, ans=0.125 2023-06-18 01:06:47,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-18 01:07:20,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=87420.0, ans=0.125 2023-06-18 01:07:28,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87480.0, ans=0.125 2023-06-18 01:07:53,801 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:08:01,695 INFO [train.py:996] (3/4) Epoch 1, batch 14600, loss[loss=0.3993, simple_loss=0.4298, pruned_loss=0.1844, over 21370.00 frames. ], tot_loss[loss=0.3722, simple_loss=0.4077, pruned_loss=0.1684, over 4260996.04 frames. ], batch size: 159, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:08:13,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-18 01:08:22,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=87660.0, ans=0.0 2023-06-18 01:08:22,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=87660.0, ans=0.025 2023-06-18 01:09:02,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 4.017e+02 5.030e+02 6.430e+02 1.157e+03, threshold=1.006e+03, percent-clipped=2.0 2023-06-18 01:09:26,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=87840.0, ans=0.0 2023-06-18 01:09:35,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=87840.0, ans=0.125 2023-06-18 01:09:43,514 INFO [train.py:996] (3/4) Epoch 1, batch 14650, loss[loss=0.367, simple_loss=0.4225, pruned_loss=0.1557, over 19914.00 frames. ], tot_loss[loss=0.3711, simple_loss=0.4087, pruned_loss=0.1668, over 4254393.10 frames. 
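Many of the scheduled names belong to Balancer modules (balancer.prob, min_positive, max_positive, min_abs, max_abs): regularizers that hold simple per-channel activation statistics inside a target range; max_positive=0.95, for instance, caps the fraction of positive activations in a channel. The real modules act by perturbing gradients with probability prob; the sketch below instead states the constrained statistics as an explicit penalty, purely for clarity:

```python
import torch

def balancer_penalty(
    x: torch.Tensor,
    min_positive: float = 0.05,
    max_positive: float = 0.95,
    min_abs: float = 0.2,
    max_abs: float = 100.0,
) -> torch.Tensor:
    """Total violation of the Balancer-style constraints, per channel:
    the fraction of positive activations must lie in
    [min_positive, max_positive] and the mean |activation| in
    [min_abs, max_abs].  x: (num_frames, num_channels)."""
    pos_frac = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    p = (min_positive - pos_frac).clamp(min=0).sum()
    p = p + (pos_frac - max_positive).clamp(min=0).sum()
    p = p + (min_abs - mean_abs).clamp(min=0).sum()
    p = p + (mean_abs - max_abs).clamp(min=0).sum()
    return p

x = torch.randn(1000, 256)
print(balancer_penalty(x))  # ~0 for roughly balanced activations
```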
], batch size: 702, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:10:32,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=88020.0, ans=0.125 2023-06-18 01:10:48,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-18 01:11:16,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=88140.0, ans=0.0 2023-06-18 01:11:30,874 INFO [train.py:996] (3/4) Epoch 1, batch 14700, loss[loss=0.2964, simple_loss=0.3649, pruned_loss=0.114, over 21473.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.396, pruned_loss=0.1549, over 4254698.28 frames. ], batch size: 211, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:11:39,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=88200.0, ans=0.125 2023-06-18 01:11:41,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=88200.0, ans=0.2 2023-06-18 01:12:00,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88260.0, ans=0.0 2023-06-18 01:12:22,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-18 01:12:23,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=88320.0, ans=0.0 2023-06-18 01:12:32,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 3.580e+02 4.552e+02 5.267e+02 1.016e+03, threshold=9.103e+02, percent-clipped=1.0 2023-06-18 01:13:06,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=88440.0, ans=0.0 2023-06-18 01:13:14,657 INFO [train.py:996] (3/4) Epoch 1, batch 14750, loss[loss=0.4849, simple_loss=0.5086, pruned_loss=0.2306, over 21900.00 frames. ], tot_loss[loss=0.3641, simple_loss=0.4048, pruned_loss=0.1617, over 4259932.87 frames. ], batch size: 372, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:13:35,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=88560.0, ans=0.0 2023-06-18 01:14:24,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=88680.0, ans=0.125 2023-06-18 01:14:59,893 INFO [train.py:996] (3/4) Epoch 1, batch 14800, loss[loss=0.3515, simple_loss=0.3889, pruned_loss=0.1571, over 21199.00 frames. ], tot_loss[loss=0.3776, simple_loss=0.4169, pruned_loss=0.1692, over 4261657.07 frames. 
], batch size: 176, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 01:15:17,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88800.0, ans=0.1 2023-06-18 01:15:17,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=88800.0, ans=0.2 2023-06-18 01:15:42,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=88860.0, ans=0.0 2023-06-18 01:15:45,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=88860.0, ans=0.125 2023-06-18 01:15:52,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=88920.0, ans=0.125 2023-06-18 01:16:02,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 4.489e+02 5.229e+02 7.110e+02 1.407e+03, threshold=1.046e+03, percent-clipped=11.0 2023-06-18 01:16:04,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-18 01:16:55,418 INFO [train.py:996] (3/4) Epoch 1, batch 14850, loss[loss=0.3793, simple_loss=0.4044, pruned_loss=0.1771, over 21544.00 frames. ], tot_loss[loss=0.3742, simple_loss=0.4118, pruned_loss=0.1683, over 4257600.43 frames. ], batch size: 263, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:17:49,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89220.0, ans=0.125 2023-06-18 01:18:41,316 INFO [train.py:996] (3/4) Epoch 1, batch 14900, loss[loss=0.3978, simple_loss=0.4262, pruned_loss=0.1847, over 22027.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4165, pruned_loss=0.1716, over 4254548.61 frames. ], batch size: 317, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:19:05,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89460.0, ans=0.125 2023-06-18 01:19:23,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0 2023-06-18 01:19:25,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89520.0, ans=0.125 2023-06-18 01:19:25,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89520.0, ans=0.1 2023-06-18 01:19:34,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.060e+02 5.286e+02 6.323e+02 1.154e+03, threshold=1.057e+03, percent-clipped=2.0 2023-06-18 01:19:52,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=89580.0, ans=0.125 2023-06-18 01:20:09,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=89640.0, ans=0.125 2023-06-18 01:20:21,532 INFO [train.py:996] (3/4) Epoch 1, batch 14950, loss[loss=0.3384, simple_loss=0.3937, pruned_loss=0.1416, over 21299.00 frames. ], tot_loss[loss=0.378, simple_loss=0.4159, pruned_loss=0.17, over 4260516.64 frames. 
], batch size: 176, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:20:37,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89760.0, ans=0.1 2023-06-18 01:20:54,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=22.5 2023-06-18 01:20:57,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=89760.0, ans=0.0 2023-06-18 01:21:29,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89880.0, ans=0.1 2023-06-18 01:21:51,934 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:21:59,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=89940.0, ans=0.125 2023-06-18 01:22:04,786 INFO [train.py:996] (3/4) Epoch 1, batch 15000, loss[loss=0.3538, simple_loss=0.3925, pruned_loss=0.1576, over 21558.00 frames. ], tot_loss[loss=0.383, simple_loss=0.4196, pruned_loss=0.1732, over 4265320.43 frames. ], batch size: 194, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:22:04,786 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 01:22:23,156 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3215, simple_loss=0.4085, pruned_loss=0.1173, over 1796401.00 frames. 2023-06-18 01:22:23,157 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 01:22:42,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=90000.0, ans=0.015 2023-06-18 01:23:23,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=90120.0, ans=0.125 2023-06-18 01:23:25,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.992e+02 4.836e+02 5.829e+02 8.010e+02, threshold=9.672e+02, percent-clipped=0.0 2023-06-18 01:23:50,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=90240.0, ans=0.2 2023-06-18 01:24:12,193 INFO [train.py:996] (3/4) Epoch 1, batch 15050, loss[loss=0.4787, simple_loss=0.5166, pruned_loss=0.2204, over 21519.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.4195, pruned_loss=0.1736, over 4275495.67 frames. ], batch size: 471, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:24:17,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90300.0, ans=0.1 2023-06-18 01:24:25,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=90300.0, ans=0.125 2023-06-18 01:25:54,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=90600.0, ans=0.125 2023-06-18 01:25:55,090 INFO [train.py:996] (3/4) Epoch 1, batch 15100, loss[loss=0.316, simple_loss=0.357, pruned_loss=0.1375, over 16967.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.4191, pruned_loss=0.1718, over 4264529.76 frames. 
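At batch 15000 the run pauses for its periodic validation pass and reports peak device memory alongside. A sketch of such a hook, assuming a frame-weighted average of the same loss over the dev loader; compute_loss is a hypothetical helper standing in for the recipe's actual loss computation:

```python
import torch

def run_validation(model, valid_loader, compute_loss, logger) -> None:
    """Frame-weighted validation loss plus the peak-memory report."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch)  # hypothetical helper
            tot_loss += float(loss) * num_frames
            tot_frames += num_frames
    model.train()
    logger.info(f"validation: loss={tot_loss / tot_frames:.4g}, "
                f"over {tot_frames} frames")
    mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    logger.info(f"Maximum memory allocated so far is {mb}MB")
```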
], batch size: 60, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:26:40,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5 2023-06-18 01:26:50,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-18 01:26:50,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 4.036e+02 5.408e+02 6.449e+02 1.241e+03, threshold=1.082e+03, percent-clipped=5.0 2023-06-18 01:26:52,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=90780.0, ans=0.2 2023-06-18 01:27:01,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=90780.0, ans=0.035 2023-06-18 01:27:06,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90780.0, ans=0.1 2023-06-18 01:27:32,769 INFO [train.py:996] (3/4) Epoch 1, batch 15150, loss[loss=0.3254, simple_loss=0.3539, pruned_loss=0.1485, over 21624.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4168, pruned_loss=0.1732, over 4252083.99 frames. ], batch size: 231, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:28:09,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=90960.0, ans=0.125 2023-06-18 01:28:14,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=90960.0, ans=0.05 2023-06-18 01:29:01,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.21 vs. limit=22.5 2023-06-18 01:29:15,447 INFO [train.py:996] (3/4) Epoch 1, batch 15200, loss[loss=0.3143, simple_loss=0.3554, pruned_loss=0.1366, over 21779.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4075, pruned_loss=0.1665, over 4247926.51 frames. ], batch size: 112, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:30:01,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91320.0, ans=0.125 2023-06-18 01:30:15,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.952e+02 4.981e+02 6.119e+02 1.167e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-18 01:30:19,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-18 01:30:56,076 INFO [train.py:996] (3/4) Epoch 1, batch 15250, loss[loss=0.343, simple_loss=0.3704, pruned_loss=0.1578, over 21693.00 frames. ], tot_loss[loss=0.3632, simple_loss=0.3993, pruned_loss=0.1636, over 4245603.50 frames. ], batch size: 333, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:31:03,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-06-18 01:31:27,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=91560.0, ans=0.2 2023-06-18 01:32:07,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=91680.0, ans=0.125 2023-06-18 01:32:38,835 INFO [train.py:996] (3/4) Epoch 1, batch 15300, loss[loss=0.2926, simple_loss=0.3179, pruned_loss=0.1337, over 20843.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.403, pruned_loss=0.1681, over 4252623.68 frames. ], batch size: 609, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:33:46,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 4.254e+02 5.015e+02 5.905e+02 1.167e+03, threshold=1.003e+03, percent-clipped=1.0 2023-06-18 01:34:21,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-18 01:34:28,055 INFO [train.py:996] (3/4) Epoch 1, batch 15350, loss[loss=0.4178, simple_loss=0.4523, pruned_loss=0.1916, over 21891.00 frames. ], tot_loss[loss=0.376, simple_loss=0.4091, pruned_loss=0.1715, over 4258364.15 frames. ], batch size: 371, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:34:30,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=92100.0, ans=0.125 2023-06-18 01:35:27,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-18 01:35:37,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.02 vs. limit=22.5 2023-06-18 01:35:38,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=92280.0, ans=0.125 2023-06-18 01:36:04,743 INFO [train.py:996] (3/4) Epoch 1, batch 15400, loss[loss=0.3583, simple_loss=0.3993, pruned_loss=0.1587, over 21877.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.4078, pruned_loss=0.1685, over 4270782.33 frames. ], batch size: 351, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:36:46,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=92460.0, ans=0.0 2023-06-18 01:37:10,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.998e+02 4.934e+02 5.907e+02 9.449e+02, threshold=9.868e+02, percent-clipped=0.0 2023-06-18 01:37:18,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=15.0 2023-06-18 01:37:39,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.48 vs. limit=6.0 2023-06-18 01:37:46,495 INFO [train.py:996] (3/4) Epoch 1, batch 15450, loss[loss=0.3168, simple_loss=0.3904, pruned_loss=0.1215, over 21859.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.4051, pruned_loss=0.1661, over 4267315.04 frames. 
], batch size: 316, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:38:05,833 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:39:14,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=92940.0, ans=0.2 2023-06-18 01:39:26,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=92940.0, ans=0.125 2023-06-18 01:39:35,840 INFO [train.py:996] (3/4) Epoch 1, batch 15500, loss[loss=0.4383, simple_loss=0.4671, pruned_loss=0.2047, over 21746.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.4088, pruned_loss=0.1664, over 4254687.43 frames. ], batch size: 441, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:39:44,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93000.0, ans=0.1 2023-06-18 01:40:02,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-18 01:40:39,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.584e+02 4.777e+02 6.158e+02 1.272e+03, threshold=9.553e+02, percent-clipped=7.0 2023-06-18 01:41:10,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=93240.0, ans=0.125 2023-06-18 01:41:23,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=93300.0, ans=0.2 2023-06-18 01:41:30,353 INFO [train.py:996] (3/4) Epoch 1, batch 15550, loss[loss=0.346, simple_loss=0.3965, pruned_loss=0.1478, over 21739.00 frames. ], tot_loss[loss=0.3654, simple_loss=0.4061, pruned_loss=0.1624, over 4256949.63 frames. ], batch size: 298, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 01:41:34,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=93300.0, ans=0.125 2023-06-18 01:41:45,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=93360.0, ans=0.2 2023-06-18 01:42:12,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=93420.0, ans=0.2 2023-06-18 01:42:26,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=93480.0, ans=0.125 2023-06-18 01:42:29,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=93480.0, ans=0.025 2023-06-18 01:42:36,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=93480.0, ans=0.125 2023-06-18 01:42:57,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=93540.0, ans=0.125 2023-06-18 01:43:07,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=93600.0, ans=0.0 2023-06-18 01:43:13,864 INFO [train.py:996] (3/4) Epoch 1, batch 15600, loss[loss=0.3407, simple_loss=0.3619, pruned_loss=0.1597, over 21433.00 frames. ], tot_loss[loss=0.3623, simple_loss=0.4017, pruned_loss=0.1615, over 4257728.10 frames. 
], batch size: 212, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:43:19,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=93600.0, ans=0.125 2023-06-18 01:43:39,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=93660.0, ans=0.0 2023-06-18 01:43:59,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=93720.0, ans=0.05 2023-06-18 01:44:01,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-18 01:44:06,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.769e+02 4.800e+02 6.009e+02 1.224e+03, threshold=9.599e+02, percent-clipped=5.0 2023-06-18 01:44:06,856 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:44:06,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=93780.0, ans=0.125 2023-06-18 01:44:37,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=93840.0, ans=0.025 2023-06-18 01:44:38,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=93840.0, ans=0.2 2023-06-18 01:44:43,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=93840.0, ans=0.125 2023-06-18 01:44:55,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=93900.0, ans=0.125 2023-06-18 01:44:56,276 INFO [train.py:996] (3/4) Epoch 1, batch 15650, loss[loss=0.3507, simple_loss=0.3827, pruned_loss=0.1593, over 21653.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.4002, pruned_loss=0.1602, over 4252071.66 frames. ], batch size: 282, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:45:18,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=22.5 2023-06-18 01:45:28,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=93960.0, ans=0.0 2023-06-18 01:45:51,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=12.0 2023-06-18 01:45:54,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=94080.0, ans=0.0 2023-06-18 01:46:38,851 INFO [train.py:996] (3/4) Epoch 1, batch 15700, loss[loss=0.3107, simple_loss=0.3479, pruned_loss=0.1368, over 21409.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.3961, pruned_loss=0.1588, over 4256935.34 frames. 
], batch size: 131, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:47:01,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=94260.0, ans=0.0 2023-06-18 01:47:31,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.759e+02 5.241e+02 6.627e+02 1.144e+03, threshold=1.048e+03, percent-clipped=4.0 2023-06-18 01:48:01,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=94440.0, ans=0.2 2023-06-18 01:48:20,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-18 01:48:21,249 INFO [train.py:996] (3/4) Epoch 1, batch 15750, loss[loss=0.3445, simple_loss=0.3759, pruned_loss=0.1565, over 21712.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.3902, pruned_loss=0.1578, over 4240696.80 frames. ], batch size: 282, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:48:44,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=94560.0, ans=0.125 2023-06-18 01:50:03,621 INFO [train.py:996] (3/4) Epoch 1, batch 15800, loss[loss=0.3729, simple_loss=0.3969, pruned_loss=0.1745, over 21668.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3849, pruned_loss=0.1578, over 4235204.54 frames. ], batch size: 332, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:50:05,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=94800.0, ans=0.2 2023-06-18 01:50:05,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=94800.0, ans=0.2 2023-06-18 01:50:07,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=94800.0, ans=0.05 2023-06-18 01:50:22,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94800.0, ans=0.1 2023-06-18 01:50:41,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=94920.0, ans=0.125 2023-06-18 01:51:07,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.489e+02 4.310e+02 5.461e+02 1.002e+03, threshold=8.621e+02, percent-clipped=0.0 2023-06-18 01:51:33,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95040.0, ans=0.125 2023-06-18 01:51:47,032 INFO [train.py:996] (3/4) Epoch 1, batch 15850, loss[loss=0.328, simple_loss=0.3679, pruned_loss=0.144, over 21773.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3864, pruned_loss=0.1599, over 4236425.92 frames. 
], batch size: 124, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:52:01,001 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:52:10,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=95160.0, ans=0.125 2023-06-18 01:52:16,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=95160.0, ans=0.125 2023-06-18 01:52:45,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.67 vs. limit=6.0 2023-06-18 01:53:31,113 INFO [train.py:996] (3/4) Epoch 1, batch 15900, loss[loss=0.3304, simple_loss=0.3548, pruned_loss=0.153, over 21083.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.3838, pruned_loss=0.1596, over 4244802.41 frames. ], batch size: 143, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:53:59,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95460.0, ans=0.1 2023-06-18 01:54:07,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=95520.0, ans=0.125 2023-06-18 01:54:18,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=95520.0, ans=0.2 2023-06-18 01:54:28,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 4.278e+02 5.211e+02 7.153e+02 1.346e+03, threshold=1.042e+03, percent-clipped=13.0 2023-06-18 01:55:07,512 INFO [train.py:996] (3/4) Epoch 1, batch 15950, loss[loss=0.355, simple_loss=0.4001, pruned_loss=0.1549, over 21672.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3847, pruned_loss=0.1562, over 4250030.26 frames. ], batch size: 298, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:55:09,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=95700.0, ans=0.2 2023-06-18 01:56:20,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=95880.0, ans=0.0 2023-06-18 01:56:24,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=95880.0, ans=0.125 2023-06-18 01:56:47,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-18 01:56:54,417 INFO [train.py:996] (3/4) Epoch 1, batch 16000, loss[loss=0.3436, simple_loss=0.395, pruned_loss=0.1461, over 21829.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3828, pruned_loss=0.1506, over 4246217.98 frames. 
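
Each ScheduledFloat line above prints the current value (ans) of a hyperparameter evaluated at the module's batch_count; the balancer prob entries, for instance, all read ans=0.125 at this point in training. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with flat extrapolation beyond them (the breakpoints below are illustrative, not taken from the recipe):

import bisect

class PiecewiseLinear:
    # Assumption: illustrative breakpoints; only the interpolation rule matters.
    def __init__(self, *points):  # points: (batch_count, value), ascending
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        t = (batch_count - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ys[i] + t * (self.ys[i + 1] - self.ys[i])

prob = PiecewiseLinear((0.0, 0.5), (4000.0, 0.125))
assert prob(95160.0) == 0.125  # matches the ans=0.125 entries above
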
], batch size: 351, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:57:42,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96120.0, ans=0.125 2023-06-18 01:57:44,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=96120.0, ans=0.125 2023-06-18 01:57:50,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96120.0, ans=0.1 2023-06-18 01:57:52,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 3.630e+02 4.249e+02 5.497e+02 1.232e+03, threshold=8.498e+02, percent-clipped=2.0 2023-06-18 01:58:30,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96240.0, ans=0.1 2023-06-18 01:58:35,936 INFO [train.py:996] (3/4) Epoch 1, batch 16050, loss[loss=0.3017, simple_loss=0.351, pruned_loss=0.1262, over 21432.00 frames. ], tot_loss[loss=0.337, simple_loss=0.3825, pruned_loss=0.1457, over 4254809.73 frames. ], batch size: 131, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:59:00,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=96300.0, ans=0.0 2023-06-18 01:59:51,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=96480.0, ans=0.125 2023-06-18 02:00:04,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=96540.0, ans=0.0 2023-06-18 02:00:17,281 INFO [train.py:996] (3/4) Epoch 1, batch 16100, loss[loss=0.3553, simple_loss=0.3921, pruned_loss=0.1593, over 21520.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3885, pruned_loss=0.1489, over 4269018.05 frames. ], batch size: 548, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:00:20,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=96600.0, ans=0.125 2023-06-18 02:01:14,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.670e+02 4.719e+02 5.843e+02 1.104e+03, threshold=9.438e+02, percent-clipped=5.0 2023-06-18 02:01:53,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=96840.0, ans=0.125 2023-06-18 02:01:59,165 INFO [train.py:996] (3/4) Epoch 1, batch 16150, loss[loss=0.3549, simple_loss=0.399, pruned_loss=0.1553, over 21598.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3912, pruned_loss=0.1522, over 4278173.47 frames. ], batch size: 195, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:02:58,910 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:03:10,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97080.0, ans=0.1 2023-06-18 02:03:11,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-18 02:03:29,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=22.5 2023-06-18 02:03:45,787 INFO [train.py:996] (3/4) Epoch 1, batch 16200, loss[loss=0.4916, simple_loss=0.4893, pruned_loss=0.247, over 21353.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3967, pruned_loss=0.1561, over 4279216.53 frames. ], batch size: 507, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:03:47,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=97200.0, ans=0.0 2023-06-18 02:04:22,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.10 vs. limit=22.5 2023-06-18 02:04:49,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 4.045e+02 5.128e+02 6.271e+02 1.195e+03, threshold=1.026e+03, percent-clipped=3.0 2023-06-18 02:04:51,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-18 02:05:31,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=97440.0, ans=0.0 2023-06-18 02:05:34,460 INFO [train.py:996] (3/4) Epoch 1, batch 16250, loss[loss=0.2725, simple_loss=0.3237, pruned_loss=0.1107, over 21457.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3982, pruned_loss=0.1576, over 4277866.21 frames. ], batch size: 212, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:05:59,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=97560.0, ans=0.0 2023-06-18 02:06:01,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=97560.0, ans=0.125 2023-06-18 02:06:34,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97680.0, ans=0.125 2023-06-18 02:06:36,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97680.0, ans=0.1 2023-06-18 02:07:23,944 INFO [train.py:996] (3/4) Epoch 1, batch 16300, loss[loss=0.3333, simple_loss=0.3934, pruned_loss=0.1366, over 21217.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.3918, pruned_loss=0.1516, over 4274660.26 frames. ], batch size: 548, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:07:32,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=97800.0, ans=0.035 2023-06-18 02:08:02,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=97920.0, ans=0.125 2023-06-18 02:08:16,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=97920.0, ans=0.125 2023-06-18 02:08:17,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 3.435e+02 4.309e+02 5.263e+02 1.274e+03, threshold=8.618e+02, percent-clipped=4.0 2023-06-18 02:09:07,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=98100.0, ans=0.125 2023-06-18 02:09:08,874 INFO [train.py:996] (3/4) Epoch 1, batch 16350, loss[loss=0.4051, simple_loss=0.416, pruned_loss=0.1971, over 20193.00 frames. ], tot_loss[loss=0.348, simple_loss=0.391, pruned_loss=0.1525, over 4275956.63 frames. 
], batch size: 707, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:10:03,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=98220.0, ans=0.2 2023-06-18 02:10:51,184 INFO [train.py:996] (3/4) Epoch 1, batch 16400, loss[loss=0.3696, simple_loss=0.4138, pruned_loss=0.1627, over 21696.00 frames. ], tot_loss[loss=0.3537, simple_loss=0.3963, pruned_loss=0.1556, over 4273353.24 frames. ], batch size: 389, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:10:53,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98400.0, ans=0.1 2023-06-18 02:11:15,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=98460.0, ans=0.0 2023-06-18 02:11:24,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-18 02:11:47,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 3.743e+02 5.331e+02 6.637e+02 1.239e+03, threshold=1.066e+03, percent-clipped=10.0 2023-06-18 02:11:51,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98580.0, ans=0.0 2023-06-18 02:12:06,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=98640.0, ans=0.0 2023-06-18 02:12:08,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=98640.0, ans=0.125 2023-06-18 02:12:33,615 INFO [train.py:996] (3/4) Epoch 1, batch 16450, loss[loss=0.3631, simple_loss=0.4039, pruned_loss=0.1611, over 20674.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3955, pruned_loss=0.1562, over 4277000.81 frames. ], batch size: 607, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:12:47,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=98700.0, ans=0.2 2023-06-18 02:13:15,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=98820.0, ans=0.2 2023-06-18 02:13:27,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=98820.0, ans=0.125 2023-06-18 02:13:30,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=98820.0, ans=0.5 2023-06-18 02:13:51,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=98940.0, ans=0.125 2023-06-18 02:14:16,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=99000.0, ans=10.0 2023-06-18 02:14:17,006 INFO [train.py:996] (3/4) Epoch 1, batch 16500, loss[loss=0.2777, simple_loss=0.3244, pruned_loss=0.1155, over 21746.00 frames. ], tot_loss[loss=0.353, simple_loss=0.3935, pruned_loss=0.1562, over 4279366.85 frames. ], batch size: 247, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:14:19,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
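
Note how "batch size" swings between 247 and 707 in the entries above: batches are packed to a roughly constant total duration of audio rather than a fixed number of cuts, so many short utterances or a few long ones fill one batch. A toy sketch of that packing rule, assuming a fixed max-duration budget per batch (the real sampler additionally buckets cuts by length before packing):

from typing import Iterable, Iterator, List

def pack_by_duration(durations: Iterable[float],
                     max_duration: float) -> Iterator[List[float]]:
    # One batch = as many cuts as fit under max_duration seconds.
    batch: List[float] = []
    total = 0.0
    for d in durations:
        if batch and total + d > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        yield batch

# Short cuts give large batches, long cuts give small ones:
for b in pack_by_duration([1.2] * 10 + [30.0] * 2, max_duration=12.0):
    print(len(b), round(sum(b), 1))   # -> 10 12.0, then 1 30.0, 1 30.0
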
limit=22.5 2023-06-18 02:14:19,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-18 02:14:36,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-18 02:14:40,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=99060.0, ans=0.125 2023-06-18 02:14:57,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=99120.0, ans=0.0 2023-06-18 02:15:16,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.027e+02 4.822e+02 5.863e+02 1.078e+03, threshold=9.645e+02, percent-clipped=1.0 2023-06-18 02:15:31,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=99180.0, ans=0.0 2023-06-18 02:16:01,035 INFO [train.py:996] (3/4) Epoch 1, batch 16550, loss[loss=0.3129, simple_loss=0.3823, pruned_loss=0.1218, over 21879.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3874, pruned_loss=0.1507, over 4279492.54 frames. ], batch size: 316, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:16:50,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=99420.0, ans=0.0 2023-06-18 02:17:24,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=99480.0, ans=0.0 2023-06-18 02:17:45,744 INFO [train.py:996] (3/4) Epoch 1, batch 16600, loss[loss=0.3904, simple_loss=0.4476, pruned_loss=0.1666, over 21617.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.3985, pruned_loss=0.1564, over 4265312.35 frames. ], batch size: 263, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:17:46,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=99600.0, ans=0.0 2023-06-18 02:18:04,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=99600.0, ans=0.125 2023-06-18 02:18:34,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=99720.0, ans=0.0 2023-06-18 02:18:55,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.830e+02 4.483e+02 5.513e+02 7.810e+02 1.353e+03, threshold=1.103e+03, percent-clipped=10.0 2023-06-18 02:19:10,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=99780.0, ans=0.125 2023-06-18 02:19:38,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=99840.0, ans=0.07 2023-06-18 02:19:40,962 INFO [train.py:996] (3/4) Epoch 1, batch 16650, loss[loss=0.4551, simple_loss=0.4734, pruned_loss=0.2184, over 21455.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.4129, pruned_loss=0.1629, over 4266625.11 frames. ], batch size: 471, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:20:38,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. 
limit=15.0 2023-06-18 02:21:01,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-18 02:21:03,258 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:21:28,207 INFO [train.py:996] (3/4) Epoch 1, batch 16700, loss[loss=0.3691, simple_loss=0.4174, pruned_loss=0.1604, over 21886.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.4131, pruned_loss=0.1633, over 4265046.89 frames. ], batch size: 372, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:21:35,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=100200.0, ans=0.2 2023-06-18 02:22:34,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 4.065e+02 5.046e+02 6.706e+02 1.129e+03, threshold=1.009e+03, percent-clipped=1.0 2023-06-18 02:23:27,851 INFO [train.py:996] (3/4) Epoch 1, batch 16750, loss[loss=0.3999, simple_loss=0.4195, pruned_loss=0.1902, over 21376.00 frames. ], tot_loss[loss=0.374, simple_loss=0.4153, pruned_loss=0.1663, over 4266798.89 frames. ], batch size: 549, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:23:38,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=100500.0, ans=0.125 2023-06-18 02:23:44,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100560.0, ans=0.1 2023-06-18 02:24:17,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100620.0, ans=0.1 2023-06-18 02:25:08,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=100740.0, ans=0.0 2023-06-18 02:25:11,907 INFO [train.py:996] (3/4) Epoch 1, batch 16800, loss[loss=0.3726, simple_loss=0.407, pruned_loss=0.1691, over 21864.00 frames. ], tot_loss[loss=0.3808, simple_loss=0.4252, pruned_loss=0.1682, over 4263044.33 frames. ], batch size: 351, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:25:42,569 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:26:08,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.332e+02 5.464e+02 7.061e+02 1.204e+03, threshold=1.093e+03, percent-clipped=8.0 2023-06-18 02:26:44,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-18 02:26:44,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-18 02:26:44,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=101040.0, ans=15.0 2023-06-18 02:26:54,559 INFO [train.py:996] (3/4) Epoch 1, batch 16850, loss[loss=0.4469, simple_loss=0.4451, pruned_loss=0.2243, over 21658.00 frames. ], tot_loss[loss=0.3793, simple_loss=0.4215, pruned_loss=0.1685, over 4271271.68 frames. 
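
The lr column above decays smoothly (2.88e-02 at batch 16700, 2.87e-02 by batch 16800), consistent with a power-law decay in the step index. A hedged reconstruction, with base_lr=0.045 and lr_batches=7500 chosen because they reproduce the logged values, and the schedule's epoch-dependent factor (close to 1.0 in the first epoch) omitted:

def eden_lr(step: int, base_lr: float = 0.045,
            lr_batches: float = 7500.0) -> float:
    # Assumption: Eden-style decay; constants fitted to the logged lr values.
    return base_lr * ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25

print(round(eden_lr(16700), 4))   # 0.0288, matching "lr: 2.88e-02" above
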
], batch size: 471, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:28:15,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=101280.0, ans=0.125 2023-06-18 02:28:37,448 INFO [train.py:996] (3/4) Epoch 1, batch 16900, loss[loss=0.3182, simple_loss=0.3559, pruned_loss=0.1403, over 21677.00 frames. ], tot_loss[loss=0.3717, simple_loss=0.4131, pruned_loss=0.1651, over 4276770.49 frames. ], batch size: 298, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:28:43,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.66 vs. limit=15.0 2023-06-18 02:29:22,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-18 02:29:39,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.564e+02 4.278e+02 5.386e+02 1.254e+03, threshold=8.556e+02, percent-clipped=1.0 2023-06-18 02:30:00,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=101580.0, ans=0.09899494936611666 2023-06-18 02:30:19,191 INFO [train.py:996] (3/4) Epoch 1, batch 16950, loss[loss=0.3926, simple_loss=0.405, pruned_loss=0.1901, over 21752.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.4039, pruned_loss=0.1619, over 4278836.75 frames. ], batch size: 508, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:30:23,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2023-06-18 02:30:47,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101760.0, ans=0.1 2023-06-18 02:31:51,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=101940.0, ans=0.125 2023-06-18 02:31:59,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=102000.0, ans=0.125 2023-06-18 02:32:00,686 INFO [train.py:996] (3/4) Epoch 1, batch 17000, loss[loss=0.425, simple_loss=0.4381, pruned_loss=0.2059, over 21798.00 frames. ], tot_loss[loss=0.3626, simple_loss=0.4002, pruned_loss=0.1625, over 4286280.16 frames. ], batch size: 441, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:33:09,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.976e+02 5.576e+02 8.538e+02 1.340e+03, threshold=1.115e+03, percent-clipped=23.0 2023-06-18 02:33:23,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-06-18 02:33:44,509 INFO [train.py:996] (3/4) Epoch 1, batch 17050, loss[loss=0.4047, simple_loss=0.4514, pruned_loss=0.1791, over 21875.00 frames. ], tot_loss[loss=0.3704, simple_loss=0.4068, pruned_loss=0.167, over 4281989.37 frames. 
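
Each Whitening line above compares a measured statistic against a (scheduled) limit; the metric is 1.0 when the per-group feature covariance is proportional to the identity and grows as the eigenvalue spectrum becomes lopsided, and only modules whose metric exceeds the limit (metric=15.66 vs. limit=15.0 above) get pushed back toward whiter features. A sketch of one plausible form of that statistic, assuming metric = dim * tr(C @ C) / tr(C)**2, which is exactly 1.0 for white features:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels), treated as a single group.
    # Assumption: this definition is one plausible form of the logged metric.
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]           # (C, C) covariance
    dim = cov.shape[0]
    return (dim * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

white = torch.randn(10000, 256)            # ~isotropic features
print(whitening_metric(white))             # ~1.0
print(whitening_metric(white * torch.linspace(0.1, 3.0, 256)))  # > 1.0
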
], batch size: 316, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:34:10,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=102360.0, ans=0.0 2023-06-18 02:34:10,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=102360.0, ans=0.125 2023-06-18 02:35:23,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-18 02:35:25,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=102600.0, ans=0.0 2023-06-18 02:35:26,137 INFO [train.py:996] (3/4) Epoch 1, batch 17100, loss[loss=0.3142, simple_loss=0.3558, pruned_loss=0.1364, over 21563.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.4069, pruned_loss=0.1673, over 4281408.81 frames. ], batch size: 195, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:35:32,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102600.0, ans=0.125 2023-06-18 02:35:43,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=102600.0, ans=0.125 2023-06-18 02:35:59,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=102660.0, ans=0.2 2023-06-18 02:36:09,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=102720.0, ans=0.125 2023-06-18 02:36:28,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.900e+02 4.752e+02 6.955e+02 1.664e+03, threshold=9.503e+02, percent-clipped=6.0 2023-06-18 02:36:47,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.30 vs. limit=10.0 2023-06-18 02:37:03,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102840.0, ans=0.1 2023-06-18 02:37:07,579 INFO [train.py:996] (3/4) Epoch 1, batch 17150, loss[loss=0.2937, simple_loss=0.3531, pruned_loss=0.1171, over 21872.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.4014, pruned_loss=0.1652, over 4277459.89 frames. ], batch size: 371, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:38:25,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-18 02:38:36,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=8.0 2023-06-18 02:38:52,562 INFO [train.py:996] (3/4) Epoch 1, batch 17200, loss[loss=0.4936, simple_loss=0.4876, pruned_loss=0.2498, over 21333.00 frames. ], tot_loss[loss=0.3651, simple_loss=0.4013, pruned_loss=0.1645, over 4278221.37 frames. 
], batch size: 507, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:39:12,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=103260.0, ans=0.035 2023-06-18 02:39:45,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.944e+02 4.838e+02 6.225e+02 9.968e+02, threshold=9.676e+02, percent-clipped=1.0 2023-06-18 02:39:46,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103380.0, ans=0.1 2023-06-18 02:40:31,372 INFO [train.py:996] (3/4) Epoch 1, batch 17250, loss[loss=0.3477, simple_loss=0.4173, pruned_loss=0.1391, over 16946.00 frames. ], tot_loss[loss=0.3689, simple_loss=0.4054, pruned_loss=0.1662, over 4265094.16 frames. ], batch size: 60, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:40:32,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=103500.0, ans=0.2 2023-06-18 02:42:10,760 INFO [train.py:996] (3/4) Epoch 1, batch 17300, loss[loss=0.4228, simple_loss=0.4459, pruned_loss=0.1998, over 21493.00 frames. ], tot_loss[loss=0.3775, simple_loss=0.4139, pruned_loss=0.1706, over 4266047.32 frames. ], batch size: 131, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:42:11,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=103800.0, ans=0.125 2023-06-18 02:42:37,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=103860.0, ans=0.125 2023-06-18 02:42:46,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-18 02:43:07,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=103920.0, ans=0.125 2023-06-18 02:43:17,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.798e+02 4.894e+02 6.344e+02 1.044e+03, threshold=9.789e+02, percent-clipped=2.0 2023-06-18 02:44:00,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=104100.0, ans=0.2 2023-06-18 02:44:02,183 INFO [train.py:996] (3/4) Epoch 1, batch 17350, loss[loss=0.3435, simple_loss=0.425, pruned_loss=0.131, over 21239.00 frames. ], tot_loss[loss=0.376, simple_loss=0.4132, pruned_loss=0.1693, over 4260308.65 frames. ], batch size: 548, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:44:09,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. 
limit=22.5 2023-06-18 02:44:46,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=104220.0, ans=0.125 2023-06-18 02:44:55,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=104220.0, ans=0.125 2023-06-18 02:45:14,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=104280.0, ans=0.0 2023-06-18 02:45:45,835 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:45:47,344 INFO [train.py:996] (3/4) Epoch 1, batch 17400, loss[loss=0.2832, simple_loss=0.3127, pruned_loss=0.1269, over 21228.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.4094, pruned_loss=0.1649, over 4262627.75 frames. ], batch size: 143, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:46:14,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=104460.0, ans=0.04949747468305833 2023-06-18 02:46:31,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=104520.0, ans=0.0 2023-06-18 02:46:45,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=104580.0, ans=0.0 2023-06-18 02:46:47,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.665e+02 5.087e+02 7.278e+02 1.204e+03, threshold=1.017e+03, percent-clipped=4.0 2023-06-18 02:47:03,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104580.0, ans=0.1 2023-06-18 02:47:16,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=104640.0, ans=0.0 2023-06-18 02:47:30,809 INFO [train.py:996] (3/4) Epoch 1, batch 17450, loss[loss=0.285, simple_loss=0.3598, pruned_loss=0.1051, over 21737.00 frames. ], tot_loss[loss=0.3593, simple_loss=0.4022, pruned_loss=0.1582, over 4268920.69 frames. ], batch size: 351, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:47:41,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=104700.0, ans=0.125 2023-06-18 02:47:52,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=104760.0, ans=0.0 2023-06-18 02:48:02,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-18 02:48:39,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=104880.0, ans=0.125 2023-06-18 02:49:07,390 INFO [train.py:996] (3/4) Epoch 1, batch 17500, loss[loss=0.3485, simple_loss=0.3899, pruned_loss=0.1535, over 21465.00 frames. ], tot_loss[loss=0.3507, simple_loss=0.3957, pruned_loss=0.1528, over 4269405.15 frames. ], batch size: 548, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:49:15,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.43 vs. 
limit=15.0 2023-06-18 02:49:16,169 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.832e-02 2023-06-18 02:50:08,542 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:50:08,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=105120.0, ans=0.125 2023-06-18 02:50:12,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 3.016e+02 3.973e+02 5.519e+02 1.327e+03, threshold=7.947e+02, percent-clipped=4.0 2023-06-18 02:50:17,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=105180.0, ans=0.125 2023-06-18 02:50:49,638 INFO [train.py:996] (3/4) Epoch 1, batch 17550, loss[loss=0.3193, simple_loss=0.3841, pruned_loss=0.1272, over 21272.00 frames. ], tot_loss[loss=0.3483, simple_loss=0.3951, pruned_loss=0.1508, over 4271387.65 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:51:17,898 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.075e-03 2023-06-18 02:51:23,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=105360.0, ans=0.125 2023-06-18 02:51:56,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=105480.0, ans=0.035 2023-06-18 02:52:08,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=105480.0, ans=0.1 2023-06-18 02:52:24,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105540.0, ans=0.1 2023-06-18 02:52:32,332 INFO [train.py:996] (3/4) Epoch 1, batch 17600, loss[loss=0.3273, simple_loss=0.3834, pruned_loss=0.1356, over 21281.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3955, pruned_loss=0.15, over 4272513.77 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 02:52:50,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=105600.0, ans=0.125 2023-06-18 02:53:33,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=105720.0, ans=0.0 2023-06-18 02:53:35,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105780.0, ans=0.1 2023-06-18 02:53:36,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.619e+02 4.865e+02 6.950e+02 1.496e+03, threshold=9.730e+02, percent-clipped=22.0 2023-06-18 02:54:03,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=105840.0, ans=0.0 2023-06-18 02:54:19,853 INFO [train.py:996] (3/4) Epoch 1, batch 17650, loss[loss=0.254, simple_loss=0.3007, pruned_loss=0.1036, over 21578.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3931, pruned_loss=0.1508, over 4263658.63 frames. 
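
The grad_scale column is the mixed-precision loss scale: it halved from 32.0 to 16.0 at batch 17350, then doubled back to 32.0 by batch 17600 above. That is the usual dynamic-loss-scaling rule: halve whenever a step produces non-finite gradients, double again after a run of healthy steps. A minimal sketch, with the growth interval of 250 steps illustrative only (chosen to match the spacing seen here):

class ToyLossScaler:
    # Assumption: growth_interval=250 is illustrative; only the
    # halve-on-overflow / double-after-N-good-steps dynamic matters.
    def __init__(self, scale: float = 32.0, growth_interval: int = 250):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, grads_finite: bool) -> float:
        if not grads_finite:
            self.scale *= 0.5          # overflow: skip the update, halve
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= 2.0      # sustained healthy steps: grow again
                self._good_steps = 0
        return self.scale

s = ToyLossScaler()
print(s.step(grads_finite=False))              # 16.0, as at batch 17350
print([s.step(True) for _ in range(250)][-1])  # 32.0, as by batch 17600
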
], batch size: 230, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:54:25,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=105900.0, ans=0.0 2023-06-18 02:54:59,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=106020.0, ans=0.2 2023-06-18 02:55:21,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106080.0, ans=0.1 2023-06-18 02:56:02,787 INFO [train.py:996] (3/4) Epoch 1, batch 17700, loss[loss=0.3748, simple_loss=0.4276, pruned_loss=0.161, over 20788.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.384, pruned_loss=0.1454, over 4269181.70 frames. ], batch size: 609, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:56:03,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=106200.0, ans=0.2 2023-06-18 02:56:13,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=106200.0, ans=0.2 2023-06-18 02:56:19,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=106260.0, ans=0.125 2023-06-18 02:56:43,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=106260.0, ans=0.125 2023-06-18 02:56:44,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.30 vs. limit=22.5 2023-06-18 02:56:54,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=106320.0, ans=0.2 2023-06-18 02:57:07,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.749e+02 4.413e+02 5.536e+02 1.027e+03, threshold=8.827e+02, percent-clipped=1.0 2023-06-18 02:57:18,175 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:57:45,696 INFO [train.py:996] (3/4) Epoch 1, batch 17750, loss[loss=0.4094, simple_loss=0.4432, pruned_loss=0.1878, over 21722.00 frames. ], tot_loss[loss=0.3509, simple_loss=0.3961, pruned_loss=0.1528, over 4272510.67 frames. ], batch size: 298, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:57:46,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=106500.0, ans=0.125 2023-06-18 02:57:55,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=106500.0, ans=0.125 2023-06-18 02:58:45,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=106620.0, ans=0.0 2023-06-18 02:58:59,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-18 02:59:30,613 INFO [train.py:996] (3/4) Epoch 1, batch 17800, loss[loss=0.36, simple_loss=0.4052, pruned_loss=0.1574, over 21859.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3964, pruned_loss=0.1529, over 4271827.82 frames. 
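
The WithLoss lines report auxiliary penalties attached directly to intermediate tensors such as attention weights; loss-sum is 0.000e+00 most of the time and only turns positive (1.832e-02 and 5.075e-03 in the entries further above) when the tensor drifts outside its allowed range. One standard way to implement this is an identity op whose backward pass also injects the gradient of a penalty term; a hedged sketch of that pattern, where the specific penalty (mean excess above a cap) is an assumption:

import torch

class WithPenalty(torch.autograd.Function):
    # Identity in the forward pass; backward adds the gradient of an
    # auxiliary penalty. Assumption: the penalty form below is illustrative.
    @staticmethod
    def forward(ctx, x: torch.Tensor, cap: float):
        ctx.save_for_backward(x)
        ctx.cap = cap
        return x.clone()                         # downstream math unchanged

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        (x,) = ctx.saved_tensors
        # d/dx of mean(clamp(x - cap, min=0)): 1/N where x > cap, else 0.
        penalty_grad = (x > ctx.cap).float() / x.numel()
        return grad_out + penalty_grad, None

x = 3.0 * torch.randn(4, 8, requires_grad=True)
y = WithPenalty.apply(x, 2.0)
loss_sum = (x.detach() - 2.0).clamp(min=0.0).mean()   # the value being logged
print(f"loss-sum={loss_sum:.3e}")
y.sum().backward()               # gradients now include the penalty term
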
], batch size: 372, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 02:59:50,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=106800.0, ans=0.125 2023-06-18 03:00:01,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106860.0, ans=0.1 2023-06-18 03:00:01,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=106860.0, ans=0.2 2023-06-18 03:00:10,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=32.14 vs. limit=15.0 2023-06-18 03:00:21,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=106920.0, ans=0.025 2023-06-18 03:00:41,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.616e+02 4.998e+02 5.902e+02 1.082e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-18 03:00:43,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=106980.0, ans=0.125 2023-06-18 03:01:00,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=107040.0, ans=0.0 2023-06-18 03:01:01,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=107040.0, ans=0.025 2023-06-18 03:01:01,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=107040.0, ans=0.5 2023-06-18 03:01:25,747 INFO [train.py:996] (3/4) Epoch 1, batch 17850, loss[loss=0.3652, simple_loss=0.399, pruned_loss=0.1657, over 21816.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.3947, pruned_loss=0.1519, over 4273474.12 frames. ], batch size: 282, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:01:43,179 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:01:51,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-18 03:02:00,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=107160.0, ans=0.0 2023-06-18 03:02:06,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=107220.0, ans=0.125 2023-06-18 03:02:08,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107220.0, ans=0.125 2023-06-18 03:02:32,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107280.0, ans=0.1 2023-06-18 03:03:08,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=107340.0, ans=0.0 2023-06-18 03:03:11,351 INFO [train.py:996] (3/4) Epoch 1, batch 17900, loss[loss=0.3383, simple_loss=0.405, pruned_loss=0.1358, over 21640.00 frames. ], tot_loss[loss=0.3583, simple_loss=0.4034, pruned_loss=0.1566, over 4277479.84 frames. 
], batch size: 263, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:03:26,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=107460.0, ans=0.0 2023-06-18 03:04:10,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 4.154e+02 5.194e+02 6.786e+02 1.159e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-18 03:04:53,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=107700.0, ans=0.07 2023-06-18 03:04:55,237 INFO [train.py:996] (3/4) Epoch 1, batch 17950, loss[loss=0.2228, simple_loss=0.2688, pruned_loss=0.0884, over 15945.00 frames. ], tot_loss[loss=0.3516, simple_loss=0.401, pruned_loss=0.1511, over 4269595.31 frames. ], batch size: 60, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:05:40,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-18 03:05:51,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107820.0, ans=0.1 2023-06-18 03:06:38,438 INFO [train.py:996] (3/4) Epoch 1, batch 18000, loss[loss=0.4121, simple_loss=0.3977, pruned_loss=0.2133, over 21400.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.3944, pruned_loss=0.1503, over 4265522.78 frames. ], batch size: 509, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:06:38,439 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 03:06:50,862 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9849, 3.2708, 1.6564, 1.9006], device='cuda:3') 2023-06-18 03:06:57,891 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3324, simple_loss=0.4216, pruned_loss=0.1216, over 1796401.00 frames. 2023-06-18 03:06:57,892 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 03:07:48,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=108120.0, ans=0.0 2023-06-18 03:08:03,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.501e+02 4.751e+02 6.240e+02 1.819e+03, threshold=9.502e+02, percent-clipped=6.0 2023-06-18 03:08:22,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.86 vs. limit=22.5 2023-06-18 03:08:33,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=108240.0, ans=0.125 2023-06-18 03:08:41,368 INFO [train.py:996] (3/4) Epoch 1, batch 18050, loss[loss=0.3447, simple_loss=0.3871, pruned_loss=0.1512, over 21403.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.3883, pruned_loss=0.1494, over 4260860.84 frames. ], batch size: 131, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:08:59,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. 
limit=15.0 2023-06-18 03:09:31,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108420.0, ans=0.1 2023-06-18 03:09:54,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108480.0, ans=0.125 2023-06-18 03:10:15,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=108540.0, ans=15.0 2023-06-18 03:10:33,497 INFO [train.py:996] (3/4) Epoch 1, batch 18100, loss[loss=0.3352, simple_loss=0.4066, pruned_loss=0.1319, over 21710.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.3976, pruned_loss=0.1556, over 4269021.17 frames. ], batch size: 298, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:10:52,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.81 vs. limit=15.0 2023-06-18 03:10:53,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=108660.0, ans=0.0 2023-06-18 03:11:04,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108660.0, ans=0.125 2023-06-18 03:11:24,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108720.0, ans=0.1 2023-06-18 03:11:33,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 4.148e+02 5.231e+02 6.853e+02 1.250e+03, threshold=1.046e+03, percent-clipped=5.0 2023-06-18 03:12:07,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=108840.0, ans=0.02 2023-06-18 03:12:17,527 INFO [train.py:996] (3/4) Epoch 1, batch 18150, loss[loss=0.3168, simple_loss=0.3399, pruned_loss=0.1469, over 15639.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3988, pruned_loss=0.1547, over 4264162.10 frames. ], batch size: 60, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:12:32,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=15.0 2023-06-18 03:12:35,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108960.0, ans=0.125 2023-06-18 03:13:03,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-18 03:13:11,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.72 vs. limit=10.0 2023-06-18 03:13:27,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109080.0, ans=0.1 2023-06-18 03:13:34,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-18 03:13:35,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=109080.0, ans=0.0 2023-06-18 03:13:56,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. 
limit=10.0 2023-06-18 03:13:58,919 INFO [train.py:996] (3/4) Epoch 1, batch 18200, loss[loss=0.3168, simple_loss=0.3497, pruned_loss=0.142, over 21861.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.3912, pruned_loss=0.1536, over 4265254.04 frames. ], batch size: 107, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:14:04,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-18 03:14:56,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 3.703e+02 5.000e+02 6.238e+02 9.945e+02, threshold=1.000e+03, percent-clipped=0.0 2023-06-18 03:15:29,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109440.0, ans=0.125 2023-06-18 03:15:29,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. limit=10.0 2023-06-18 03:15:31,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-18 03:15:33,205 INFO [train.py:996] (3/4) Epoch 1, batch 18250, loss[loss=0.2918, simple_loss=0.3356, pruned_loss=0.124, over 21430.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3791, pruned_loss=0.147, over 4262794.64 frames. ], batch size: 194, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:15:43,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=109500.0, ans=0.0 2023-06-18 03:15:43,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=109500.0, ans=0.125 2023-06-18 03:15:44,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=12.0 2023-06-18 03:17:09,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-18 03:17:17,158 INFO [train.py:996] (3/4) Epoch 1, batch 18300, loss[loss=0.4022, simple_loss=0.4308, pruned_loss=0.1868, over 21846.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3789, pruned_loss=0.1466, over 4261473.76 frames. ], batch size: 351, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:18:21,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.385e+02 4.237e+02 5.760e+02 9.388e+02, threshold=8.473e+02, percent-clipped=0.0 2023-06-18 03:18:31,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=22.5 2023-06-18 03:18:47,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=110040.0, ans=0.025 2023-06-18 03:18:49,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-18 03:19:00,049 INFO [train.py:996] (3/4) Epoch 1, batch 18350, loss[loss=0.3034, simple_loss=0.3477, pruned_loss=0.1296, over 20805.00 frames. ], tot_loss[loss=0.3408, simple_loss=0.3881, pruned_loss=0.1468, over 4254076.66 frames. 
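
During the validation pass above (batch 18000) the log also prints an attention-weights entropy tensor (tensor([3.9849, 3.2708, 1.6564, 1.9006])) and the peak CUDA memory (24414MB); note the validation loss obeys the same combination rule as the training loss (0.5 * 0.4216 + 0.1216 = 0.3324). A sketch of the two extra diagnostics, assuming the entropy is the mean Shannon entropy of each head's attention distribution (how the four logged values are grouped is not recoverable from the log):

import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, num_queries, num_keys); each row sums to 1.
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)   # (heads, queries)
    return ent.mean(dim=-1)                            # one value per head

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))      # 4 entropies, like the tensor above

if torch.cuda.is_available():
    mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
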
], batch size: 608, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:19:02,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=110100.0, ans=0.2 2023-06-18 03:19:22,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=110160.0, ans=0.0 2023-06-18 03:19:37,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=110220.0, ans=0.125 2023-06-18 03:20:04,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110280.0, ans=0.125 2023-06-18 03:20:06,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-18 03:20:09,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=110280.0, ans=0.125 2023-06-18 03:20:34,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=110340.0, ans=0.2 2023-06-18 03:20:44,038 INFO [train.py:996] (3/4) Epoch 1, batch 18400, loss[loss=0.301, simple_loss=0.34, pruned_loss=0.131, over 21154.00 frames. ], tot_loss[loss=0.3383, simple_loss=0.3833, pruned_loss=0.1467, over 4252147.73 frames. ], batch size: 143, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:20:44,546 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:20:45,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-18 03:21:01,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=110400.0, ans=0.0 2023-06-18 03:21:14,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-18 03:21:19,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=110460.0, ans=0.07 2023-06-18 03:21:48,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.132e+02 3.866e+02 5.099e+02 1.496e+03, threshold=7.733e+02, percent-clipped=6.0 2023-06-18 03:22:17,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=110640.0, ans=0.2 2023-06-18 03:22:17,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.16 vs. limit=6.0 2023-06-18 03:22:31,353 INFO [train.py:996] (3/4) Epoch 1, batch 18450, loss[loss=0.3084, simple_loss=0.3661, pruned_loss=0.1254, over 21534.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3784, pruned_loss=0.1404, over 4254316.51 frames. ], batch size: 212, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:22:39,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=110700.0, ans=0.0 2023-06-18 03:22:43,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.62 vs. 
limit=15.0 2023-06-18 03:23:02,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=110820.0, ans=0.125 2023-06-18 03:23:16,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-18 03:23:55,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110940.0, ans=0.125 2023-06-18 03:24:04,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-18 03:24:06,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-18 03:24:10,264 INFO [train.py:996] (3/4) Epoch 1, batch 18500, loss[loss=0.3018, simple_loss=0.3575, pruned_loss=0.1231, over 21674.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3746, pruned_loss=0.1407, over 4259449.90 frames. ], batch size: 247, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:24:22,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-18 03:24:25,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111000.0, ans=0.125 2023-06-18 03:24:27,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=111000.0, ans=0.0 2023-06-18 03:25:14,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 3.536e+02 4.866e+02 6.375e+02 1.291e+03, threshold=9.732e+02, percent-clipped=16.0 2023-06-18 03:25:44,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=111240.0, ans=0.125 2023-06-18 03:25:52,420 INFO [train.py:996] (3/4) Epoch 1, batch 18550, loss[loss=0.2697, simple_loss=0.3137, pruned_loss=0.1129, over 21421.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3721, pruned_loss=0.14, over 4260502.18 frames. ], batch size: 194, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:26:00,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-06-18 03:27:02,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-18 03:27:44,749 INFO [train.py:996] (3/4) Epoch 1, batch 18600, loss[loss=0.344, simple_loss=0.3963, pruned_loss=0.1459, over 21890.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3697, pruned_loss=0.141, over 4244654.50 frames. 
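
Each batch line pairs a per-batch loss over its own frames with a tot_loss over a near-constant ~4.25e6 frames: loss statistics are kept as frame-weighted running sums and reported as their ratio, so tot_loss averages a recent stretch of training rather than one batch. A minimal sketch, assuming old statistics decay by (1 - 1/interval) per batch, which yields a steady-state window of about interval * frames_per_batch frames (the interval below is chosen to match the ~4.25e6 figure and is an assumption):

class RunningLoss:
    # Assumption: exponential decay of (loss_sum, frame_count) statistics;
    # interval=200 with ~21k frames/batch gives the ~4.2e6-frame window seen.
    def __init__(self, interval: int = 200):
        self.decay = 1.0 - 1.0 / interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / self.frames

r = RunningLoss()
for _ in range(2000):
    r.update(0.33, 21000.0)
print(round(r.tot_loss, 2), int(r.frames))   # 0.33 over ~4.2e6 frames
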
], batch size: 373, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:27:47,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=111600.0, ans=12.0 2023-06-18 03:28:38,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111720.0, ans=0.125 2023-06-18 03:28:44,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.656e+02 4.245e+02 5.529e+02 8.990e+02, threshold=8.491e+02, percent-clipped=0.0 2023-06-18 03:29:06,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=111840.0, ans=0.0 2023-06-18 03:29:16,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111840.0, ans=0.1 2023-06-18 03:29:27,665 INFO [train.py:996] (3/4) Epoch 1, batch 18650, loss[loss=0.2741, simple_loss=0.333, pruned_loss=0.1076, over 21452.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3692, pruned_loss=0.1419, over 4241584.84 frames. ], batch size: 212, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:29:57,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=111960.0, ans=0.125 2023-06-18 03:30:43,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=112080.0, ans=0.125 2023-06-18 03:31:04,856 INFO [train.py:996] (3/4) Epoch 1, batch 18700, loss[loss=0.4047, simple_loss=0.4164, pruned_loss=0.1965, over 21723.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3691, pruned_loss=0.1451, over 4254498.31 frames. ], batch size: 389, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:32:03,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.596e+02 4.888e+02 6.142e+02 1.184e+03, threshold=9.776e+02, percent-clipped=7.0 2023-06-18 03:32:31,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=112440.0, ans=0.0 2023-06-18 03:32:47,380 INFO [train.py:996] (3/4) Epoch 1, batch 18750, loss[loss=0.3077, simple_loss=0.3463, pruned_loss=0.1346, over 21344.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3712, pruned_loss=0.1482, over 4242448.93 frames. ], batch size: 176, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:33:14,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=112560.0, ans=0.0 2023-06-18 03:33:40,239 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:33:40,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=112620.0, ans=0.2 2023-06-18 03:33:55,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=112680.0, ans=0.0 2023-06-18 03:34:15,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=112740.0, ans=0.2 2023-06-18 03:34:31,338 INFO [train.py:996] (3/4) Epoch 1, batch 18800, loss[loss=0.2944, simple_loss=0.3552, pruned_loss=0.1168, over 21833.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3782, pruned_loss=0.1497, over 4251807.52 frames. 
], batch size: 316, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:35:33,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=112920.0, ans=0.125 2023-06-18 03:35:36,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=112980.0, ans=0.125 2023-06-18 03:35:38,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.151e+02 4.208e+02 5.876e+02 1.169e+03, threshold=8.416e+02, percent-clipped=1.0 2023-06-18 03:36:00,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113040.0, ans=0.1 2023-06-18 03:36:15,755 INFO [train.py:996] (3/4) Epoch 1, batch 18850, loss[loss=0.2832, simple_loss=0.3303, pruned_loss=0.1181, over 21509.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3708, pruned_loss=0.1412, over 4247796.12 frames. ], batch size: 230, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:37:39,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=113280.0, ans=0.0 2023-06-18 03:37:39,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=113280.0, ans=12.0 2023-06-18 03:37:49,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-18 03:37:57,690 INFO [train.py:996] (3/4) Epoch 1, batch 18900, loss[loss=0.301, simple_loss=0.3261, pruned_loss=0.138, over 21059.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3671, pruned_loss=0.1415, over 4253197.54 frames. ], batch size: 608, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:38:21,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=113460.0, ans=0.0 2023-06-18 03:38:30,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113460.0, ans=0.125 2023-06-18 03:38:57,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.397e+02 4.704e+02 6.198e+02 1.365e+03, threshold=9.409e+02, percent-clipped=10.0 2023-06-18 03:39:09,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=22.5 2023-06-18 03:39:48,133 INFO [train.py:996] (3/4) Epoch 1, batch 18950, loss[loss=0.3595, simple_loss=0.4266, pruned_loss=0.1462, over 21835.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3732, pruned_loss=0.1473, over 4262388.21 frames. 
], batch size: 351, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:39:51,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113700.0, ans=0.125 2023-06-18 03:39:57,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113700.0, ans=0.125 2023-06-18 03:41:00,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=113880.0, ans=0.125 2023-06-18 03:41:03,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=113880.0, ans=0.2 2023-06-18 03:41:10,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-18 03:41:20,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=113940.0, ans=0.125 2023-06-18 03:41:31,599 INFO [train.py:996] (3/4) Epoch 1, batch 19000, loss[loss=0.3872, simple_loss=0.4394, pruned_loss=0.1675, over 21407.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3847, pruned_loss=0.1502, over 4268422.07 frames. ], batch size: 471, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:41:40,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=114000.0, ans=0.0 2023-06-18 03:41:59,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=114060.0, ans=0.2 2023-06-18 03:42:32,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=114120.0, ans=0.0 2023-06-18 03:42:37,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.163e+02 4.936e+02 6.528e+02 1.667e+03, threshold=9.873e+02, percent-clipped=8.0 2023-06-18 03:43:15,035 INFO [train.py:996] (3/4) Epoch 1, batch 19050, loss[loss=0.3451, simple_loss=0.3924, pruned_loss=0.1489, over 20628.00 frames. ], tot_loss[loss=0.3496, simple_loss=0.3897, pruned_loss=0.1547, over 4272168.70 frames. ], batch size: 607, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:44:00,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=114420.0, ans=0.125 2023-06-18 03:44:16,762 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:44:27,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=12.0 2023-06-18 03:44:55,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114540.0, ans=0.1 2023-06-18 03:44:58,922 INFO [train.py:996] (3/4) Epoch 1, batch 19100, loss[loss=0.3789, simple_loss=0.3981, pruned_loss=0.1798, over 21555.00 frames. ], tot_loss[loss=0.3517, simple_loss=0.3892, pruned_loss=0.1571, over 4274456.41 frames. ], batch size: 414, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:45:07,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.46 vs. 
limit=22.5 2023-06-18 03:45:49,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=114720.0, ans=0.2 2023-06-18 03:46:04,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.557e+02 3.849e+02 4.992e+02 6.577e+02 2.048e+03, threshold=9.985e+02, percent-clipped=3.0 2023-06-18 03:46:14,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=114780.0, ans=0.0 2023-06-18 03:46:35,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=114840.0, ans=0.0 2023-06-18 03:46:42,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=114900.0, ans=0.0 2023-06-18 03:46:43,843 INFO [train.py:996] (3/4) Epoch 1, batch 19150, loss[loss=0.4895, simple_loss=0.5297, pruned_loss=0.2247, over 21497.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3905, pruned_loss=0.1575, over 4265382.31 frames. ], batch size: 471, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:46:54,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=114900.0, ans=0.0 2023-06-18 03:47:26,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114960.0, ans=0.1 2023-06-18 03:47:43,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.57 vs. limit=22.5 2023-06-18 03:47:53,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=115080.0, ans=0.125 2023-06-18 03:48:28,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=115200.0, ans=0.125 2023-06-18 03:48:29,954 INFO [train.py:996] (3/4) Epoch 1, batch 19200, loss[loss=0.3341, simple_loss=0.4205, pruned_loss=0.1239, over 21719.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.401, pruned_loss=0.1581, over 4269346.38 frames. ], batch size: 298, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:49:08,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-18 03:49:30,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.474e+02 4.215e+02 5.397e+02 9.229e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:50:04,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115440.0, ans=0.1 2023-06-18 03:50:08,857 INFO [train.py:996] (3/4) Epoch 1, batch 19250, loss[loss=0.3645, simple_loss=0.4095, pruned_loss=0.1598, over 21716.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3974, pruned_loss=0.1487, over 4278024.76 frames. ], batch size: 441, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:50:09,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=115500.0, ans=0.2 2023-06-18 03:51:05,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.13 vs. 
limit=22.5 2023-06-18 03:51:09,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=115620.0, ans=0.125 2023-06-18 03:51:11,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=115620.0, ans=0.95 2023-06-18 03:51:30,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-18 03:51:52,357 INFO [train.py:996] (3/4) Epoch 1, batch 19300, loss[loss=0.3247, simple_loss=0.3686, pruned_loss=0.1404, over 21585.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3951, pruned_loss=0.1487, over 4276692.09 frames. ], batch size: 195, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:52:47,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=115920.0, ans=0.125 2023-06-18 03:53:02,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 3.240e+02 4.224e+02 5.313e+02 1.250e+03, threshold=8.447e+02, percent-clipped=7.0 2023-06-18 03:53:09,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=115980.0, ans=0.125 2023-06-18 03:53:41,267 INFO [train.py:996] (3/4) Epoch 1, batch 19350, loss[loss=0.4323, simple_loss=0.4979, pruned_loss=0.1834, over 19714.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3876, pruned_loss=0.1426, over 4276006.24 frames. ], batch size: 703, lr: 2.71e-02, grad_scale: 64.0 2023-06-18 03:54:01,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-18 03:54:15,586 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:54:28,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=116220.0, ans=0.2 2023-06-18 03:54:30,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=6.0 2023-06-18 03:54:44,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-18 03:54:58,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116280.0, ans=0.1 2023-06-18 03:55:24,566 INFO [train.py:996] (3/4) Epoch 1, batch 19400, loss[loss=0.2977, simple_loss=0.3476, pruned_loss=0.124, over 21533.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3829, pruned_loss=0.1403, over 4280482.51 frames. ], batch size: 195, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:56:15,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116520.0, ans=0.0 2023-06-18 03:56:28,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. 
limit=15.0 2023-06-18 03:56:29,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.749e+02 4.636e+02 5.829e+02 1.066e+03, threshold=9.272e+02, percent-clipped=6.0 2023-06-18 03:57:05,434 INFO [train.py:996] (3/4) Epoch 1, batch 19450, loss[loss=0.3495, simple_loss=0.3723, pruned_loss=0.1633, over 21860.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3824, pruned_loss=0.1446, over 4290127.47 frames. ], batch size: 107, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:57:09,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-18 03:57:27,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-18 03:58:33,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=116940.0, ans=0.04949747468305833 2023-06-18 03:58:56,288 INFO [train.py:996] (3/4) Epoch 1, batch 19500, loss[loss=0.2959, simple_loss=0.3311, pruned_loss=0.1303, over 21532.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3785, pruned_loss=0.1464, over 4281951.03 frames. ], batch size: 230, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:59:19,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=117060.0, ans=0.0 2023-06-18 03:59:57,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.820e+02 4.726e+02 6.793e+02 1.461e+03, threshold=9.451e+02, percent-clipped=7.0 2023-06-18 04:00:25,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=117240.0, ans=0.0 2023-06-18 04:00:32,864 INFO [train.py:996] (3/4) Epoch 1, batch 19550, loss[loss=0.3567, simple_loss=0.4175, pruned_loss=0.1479, over 21534.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3737, pruned_loss=0.1439, over 4265553.17 frames. ], batch size: 471, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:00:53,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=117300.0, ans=0.125 2023-06-18 04:01:04,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=117360.0, ans=0.125 2023-06-18 04:01:31,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=117420.0, ans=0.125 2023-06-18 04:01:42,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=117480.0, ans=0.125 2023-06-18 04:02:13,851 INFO [train.py:996] (3/4) Epoch 1, batch 19600, loss[loss=0.3911, simple_loss=0.4035, pruned_loss=0.1893, over 21620.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3781, pruned_loss=0.1472, over 4272048.21 frames. 
], batch size: 548, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:02:43,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=117660.0, ans=0.125 2023-06-18 04:02:45,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=117660.0, ans=0.125 2023-06-18 04:03:20,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.478e+02 4.292e+02 5.648e+02 1.125e+03, threshold=8.585e+02, percent-clipped=2.0 2023-06-18 04:04:03,563 INFO [train.py:996] (3/4) Epoch 1, batch 19650, loss[loss=0.3754, simple_loss=0.4168, pruned_loss=0.167, over 21468.00 frames. ], tot_loss[loss=0.348, simple_loss=0.387, pruned_loss=0.1544, over 4276030.48 frames. ], batch size: 131, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:04:10,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=117900.0, ans=0.2 2023-06-18 04:04:43,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117960.0, ans=0.1 2023-06-18 04:05:52,058 INFO [train.py:996] (3/4) Epoch 1, batch 19700, loss[loss=0.3354, simple_loss=0.4034, pruned_loss=0.1337, over 21708.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3925, pruned_loss=0.1558, over 4275274.70 frames. ], batch size: 351, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:06:12,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=118260.0, ans=0.0 2023-06-18 04:06:27,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=118260.0, ans=0.0 2023-06-18 04:06:28,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=118320.0, ans=0.125 2023-06-18 04:06:35,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=118320.0, ans=0.0 2023-06-18 04:06:59,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.705e+02 3.771e+02 4.552e+02 5.763e+02 1.165e+03, threshold=9.104e+02, percent-clipped=3.0 2023-06-18 04:07:11,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=118380.0, ans=0.95 2023-06-18 04:07:30,613 INFO [train.py:996] (3/4) Epoch 1, batch 19750, loss[loss=0.3441, simple_loss=0.4038, pruned_loss=0.1422, over 21615.00 frames. ], tot_loss[loss=0.3578, simple_loss=0.4024, pruned_loss=0.1566, over 4276144.62 frames. ], batch size: 263, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:07:55,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118560.0, ans=0.1 2023-06-18 04:08:58,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=118740.0, ans=0.0 2023-06-18 04:09:04,703 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.835e-02 2023-06-18 04:09:17,493 INFO [train.py:996] (3/4) Epoch 1, batch 19800, loss[loss=0.3626, simple_loss=0.3972, pruned_loss=0.164, over 21829.00 frames. ], tot_loss[loss=0.3579, simple_loss=0.4012, pruned_loss=0.1573, over 4280014.18 frames. 
], batch size: 282, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:09:37,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=118860.0, ans=0.0 2023-06-18 04:09:45,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118860.0, ans=0.1 2023-06-18 04:09:58,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-18 04:10:23,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.634e+02 4.450e+02 5.874e+02 9.997e+02, threshold=8.899e+02, percent-clipped=2.0 2023-06-18 04:10:29,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=118980.0, ans=0.125 2023-06-18 04:10:31,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=118980.0, ans=0.125 2023-06-18 04:10:31,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=118980.0, ans=0.09899494936611666 2023-06-18 04:10:36,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-18 04:11:00,156 INFO [train.py:996] (3/4) Epoch 1, batch 19850, loss[loss=0.3014, simple_loss=0.371, pruned_loss=0.1159, over 21876.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3905, pruned_loss=0.1479, over 4283043.44 frames. ], batch size: 317, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:12:45,761 INFO [train.py:996] (3/4) Epoch 1, batch 19900, loss[loss=0.2236, simple_loss=0.2959, pruned_loss=0.07567, over 15775.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3889, pruned_loss=0.1443, over 4282947.92 frames. ], batch size: 60, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:13:36,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.77 vs. limit=5.0 2023-06-18 04:13:51,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.596e+02 4.410e+02 6.393e+02 1.239e+03, threshold=8.821e+02, percent-clipped=7.0 2023-06-18 04:13:58,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=119580.0, ans=0.125 2023-06-18 04:14:14,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=119640.0, ans=0.0 2023-06-18 04:14:27,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-18 04:14:27,872 INFO [train.py:996] (3/4) Epoch 1, batch 19950, loss[loss=0.3109, simple_loss=0.3572, pruned_loss=0.1323, over 21758.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3815, pruned_loss=0.1445, over 4272623.95 frames. 
], batch size: 351, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:15:22,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=119820.0, ans=0.0 2023-06-18 04:15:29,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119820.0, ans=0.125 2023-06-18 04:16:12,227 INFO [train.py:996] (3/4) Epoch 1, batch 20000, loss[loss=0.362, simple_loss=0.3966, pruned_loss=0.1637, over 21875.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.383, pruned_loss=0.1462, over 4281662.96 frames. ], batch size: 118, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:16:49,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120060.0, ans=0.1 2023-06-18 04:17:05,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120120.0, ans=0.1 2023-06-18 04:17:18,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.426e+02 6.098e+02 1.164e+03, threshold=8.852e+02, percent-clipped=3.0 2023-06-18 04:17:29,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120180.0, ans=0.1 2023-06-18 04:17:52,848 INFO [train.py:996] (3/4) Epoch 1, batch 20050, loss[loss=0.3994, simple_loss=0.418, pruned_loss=0.1904, over 21775.00 frames. ], tot_loss[loss=0.3434, simple_loss=0.3865, pruned_loss=0.1502, over 4285626.66 frames. ], batch size: 441, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:17:58,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=120300.0, ans=0.125 2023-06-18 04:18:30,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=120360.0, ans=0.125 2023-06-18 04:19:16,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-18 04:19:17,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=120480.0, ans=0.2 2023-06-18 04:19:36,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=120600.0, ans=0.0 2023-06-18 04:19:37,710 INFO [train.py:996] (3/4) Epoch 1, batch 20100, loss[loss=0.3109, simple_loss=0.3654, pruned_loss=0.1282, over 21373.00 frames. ], tot_loss[loss=0.3481, simple_loss=0.3892, pruned_loss=0.1535, over 4290742.25 frames. ], batch size: 159, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:20:14,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=120660.0, ans=0.2 2023-06-18 04:20:51,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.978e+02 4.839e+02 6.470e+02 1.176e+03, threshold=9.678e+02, percent-clipped=4.0 2023-06-18 04:20:54,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=120780.0, ans=0.125 2023-06-18 04:21:32,587 INFO [train.py:996] (3/4) Epoch 1, batch 20150, loss[loss=0.3695, simple_loss=0.4049, pruned_loss=0.1671, over 21548.00 frames. ], tot_loss[loss=0.3602, simple_loss=0.4021, pruned_loss=0.1592, over 4293081.27 frames. 
], batch size: 230, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:22:02,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=8.0 2023-06-18 04:22:24,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=121020.0, ans=0.025 2023-06-18 04:22:35,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121080.0, ans=0.125 2023-06-18 04:23:23,942 INFO [train.py:996] (3/4) Epoch 1, batch 20200, loss[loss=0.3564, simple_loss=0.4183, pruned_loss=0.1472, over 21391.00 frames. ], tot_loss[loss=0.3658, simple_loss=0.4079, pruned_loss=0.1619, over 4290523.73 frames. ], batch size: 194, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:23:44,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=121260.0, ans=0.125 2023-06-18 04:24:24,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121380.0, ans=0.125 2023-06-18 04:24:26,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 4.003e+02 5.201e+02 6.811e+02 1.420e+03, threshold=1.040e+03, percent-clipped=11.0 2023-06-18 04:25:05,801 INFO [train.py:996] (3/4) Epoch 1, batch 20250, loss[loss=0.4115, simple_loss=0.4322, pruned_loss=0.1955, over 21604.00 frames. ], tot_loss[loss=0.363, simple_loss=0.4076, pruned_loss=0.1592, over 4293165.00 frames. ], batch size: 507, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:25:07,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121500.0, ans=0.125 2023-06-18 04:25:21,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=121500.0, ans=0.0 2023-06-18 04:26:18,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-18 04:26:41,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=121740.0, ans=0.5 2023-06-18 04:26:47,828 INFO [train.py:996] (3/4) Epoch 1, batch 20300, loss[loss=0.2778, simple_loss=0.3293, pruned_loss=0.1131, over 21878.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.4021, pruned_loss=0.1533, over 4282951.65 frames. 
], batch size: 98, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:26:59,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=121800.0, ans=0.2 2023-06-18 04:27:47,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=121980.0, ans=0.0 2023-06-18 04:27:55,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.171e+02 3.718e+02 4.802e+02 8.828e+02, threshold=7.436e+02, percent-clipped=0.0 2023-06-18 04:28:12,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=122040.0, ans=0.09899494936611666 2023-06-18 04:28:18,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=122040.0, ans=0.0 2023-06-18 04:28:28,651 INFO [train.py:996] (3/4) Epoch 1, batch 20350, loss[loss=0.3489, simple_loss=0.3981, pruned_loss=0.1498, over 21654.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.4012, pruned_loss=0.1529, over 4280709.79 frames. ], batch size: 389, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:28:38,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=122100.0, ans=0.125 2023-06-18 04:28:45,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=122100.0, ans=0.125 2023-06-18 04:28:58,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=122160.0, ans=0.0 2023-06-18 04:28:59,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=122160.0, ans=0.125 2023-06-18 04:29:06,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=122220.0, ans=0.0 2023-06-18 04:29:22,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-18 04:29:26,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-18 04:30:16,412 INFO [train.py:996] (3/4) Epoch 1, batch 20400, loss[loss=0.3941, simple_loss=0.4321, pruned_loss=0.178, over 21678.00 frames. ], tot_loss[loss=0.3598, simple_loss=0.4051, pruned_loss=0.1572, over 4274105.96 frames. ], batch size: 389, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:31:00,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=122520.0, ans=0.07 2023-06-18 04:31:13,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 4.014e+02 4.909e+02 5.768e+02 1.154e+03, threshold=9.817e+02, percent-clipped=10.0 2023-06-18 04:31:49,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0 2023-06-18 04:31:51,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=122700.0, ans=0.125 2023-06-18 04:31:53,069 INFO [train.py:996] (3/4) Epoch 1, batch 20450, loss[loss=0.3668, simple_loss=0.3981, pruned_loss=0.1678, over 21696.00 frames. 
], tot_loss[loss=0.3656, simple_loss=0.4074, pruned_loss=0.1619, over 4267039.18 frames. ], batch size: 112, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:31:53,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122700.0, ans=0.1 2023-06-18 04:33:33,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=123000.0, ans=0.125 2023-06-18 04:33:34,854 INFO [train.py:996] (3/4) Epoch 1, batch 20500, loss[loss=0.3325, simple_loss=0.3576, pruned_loss=0.1537, over 21357.00 frames. ], tot_loss[loss=0.3649, simple_loss=0.4037, pruned_loss=0.1631, over 4267849.47 frames. ], batch size: 176, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:34:08,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123060.0, ans=0.1 2023-06-18 04:34:19,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-06-18 04:34:36,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=123180.0, ans=0.0 2023-06-18 04:34:43,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.898e+02 4.731e+02 5.915e+02 1.084e+03, threshold=9.462e+02, percent-clipped=4.0 2023-06-18 04:35:23,672 INFO [train.py:996] (3/4) Epoch 1, batch 20550, loss[loss=0.3916, simple_loss=0.441, pruned_loss=0.1711, over 21475.00 frames. ], tot_loss[loss=0.3577, simple_loss=0.3955, pruned_loss=0.1599, over 4254174.59 frames. ], batch size: 473, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:35:34,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=123300.0, ans=0.125 2023-06-18 04:35:35,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=123300.0, ans=0.2 2023-06-18 04:36:10,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=123420.0, ans=0.0 2023-06-18 04:36:25,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=123480.0, ans=0.125 2023-06-18 04:37:06,864 INFO [train.py:996] (3/4) Epoch 1, batch 20600, loss[loss=0.468, simple_loss=0.4682, pruned_loss=0.2339, over 21619.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.396, pruned_loss=0.1553, over 4255618.96 frames. ], batch size: 507, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:37:23,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=123660.0, ans=0.125 2023-06-18 04:37:28,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=12.0 2023-06-18 04:37:31,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-18 04:37:33,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-06-18 04:37:37,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=123720.0, ans=0.0 2023-06-18 04:38:06,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=123780.0, ans=0.0 2023-06-18 04:38:09,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.464e+02 4.439e+02 5.514e+02 9.400e+02, threshold=8.878e+02, percent-clipped=0.0 2023-06-18 04:38:20,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123780.0, ans=0.1 2023-06-18 04:38:26,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=123840.0, ans=0.125 2023-06-18 04:38:36,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-06-18 04:38:48,796 INFO [train.py:996] (3/4) Epoch 1, batch 20650, loss[loss=0.2929, simple_loss=0.3274, pruned_loss=0.1292, over 21574.00 frames. ], tot_loss[loss=0.3499, simple_loss=0.3904, pruned_loss=0.1547, over 4266561.46 frames. ], batch size: 263, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:39:24,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124020.0, ans=0.1 2023-06-18 04:39:37,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=124020.0, ans=0.2 2023-06-18 04:40:03,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=124080.0, ans=0.0 2023-06-18 04:40:08,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=124140.0, ans=0.2 2023-06-18 04:40:31,323 INFO [train.py:996] (3/4) Epoch 1, batch 20700, loss[loss=0.4117, simple_loss=0.4541, pruned_loss=0.1847, over 19994.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3829, pruned_loss=0.1505, over 4256565.18 frames. ], batch size: 703, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:40:37,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-18 04:40:42,854 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:40:50,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=124260.0, ans=0.2 2023-06-18 04:41:38,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.293e+02 3.859e+02 5.120e+02 8.262e+02, threshold=7.718e+02, percent-clipped=0.0 2023-06-18 04:41:52,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=124380.0, ans=0.1 2023-06-18 04:42:12,821 INFO [train.py:996] (3/4) Epoch 1, batch 20750, loss[loss=0.3632, simple_loss=0.4237, pruned_loss=0.1513, over 21773.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.3845, pruned_loss=0.148, over 4263351.46 frames. 
], batch size: 332, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:42:32,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=124560.0, ans=0.125 2023-06-18 04:42:34,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=124560.0, ans=0.125 2023-06-18 04:43:26,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=124680.0, ans=0.0 2023-06-18 04:43:28,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0 2023-06-18 04:43:45,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=124740.0, ans=15.0 2023-06-18 04:43:50,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=124740.0, ans=0.0 2023-06-18 04:43:56,107 INFO [train.py:996] (3/4) Epoch 1, batch 20800, loss[loss=0.3135, simple_loss=0.3449, pruned_loss=0.1411, over 21974.00 frames. ], tot_loss[loss=0.345, simple_loss=0.388, pruned_loss=0.151, over 4253525.86 frames. ], batch size: 103, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:43:58,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-18 04:44:12,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124860.0, ans=0.125 2023-06-18 04:44:24,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124860.0, ans=0.125 2023-06-18 04:45:08,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.772e+02 4.526e+02 5.632e+02 1.034e+03, threshold=9.051e+02, percent-clipped=9.0 2023-06-18 04:45:27,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125040.0, ans=0.125 2023-06-18 04:45:36,932 INFO [train.py:996] (3/4) Epoch 1, batch 20850, loss[loss=0.338, simple_loss=0.3708, pruned_loss=0.1526, over 21531.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3779, pruned_loss=0.1465, over 4255830.10 frames. ], batch size: 471, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:45:58,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=125160.0, ans=0.0 2023-06-18 04:46:23,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2023-06-18 04:46:33,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125220.0, ans=0.0 2023-06-18 04:46:56,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=125280.0, ans=0.95 2023-06-18 04:46:59,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=125280.0, ans=0.0 2023-06-18 04:47:18,193 INFO [train.py:996] (3/4) Epoch 1, batch 20900, loss[loss=0.32, simple_loss=0.3776, pruned_loss=0.1313, over 21785.00 frames. 
], tot_loss[loss=0.3382, simple_loss=0.3789, pruned_loss=0.1487, over 4272450.28 frames. ], batch size: 332, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:47:18,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=125400.0, ans=0.0 2023-06-18 04:48:25,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.309e+02 3.915e+02 5.105e+02 1.001e+03, threshold=7.830e+02, percent-clipped=2.0 2023-06-18 04:48:28,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125580.0, ans=0.1 2023-06-18 04:48:47,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=125640.0, ans=0.0 2023-06-18 04:48:48,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-18 04:48:53,588 INFO [train.py:996] (3/4) Epoch 1, batch 20950, loss[loss=0.3893, simple_loss=0.4035, pruned_loss=0.1876, over 21594.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3734, pruned_loss=0.1424, over 4275807.26 frames. ], batch size: 508, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:49:36,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=125820.0, ans=0.2 2023-06-18 04:49:36,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125820.0, ans=0.125 2023-06-18 04:50:25,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125940.0, ans=0.125 2023-06-18 04:50:26,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-18 04:50:33,080 INFO [train.py:996] (3/4) Epoch 1, batch 21000, loss[loss=0.3621, simple_loss=0.3942, pruned_loss=0.165, over 21873.00 frames. ], tot_loss[loss=0.33, simple_loss=0.3732, pruned_loss=0.1434, over 4274637.76 frames. ], batch size: 414, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:50:33,081 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 04:50:47,289 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.5444, 2.8620, 1.4155, 1.9523], device='cuda:3') 2023-06-18 04:50:50,127 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.3151, simple_loss=0.4075, pruned_loss=0.1114, over 1796401.00 frames. 2023-06-18 04:50:50,128 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 04:51:51,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=126120.0, ans=0.2 2023-06-18 04:52:02,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.430e+02 4.586e+02 6.344e+02 1.913e+03, threshold=9.172e+02, percent-clipped=11.0 2023-06-18 04:52:08,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-06-18 04:52:30,649 INFO [train.py:996] (3/4) Epoch 1, batch 21050, loss[loss=0.3438, simple_loss=0.3828, pruned_loss=0.1523, over 21535.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3713, pruned_loss=0.1441, over 4269267.54 frames. 
], batch size: 414, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:52:39,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=126300.0, ans=0.125 2023-06-18 04:52:40,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126300.0, ans=0.125 2023-06-18 04:53:02,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-06-18 04:53:12,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-18 04:53:19,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126420.0, ans=0.125 2023-06-18 04:54:06,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=126600.0, ans=0.09899494936611666 2023-06-18 04:54:07,564 INFO [train.py:996] (3/4) Epoch 1, batch 21100, loss[loss=0.298, simple_loss=0.3354, pruned_loss=0.1303, over 21865.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3662, pruned_loss=0.1428, over 4266745.47 frames. ], batch size: 107, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:54:57,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=126720.0, ans=0.125 2023-06-18 04:55:14,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.438e+02 4.271e+02 5.279e+02 9.041e+02, threshold=8.542e+02, percent-clipped=0.0 2023-06-18 04:55:32,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=126840.0, ans=0.07 2023-06-18 04:55:43,336 INFO [train.py:996] (3/4) Epoch 1, batch 21150, loss[loss=0.3253, simple_loss=0.361, pruned_loss=0.1448, over 15647.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3612, pruned_loss=0.142, over 4266023.80 frames. ], batch size: 60, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:55:50,179 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:55:50,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=126900.0, ans=0.0 2023-06-18 04:55:58,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=126960.0, ans=0.0 2023-06-18 04:56:01,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126960.0, ans=0.1 2023-06-18 04:56:53,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-18 04:56:57,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127080.0, ans=0.1 2023-06-18 04:57:05,896 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:57:20,002 INFO [train.py:996] (3/4) Epoch 1, batch 21200, loss[loss=0.3456, simple_loss=0.3733, pruned_loss=0.159, over 21586.00 frames. 
], tot_loss[loss=0.3181, simple_loss=0.3568, pruned_loss=0.1398, over 4267649.15 frames. ], batch size: 414, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:58:33,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.663e+02 4.545e+02 5.734e+02 1.350e+03, threshold=9.091e+02, percent-clipped=8.0 2023-06-18 04:59:02,992 INFO [train.py:996] (3/4) Epoch 1, batch 21250, loss[loss=0.329, simple_loss=0.3705, pruned_loss=0.1437, over 21339.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3561, pruned_loss=0.1401, over 4268941.59 frames. ], batch size: 159, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 05:00:19,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-18 05:00:24,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=127740.0, ans=0.125 2023-06-18 05:00:41,952 INFO [train.py:996] (3/4) Epoch 1, batch 21300, loss[loss=0.3227, simple_loss=0.3544, pruned_loss=0.1455, over 20042.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3643, pruned_loss=0.1449, over 4270441.54 frames. ], batch size: 702, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:01:46,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=127920.0, ans=0.125 2023-06-18 05:01:48,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=127920.0, ans=0.0 2023-06-18 05:01:54,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.515e+02 4.385e+02 5.674e+02 1.308e+03, threshold=8.770e+02, percent-clipped=8.0 2023-06-18 05:02:14,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128040.0, ans=0.1 2023-06-18 05:02:23,574 INFO [train.py:996] (3/4) Epoch 1, batch 21350, loss[loss=0.2576, simple_loss=0.3271, pruned_loss=0.09404, over 21442.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3701, pruned_loss=0.1458, over 4280813.77 frames. ], batch size: 195, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:03:33,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=128280.0, ans=0.125 2023-06-18 05:04:06,116 INFO [train.py:996] (3/4) Epoch 1, batch 21400, loss[loss=0.3829, simple_loss=0.4182, pruned_loss=0.1738, over 21703.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3769, pruned_loss=0.1474, over 4283032.95 frames. ], batch size: 351, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:05:00,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=128520.0, ans=0.125 2023-06-18 05:05:09,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128520.0, ans=0.1 2023-06-18 05:05:18,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.244e+02 3.990e+02 4.956e+02 1.756e+03, threshold=7.981e+02, percent-clipped=8.0 2023-06-18 05:05:47,921 INFO [train.py:996] (3/4) Epoch 1, batch 21450, loss[loss=0.355, simple_loss=0.3899, pruned_loss=0.1601, over 21894.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3792, pruned_loss=0.148, over 4282303.89 frames. 
], batch size: 414, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:06:04,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128700.0, ans=0.1 2023-06-18 05:06:12,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128760.0, ans=0.125 2023-06-18 05:07:01,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=128880.0, ans=0.125 2023-06-18 05:07:23,878 INFO [train.py:996] (3/4) Epoch 1, batch 21500, loss[loss=0.3137, simple_loss=0.3486, pruned_loss=0.1394, over 21665.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3771, pruned_loss=0.1488, over 4272675.08 frames. ], batch size: 333, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:08:25,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=129120.0, ans=0.125 2023-06-18 05:08:35,886 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.287e+02 4.067e+02 5.300e+02 1.405e+03, threshold=8.134e+02, percent-clipped=7.0 2023-06-18 05:08:59,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:09:05,154 INFO [train.py:996] (3/4) Epoch 1, batch 21550, loss[loss=0.2846, simple_loss=0.3183, pruned_loss=0.1255, over 21273.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3685, pruned_loss=0.1446, over 4272469.71 frames. ], batch size: 159, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:09:14,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=129300.0, ans=0.0 2023-06-18 05:09:37,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=129360.0, ans=0.0 2023-06-18 05:09:41,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-18 05:10:07,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=129420.0, ans=0.0 2023-06-18 05:10:24,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-18 05:10:35,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=129540.0, ans=0.2 2023-06-18 05:10:40,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=129540.0, ans=0.04949747468305833 2023-06-18 05:10:53,620 INFO [train.py:996] (3/4) Epoch 1, batch 21600, loss[loss=0.3234, simple_loss=0.3766, pruned_loss=0.1351, over 21828.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3628, pruned_loss=0.1418, over 4273160.72 frames. ], batch size: 372, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:11:03,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=129600.0, ans=0.0 2023-06-18 05:11:06,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
2023-06-18 05:11:36,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129660.0, ans=0.1
2023-06-18 05:11:43,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=129720.0, ans=0.2
2023-06-18 05:11:52,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=129720.0, ans=0.125
2023-06-18 05:11:56,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=129780.0, ans=0.0
2023-06-18 05:12:01,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.331e+02 4.167e+02 5.142e+02 1.133e+03, threshold=8.334e+02, percent-clipped=4.0
2023-06-18 05:12:34,518 INFO [train.py:996] (3/4) Epoch 1, batch 21650, loss[loss=0.2814, simple_loss=0.3451, pruned_loss=0.1088, over 21741.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3696, pruned_loss=0.1406, over 4269895.30 frames. ], batch size: 124, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 05:13:29,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=130020.0, ans=0.0
2023-06-18 05:13:53,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5
2023-06-18 05:14:15,273 INFO [train.py:996] (3/4) Epoch 1, batch 21700, loss[loss=0.3011, simple_loss=0.3458, pruned_loss=0.1282, over 21285.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3662, pruned_loss=0.1354, over 4261640.18 frames. ], batch size: 176, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 05:14:34,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130260.0, ans=0.125
2023-06-18 05:14:38,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0
2023-06-18 05:14:50,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130260.0, ans=0.1
2023-06-18 05:14:52,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0
2023-06-18 05:14:58,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130320.0, ans=0.125
2023-06-18 05:14:58,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0
2023-06-18 05:15:15,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=130380.0, ans=0.125
2023-06-18 05:15:16,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.479e+02 4.448e+02 5.687e+02 1.020e+03, threshold=8.895e+02, percent-clipped=10.0
2023-06-18 05:15:50,771 INFO [train.py:996] (3/4) Epoch 1, batch 21750, loss[loss=0.319, simple_loss=0.3581, pruned_loss=0.14, over 21794.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3635, pruned_loss=0.1376, over 4258259.82 frames. ], batch size: 107, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 05:16:46,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=130620.0, ans=0.125
2023-06-18 05:16:48,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=130620.0, ans=0.125
2023-06-18 05:17:34,228 INFO [train.py:996] (3/4) Epoch 1, batch 21800, loss[loss=0.4136, simple_loss=0.4413, pruned_loss=0.193, over 21521.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3629, pruned_loss=0.1404, over 4256473.02 frames. ], batch size: 509, lr: 2.57e-02, grad_scale: 32.0
2023-06-18 05:17:40,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.22 vs. limit=15.0
2023-06-18 05:17:41,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=130800.0, ans=0.2
2023-06-18 05:18:02,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.29 vs. limit=22.5
2023-06-18 05:18:06,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=130860.0, ans=0.125
2023-06-18 05:18:39,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=130980.0, ans=0.07
2023-06-18 05:18:42,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.714e+02 4.449e+02 6.326e+02 1.060e+03, threshold=8.898e+02, percent-clipped=3.0
2023-06-18 05:18:45,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=130980.0, ans=0.07
2023-06-18 05:18:46,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0
2023-06-18 05:19:02,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=131040.0, ans=0.2
2023-06-18 05:19:16,630 INFO [train.py:996] (3/4) Epoch 1, batch 21850, loss[loss=0.3692, simple_loss=0.4073, pruned_loss=0.1656, over 21852.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3694, pruned_loss=0.1411, over 4255890.30 frames. ], batch size: 414, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 05:20:13,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=131220.0, ans=0.0
2023-06-18 05:20:40,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=131340.0, ans=0.125
2023-06-18 05:20:47,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=131340.0, ans=0.0
2023-06-18 05:20:56,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=131400.0, ans=0.125
2023-06-18 05:20:58,027 INFO [train.py:996] (3/4) Epoch 1, batch 21900, loss[loss=0.3496, simple_loss=0.3767, pruned_loss=0.1613, over 21818.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3745, pruned_loss=0.1439, over 4257268.45 frames. ], batch size: 316, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 05:21:29,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=131460.0, ans=0.0
2023-06-18 05:21:38,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=131520.0, ans=0.0
2023-06-18 05:21:52,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0
2023-06-18 05:22:04,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 3.404e+02 4.082e+02 5.077e+02 9.199e+02, threshold=8.164e+02, percent-clipped=1.0
2023-06-18 05:22:12,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=131580.0, ans=0.125
2023-06-18 05:22:22,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=131640.0, ans=0.125
2023-06-18 05:22:33,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131640.0, ans=0.125
2023-06-18 05:22:38,126 INFO [train.py:996] (3/4) Epoch 1, batch 21950, loss[loss=0.2369, simple_loss=0.2924, pruned_loss=0.09063, over 21723.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3667, pruned_loss=0.1406, over 4257909.30 frames. ], batch size: 112, lr: 2.56e-02, grad_scale: 32.0
2023-06-18 05:23:19,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.86 vs. limit=6.0
2023-06-18 05:23:26,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=131820.0, ans=0.125
2023-06-18 05:23:32,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=131820.0, ans=0.015
2023-06-18 05:23:35,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=131880.0, ans=0.2
2023-06-18 05:24:19,969 INFO [train.py:996] (3/4) Epoch 1, batch 22000, loss[loss=0.3525, simple_loss=0.3856, pruned_loss=0.1597, over 21826.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3586, pruned_loss=0.1351, over 4253060.44 frames. ], batch size: 372, lr: 2.56e-02, grad_scale: 64.0
2023-06-18 05:24:32,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=132000.0, ans=0.025
2023-06-18 05:24:37,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=132000.0, ans=0.0
2023-06-18 05:25:07,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=132120.0, ans=0.0
2023-06-18 05:25:30,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.717e+02 4.714e+02 6.490e+02 1.072e+03, threshold=9.428e+02, percent-clipped=6.0
2023-06-18 05:25:34,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132180.0, ans=0.1
2023-06-18 05:26:08,201 INFO [train.py:996] (3/4) Epoch 1, batch 22050, loss[loss=0.4106, simple_loss=0.4445, pruned_loss=0.1884, over 21262.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3621, pruned_loss=0.1364, over 4256064.24 frames. ], batch size: 159, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 05:26:31,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=132360.0, ans=0.0
2023-06-18 05:27:38,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=132540.0, ans=10.0
2023-06-18 05:27:46,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5
2023-06-18 05:27:48,446 INFO [train.py:996] (3/4) Epoch 1, batch 22100, loss[loss=0.3591, simple_loss=0.4011, pruned_loss=0.1586, over 21776.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3771, pruned_loss=0.1461, over 4259695.72 frames. ], batch size: 282, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 05:28:04,448 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:28:06,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=132600.0, ans=0.125
2023-06-18 05:28:16,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=132660.0, ans=0.125
2023-06-18 05:28:25,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132660.0, ans=0.125
2023-06-18 05:28:54,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 4.025e+02 4.912e+02 6.450e+02 1.246e+03, threshold=9.825e+02, percent-clipped=3.0
2023-06-18 05:29:32,050 INFO [train.py:996] (3/4) Epoch 1, batch 22150, loss[loss=0.3236, simple_loss=0.3609, pruned_loss=0.1432, over 21914.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.3818, pruned_loss=0.1484, over 4259513.96 frames. ], batch size: 107, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 05:30:31,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133080.0, ans=0.1
2023-06-18 05:31:12,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=133200.0, ans=0.125
2023-06-18 05:31:13,342 INFO [train.py:996] (3/4) Epoch 1, batch 22200, loss[loss=0.4095, simple_loss=0.4403, pruned_loss=0.1894, over 21781.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.3836, pruned_loss=0.1504, over 4272137.40 frames. ], batch size: 441, lr: 2.55e-02, grad_scale: 32.0
2023-06-18 05:31:22,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=133200.0, ans=10.0
2023-06-18 05:31:38,066 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:31:47,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=133260.0, ans=0.125
2023-06-18 05:32:16,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.029e+02 4.889e+02 6.211e+02 1.093e+03, threshold=9.779e+02, percent-clipped=2.0
2023-06-18 05:32:28,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=22.5
2023-06-18 05:32:59,483 INFO [train.py:996] (3/4) Epoch 1, batch 22250, loss[loss=0.362, simple_loss=0.4119, pruned_loss=0.1561, over 21413.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3909, pruned_loss=0.1519, over 4281764.11 frames. ], batch size: 211, lr: 2.54e-02, grad_scale: 32.0
2023-06-18 05:33:06,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0
2023-06-18 05:33:36,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133620.0, ans=0.125
2023-06-18 05:33:46,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=133620.0, ans=0.125
2023-06-18 05:34:30,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=15.0
2023-06-18 05:34:39,732 INFO [train.py:996] (3/4) Epoch 1, batch 22300, loss[loss=0.3318, simple_loss=0.3619, pruned_loss=0.1508, over 21327.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3936, pruned_loss=0.1551, over 4285616.76 frames. ], batch size: 176, lr: 2.54e-02, grad_scale: 32.0
2023-06-18 05:35:36,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=133980.0, ans=0.2
2023-06-18 05:35:37,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.602e+02 4.280e+02 5.421e+02 8.254e+02, threshold=8.559e+02, percent-clipped=0.0
2023-06-18 05:35:46,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=133980.0, ans=0.2
2023-06-18 05:35:52,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=134040.0, ans=0.0
2023-06-18 05:36:20,509 INFO [train.py:996] (3/4) Epoch 1, batch 22350, loss[loss=0.2918, simple_loss=0.3532, pruned_loss=0.1152, over 21847.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.391, pruned_loss=0.1556, over 4284828.13 frames. ], batch size: 351, lr: 2.54e-02, grad_scale: 32.0
2023-06-18 05:36:39,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=134160.0, ans=0.0
2023-06-18 05:36:40,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=134160.0, ans=0.2
2023-06-18 05:37:09,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=134220.0, ans=0.125
2023-06-18 05:37:31,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=134280.0, ans=0.0
2023-06-18 05:38:03,433 INFO [train.py:996] (3/4) Epoch 1, batch 22400, loss[loss=0.3058, simple_loss=0.3478, pruned_loss=0.1319, over 21347.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3854, pruned_loss=0.1504, over 4287290.48 frames. ], batch size: 211, lr: 2.54e-02, grad_scale: 32.0
2023-06-18 05:38:10,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=134400.0, ans=0.125
2023-06-18 05:38:19,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=134460.0, ans=0.125
2023-06-18 05:38:21,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=134460.0, ans=0.125
2023-06-18 05:38:44,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=134520.0, ans=0.2
2023-06-18 05:39:07,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.472e+02 4.159e+02 5.652e+02 9.879e+02, threshold=8.318e+02, percent-clipped=2.0
2023-06-18 05:39:39,871 INFO [train.py:996] (3/4) Epoch 1, batch 22450, loss[loss=0.2934, simple_loss=0.335, pruned_loss=0.1259, over 21227.00 frames. ], tot_loss[loss=0.3386, simple_loss=0.3791, pruned_loss=0.1491, over 4283316.82 frames. ], batch size: 144, lr: 2.53e-02, grad_scale: 32.0
2023-06-18 05:39:59,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134700.0, ans=0.1
2023-06-18 05:40:13,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=134760.0, ans=0.2
2023-06-18 05:40:20,457 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:40:36,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0
2023-06-18 05:40:58,376 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 05:41:25,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=135000.0, ans=0.0
2023-06-18 05:41:26,124 INFO [train.py:996] (3/4) Epoch 1, batch 22500, loss[loss=0.3646, simple_loss=0.3763, pruned_loss=0.1764, over 21204.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3734, pruned_loss=0.1476, over 4281281.05 frames. ], batch size: 471, lr: 2.53e-02, grad_scale: 32.0
2023-06-18 05:41:57,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135060.0, ans=0.125
2023-06-18 05:42:40,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.574e+02 4.487e+02 5.410e+02 9.033e+02, threshold=8.975e+02, percent-clipped=2.0
2023-06-18 05:43:03,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0
2023-06-18 05:43:08,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=135300.0, ans=0.0
2023-06-18 05:43:09,812 INFO [train.py:996] (3/4) Epoch 1, batch 22550, loss[loss=0.3588, simple_loss=0.3858, pruned_loss=0.1659, over 21874.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3787, pruned_loss=0.1487, over 4283249.88 frames. ], batch size: 282, lr: 2.53e-02, grad_scale: 32.0
2023-06-18 05:43:46,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=135360.0, ans=0.0
2023-06-18 05:44:20,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=135480.0, ans=0.0
2023-06-18 05:44:38,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=135540.0, ans=0.0
2023-06-18 05:44:59,029 INFO [train.py:996] (3/4) Epoch 1, batch 22600, loss[loss=0.2205, simple_loss=0.2414, pruned_loss=0.09984, over 16348.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3808, pruned_loss=0.1481, over 4277237.00 frames. ], batch size: 62, lr: 2.53e-02, grad_scale: 32.0
2023-06-18 05:44:59,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=135600.0, ans=0.04949747468305833
2023-06-18 05:45:27,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=135660.0, ans=0.125
2023-06-18 05:45:37,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=135660.0, ans=0.125
2023-06-18 05:45:44,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135660.0, ans=0.0
2023-06-18 05:46:07,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 4.199e+02 5.117e+02 6.564e+02 1.237e+03, threshold=1.023e+03, percent-clipped=4.0
2023-06-18 05:46:39,938 INFO [train.py:996] (3/4) Epoch 1, batch 22650, loss[loss=0.4199, simple_loss=0.4635, pruned_loss=0.1882, over 21626.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3778, pruned_loss=0.1472, over 4277833.38 frames. ], batch size: 441, lr: 2.52e-02, grad_scale: 32.0
2023-06-18 05:46:58,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=135900.0, ans=0.0
2023-06-18 05:47:04,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=135960.0, ans=0.125
2023-06-18 05:47:27,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136020.0, ans=0.0
2023-06-18 05:48:20,073 INFO [train.py:996] (3/4) Epoch 1, batch 22700, loss[loss=0.349, simple_loss=0.3754, pruned_loss=0.1613, over 21833.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.3732, pruned_loss=0.1466, over 4273019.71 frames. ], batch size: 317, lr: 2.52e-02, grad_scale: 32.0
2023-06-18 05:49:24,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.785e+02 4.714e+02 6.670e+02 1.093e+03, threshold=9.427e+02, percent-clipped=5.0
2023-06-18 05:49:31,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136380.0, ans=0.125
2023-06-18 05:49:57,043 INFO [train.py:996] (3/4) Epoch 1, batch 22750, loss[loss=0.369, simple_loss=0.4051, pruned_loss=0.1665, over 21738.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3738, pruned_loss=0.1487, over 4260525.65 frames. ], batch size: 124, lr: 2.52e-02, grad_scale: 32.0
2023-06-18 05:50:25,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=136560.0, ans=0.125
2023-06-18 05:50:47,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0
2023-06-18 05:51:35,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=136740.0, ans=0.0
2023-06-18 05:51:35,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=136740.0, ans=0.125
2023-06-18 05:51:39,022 INFO [train.py:996] (3/4) Epoch 1, batch 22800, loss[loss=0.3418, simple_loss=0.3843, pruned_loss=0.1496, over 21495.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.3803, pruned_loss=0.1526, over 4270778.84 frames. ], batch size: 230, lr: 2.52e-02, grad_scale: 32.0
2023-06-18 05:51:59,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5
2023-06-18 05:52:37,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136920.0, ans=0.125
2023-06-18 05:52:43,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.08 vs. limit=15.0
2023-06-18 05:52:47,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 4.256e+02 5.590e+02 8.268e+02 1.334e+03, threshold=1.118e+03, percent-clipped=16.0
2023-06-18 05:53:04,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0
2023-06-18 05:53:12,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=137040.0, ans=0.125
2023-06-18 05:53:20,658 INFO [train.py:996] (3/4) Epoch 1, batch 22850, loss[loss=0.2759, simple_loss=0.3212, pruned_loss=0.1153, over 21662.00 frames. ], tot_loss[loss=0.3415, simple_loss=0.3795, pruned_loss=0.1517, over 4260928.92 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0
2023-06-18 05:53:27,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=137100.0, ans=0.0
2023-06-18 05:53:37,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137100.0, ans=0.1
2023-06-18 05:54:16,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137220.0, ans=0.1
2023-06-18 05:54:27,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=137280.0, ans=0.0
2023-06-18 05:54:37,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=137280.0, ans=0.0
2023-06-18 05:54:50,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=137340.0, ans=0.125
2023-06-18 05:55:06,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=137340.0, ans=0.125
2023-06-18 05:55:09,105 INFO [train.py:996] (3/4) Epoch 1, batch 22900, loss[loss=0.2999, simple_loss=0.3273, pruned_loss=0.1363, over 21727.00 frames. ], tot_loss[loss=0.3406, simple_loss=0.3804, pruned_loss=0.1504, over 4266712.86 frames. ], batch size: 112, lr: 2.51e-02, grad_scale: 32.0
2023-06-18 05:56:13,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.811e+02 3.596e+02 4.279e+02 5.382e+02 9.756e+02, threshold=8.557e+02, percent-clipped=0.0
2023-06-18 05:56:22,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137580.0, ans=0.1
2023-06-18 05:56:52,903 INFO [train.py:996] (3/4) Epoch 1, batch 22950, loss[loss=0.3145, simple_loss=0.4087, pruned_loss=0.1101, over 21682.00 frames. ], tot_loss[loss=0.3442, simple_loss=0.3921, pruned_loss=0.1481, over 4265107.40 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0
2023-06-18 05:57:14,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=137760.0, ans=0.125
2023-06-18 05:58:00,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=137880.0, ans=0.0
2023-06-18 05:58:22,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0
2023-06-18 05:58:34,346 INFO [train.py:996] (3/4) Epoch 1, batch 23000, loss[loss=0.3205, simple_loss=0.3718, pruned_loss=0.1346, over 21516.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3879, pruned_loss=0.1438, over 4274675.51 frames. ], batch size: 131, lr: 2.51e-02, grad_scale: 32.0
2023-06-18 05:58:36,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=138000.0, ans=0.125
2023-06-18 05:59:09,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=138060.0, ans=0.04949747468305833
2023-06-18 05:59:14,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138060.0, ans=0.125
2023-06-18 05:59:42,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.447e+02 4.093e+02 5.344e+02 1.227e+03, threshold=8.186e+02, percent-clipped=4.0
2023-06-18 06:00:01,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0
2023-06-18 06:00:15,821 INFO [train.py:996] (3/4) Epoch 1, batch 23050, loss[loss=0.3763, simple_loss=0.4185, pruned_loss=0.167, over 21555.00 frames. ], tot_loss[loss=0.3436, simple_loss=0.3907, pruned_loss=0.1482, over 4278371.68 frames. ], batch size: 414, lr: 2.50e-02, grad_scale: 32.0
2023-06-18 06:00:35,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=138300.0, ans=0.125
2023-06-18 06:00:48,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=138360.0, ans=0.2
2023-06-18 06:00:49,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138360.0, ans=0.125
2023-06-18 06:01:39,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=138480.0, ans=0.125
2023-06-18 06:01:54,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=138540.0, ans=0.125
2023-06-18 06:02:02,470 INFO [train.py:996] (3/4) Epoch 1, batch 23100, loss[loss=0.3055, simple_loss=0.3375, pruned_loss=0.1367, over 21236.00 frames. ], tot_loss[loss=0.3407, simple_loss=0.3851, pruned_loss=0.1482, over 4266914.75 frames. ], batch size: 548, lr: 2.50e-02, grad_scale: 32.0
2023-06-18 06:02:12,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5
2023-06-18 06:02:43,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=138720.0, ans=0.125
2023-06-18 06:02:45,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=138720.0, ans=0.0
2023-06-18 06:03:11,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.616e+02 4.226e+02 5.778e+02 1.152e+03, threshold=8.452e+02, percent-clipped=7.0
2023-06-18 06:03:37,798 INFO [train.py:996] (3/4) Epoch 1, batch 23150, loss[loss=0.3258, simple_loss=0.3596, pruned_loss=0.146, over 21856.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3784, pruned_loss=0.1469, over 4272258.13 frames. ], batch size: 298, lr: 2.50e-02, grad_scale: 32.0
2023-06-18 06:04:18,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=139020.0, ans=0.125
2023-06-18 06:04:37,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0
2023-06-18 06:05:03,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=139140.0, ans=0.0
2023-06-18 06:05:05,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=139140.0, ans=0.2
2023-06-18 06:05:23,898 INFO [train.py:996] (3/4) Epoch 1, batch 23200, loss[loss=0.3196, simple_loss=0.3619, pruned_loss=0.1387, over 21334.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3769, pruned_loss=0.1473, over 4280951.66 frames. ], batch size: 159, lr: 2.50e-02, grad_scale: 32.0
2023-06-18 06:05:55,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=139260.0, ans=12.0
2023-06-18 06:06:12,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=139320.0, ans=0.125
2023-06-18 06:06:26,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.602e+02 4.092e+02 5.263e+02 8.445e+02, threshold=8.184e+02, percent-clipped=0.0
2023-06-18 06:06:44,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=139440.0, ans=0.125
2023-06-18 06:06:58,358 INFO [train.py:996] (3/4) Epoch 1, batch 23250, loss[loss=0.3498, simple_loss=0.3838, pruned_loss=0.1579, over 21910.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3776, pruned_loss=0.1491, over 4282303.45 frames. ], batch size: 316, lr: 2.49e-02, grad_scale: 32.0
2023-06-18 06:07:27,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=139560.0, ans=0.0
2023-06-18 06:07:48,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=139620.0, ans=0.5
2023-06-18 06:08:52,737 INFO [train.py:996] (3/4) Epoch 1, batch 23300, loss[loss=0.3929, simple_loss=0.4783, pruned_loss=0.1538, over 21684.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3872, pruned_loss=0.1527, over 4279361.39 frames. ], batch size: 389, lr: 2.49e-02, grad_scale: 32.0
2023-06-18 06:08:54,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=139800.0, ans=0.125
2023-06-18 06:08:56,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139800.0, ans=0.1
2023-06-18 06:08:59,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=139800.0, ans=0.125
2023-06-18 06:09:11,777 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:09:19,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=139860.0, ans=0.0
2023-06-18 06:09:56,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=15.0
2023-06-18 06:09:57,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.912e+02 5.511e+02 7.628e+02 1.360e+03, threshold=1.102e+03, percent-clipped=20.0
2023-06-18 06:10:35,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=140040.0, ans=0.125
2023-06-18 06:10:38,091 INFO [train.py:996] (3/4) Epoch 1, batch 23350, loss[loss=0.3358, simple_loss=0.3926, pruned_loss=0.1395, over 21660.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3929, pruned_loss=0.1513, over 4280632.07 frames. ], batch size: 263, lr: 2.49e-02, grad_scale: 32.0
2023-06-18 06:10:46,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.30 vs. limit=22.5
2023-06-18 06:12:10,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0
2023-06-18 06:12:12,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140340.0, ans=0.0
2023-06-18 06:12:16,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.65 vs. limit=6.0
2023-06-18 06:12:19,200 INFO [train.py:996] (3/4) Epoch 1, batch 23400, loss[loss=0.297, simple_loss=0.3366, pruned_loss=0.1287, over 20124.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3835, pruned_loss=0.1454, over 4274686.20 frames. ], batch size: 703, lr: 2.49e-02, grad_scale: 32.0
2023-06-18 06:12:38,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140460.0, ans=0.125
2023-06-18 06:12:38,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140460.0, ans=0.125
2023-06-18 06:13:01,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140520.0, ans=0.125
2023-06-18 06:13:14,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=140520.0, ans=0.04949747468305833
2023-06-18 06:13:27,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.226e+02 4.219e+02 5.285e+02 8.873e+02, threshold=8.438e+02, percent-clipped=0.0
2023-06-18 06:13:51,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5
2023-06-18 06:14:00,434 INFO [train.py:996] (3/4) Epoch 1, batch 23450, loss[loss=0.2857, simple_loss=0.2999, pruned_loss=0.1357, over 20091.00 frames. ], tot_loss[loss=0.3429, simple_loss=0.3859, pruned_loss=0.15, over 4279838.14 frames. ], batch size: 703, lr: 2.48e-02, grad_scale: 32.0
2023-06-18 06:14:23,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0
2023-06-18 06:14:42,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=140820.0, ans=0.0
2023-06-18 06:14:51,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=140820.0, ans=0.0
2023-06-18 06:15:21,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=140880.0, ans=0.125
2023-06-18 06:15:27,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5
2023-06-18 06:15:41,712 INFO [train.py:996] (3/4) Epoch 1, batch 23500, loss[loss=0.3444, simple_loss=0.3803, pruned_loss=0.1543, over 21903.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3862, pruned_loss=0.1521, over 4286979.55 frames. ], batch size: 351, lr: 2.48e-02, grad_scale: 32.0
2023-06-18 06:15:42,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=141000.0, ans=0.07
2023-06-18 06:15:51,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=141000.0, ans=0.125
2023-06-18 06:16:49,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.727e+02 4.969e+02 6.081e+02 9.256e+02, threshold=9.939e+02, percent-clipped=2.0
2023-06-18 06:17:18,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=141240.0, ans=0.2
2023-06-18 06:17:22,052 INFO [train.py:996] (3/4) Epoch 1, batch 23550, loss[loss=0.3077, simple_loss=0.3366, pruned_loss=0.1394, over 21172.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.3793, pruned_loss=0.1506, over 4265145.88 frames. ], batch size: 176, lr: 2.48e-02, grad_scale: 32.0
2023-06-18 06:17:45,920 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 06:17:56,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141360.0, ans=0.1
2023-06-18 06:18:47,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=141540.0, ans=0.0
2023-06-18 06:18:56,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0
2023-06-18 06:19:05,105 INFO [train.py:996] (3/4) Epoch 1, batch 23600, loss[loss=0.4508, simple_loss=0.4642, pruned_loss=0.2187, over 21425.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3821, pruned_loss=0.1517, over 4261016.36 frames. ], batch size: 471, lr: 2.48e-02, grad_scale: 32.0
2023-06-18 06:19:53,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=141720.0, ans=0.125
2023-06-18 06:20:21,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.501e+02 3.688e+02 4.463e+02 5.931e+02 8.627e+02, threshold=8.927e+02, percent-clipped=0.0
2023-06-18 06:20:34,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.62 vs. limit=10.0
2023-06-18 06:20:39,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=141840.0, ans=0.125
2023-06-18 06:20:59,598 INFO [train.py:996] (3/4) Epoch 1, batch 23650, loss[loss=0.2754, simple_loss=0.3495, pruned_loss=0.1007, over 21695.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3812, pruned_loss=0.1488, over 4266568.99 frames. ], batch size: 298, lr: 2.47e-02, grad_scale: 32.0
2023-06-18 06:21:13,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141900.0, ans=0.125
2023-06-18 06:21:47,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=142020.0, ans=0.0
2023-06-18 06:22:00,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0
2023-06-18 06:22:03,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=142080.0, ans=0.2
2023-06-18 06:22:07,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=142080.0, ans=0.0
2023-06-18 06:22:42,910 INFO [train.py:996] (3/4) Epoch 1, batch 23700, loss[loss=0.4099, simple_loss=0.439, pruned_loss=0.1904, over 21851.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.3847, pruned_loss=0.148, over 4269300.02 frames. ], batch size: 118, lr: 2.47e-02, grad_scale: 32.0
2023-06-18 06:23:11,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142260.0, ans=0.125
2023-06-18 06:23:32,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=142320.0, ans=0.0
2023-06-18 06:23:53,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.745e+02 4.445e+02 5.198e+02 9.027e+02, threshold=8.891e+02, percent-clipped=1.0
2023-06-18 06:24:09,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=142440.0, ans=0.035
2023-06-18 06:24:32,856 INFO [train.py:996] (3/4) Epoch 1, batch 23750, loss[loss=0.3135, simple_loss=0.3858, pruned_loss=0.1206, over 21720.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.387, pruned_loss=0.149, over 4272319.78 frames. ], batch size: 351, lr: 2.47e-02, grad_scale: 32.0
2023-06-18 06:24:40,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0
2023-06-18 06:25:31,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=142680.0, ans=10.0
2023-06-18 06:25:41,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=142680.0, ans=0.2
2023-06-18 06:25:43,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142680.0, ans=0.0
2023-06-18 06:25:50,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=142680.0, ans=0.025
2023-06-18 06:26:10,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142740.0, ans=0.1
2023-06-18 06:26:17,119 INFO [train.py:996] (3/4) Epoch 1, batch 23800, loss[loss=0.35, simple_loss=0.4248, pruned_loss=0.1376, over 21733.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3837, pruned_loss=0.1447, over 4274395.70 frames. ], batch size: 351, lr: 2.47e-02, grad_scale: 32.0
2023-06-18 06:26:37,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=142860.0, ans=0.125
2023-06-18 06:26:50,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5
2023-06-18 06:27:27,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.344e+02 4.883e+02 6.088e+02 1.077e+03, threshold=9.766e+02, percent-clipped=8.0
2023-06-18 06:28:02,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=143040.0, ans=0.04949747468305833
2023-06-18 06:28:06,747 INFO [train.py:996] (3/4) Epoch 1, batch 23850, loss[loss=0.5089, simple_loss=0.5191, pruned_loss=0.2493, over 21360.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.3967, pruned_loss=0.1496, over 4280093.87 frames. ], batch size: 507, lr: 2.46e-02, grad_scale: 32.0
2023-06-18 06:28:37,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=143160.0, ans=0.2
2023-06-18 06:28:39,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0
2023-06-18 06:29:13,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=143280.0, ans=0.125
2023-06-18 06:29:16,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143280.0, ans=0.125
2023-06-18 06:29:21,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=143280.0, ans=0.125
2023-06-18 06:29:24,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143280.0, ans=0.125
2023-06-18 06:29:48,576 INFO [train.py:996] (3/4) Epoch 1, batch 23900, loss[loss=0.318, simple_loss=0.3669, pruned_loss=0.1346, over 21190.00 frames. ], tot_loss[loss=0.3556, simple_loss=0.4047, pruned_loss=0.1532, over 4283366.33 frames. ], batch size: 159, lr: 2.46e-02, grad_scale: 32.0
2023-06-18 06:30:14,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=143460.0, ans=0.0
2023-06-18 06:30:19,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=143460.0, ans=0.2
2023-06-18 06:30:30,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=143520.0, ans=0.2
2023-06-18 06:30:38,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=143520.0, ans=0.2
2023-06-18 06:30:56,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.496e+02 3.761e+02 4.724e+02 6.134e+02 1.060e+03, threshold=9.448e+02, percent-clipped=2.0
2023-06-18 06:31:26,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=143640.0, ans=0.125
2023-06-18 06:31:30,059 INFO [train.py:996] (3/4) Epoch 1, batch 23950, loss[loss=0.3182, simple_loss=0.348, pruned_loss=0.1442, over 20066.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.3983, pruned_loss=0.1534, over 4276694.12 frames. ], batch size: 702, lr: 2.46e-02, grad_scale: 32.0
2023-06-18 06:32:35,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5
2023-06-18 06:33:13,944 INFO [train.py:996] (3/4) Epoch 1, batch 24000, loss[loss=0.3859, simple_loss=0.4229, pruned_loss=0.1744, over 21356.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3975, pruned_loss=0.1554, over 4276109.17 frames. ], batch size: 176, lr: 2.46e-02, grad_scale: 32.0
2023-06-18 06:33:13,944 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 06:33:36,588 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.32, simple_loss=0.4122, pruned_loss=0.1139, over 1796401.00 frames.
2023-06-18 06:33:36,589 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-18 06:33:40,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144000.0, ans=0.125
2023-06-18 06:34:27,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144120.0, ans=0.1
2023-06-18 06:34:30,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=144120.0, ans=0.125
2023-06-18 06:34:48,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.687e+02 4.611e+02 5.908e+02 1.149e+03, threshold=9.222e+02, percent-clipped=2.0
2023-06-18 06:35:11,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=22.5
2023-06-18 06:35:20,177 INFO [train.py:996] (3/4) Epoch 1, batch 24050, loss[loss=0.2784, simple_loss=0.339, pruned_loss=0.1089, over 21169.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3987, pruned_loss=0.1556, over 4274133.08 frames. ], batch size: 143, lr: 2.46e-02, grad_scale: 32.0
2023-06-18 06:35:22,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=144300.0, ans=0.2
2023-06-18 06:35:25,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0
2023-06-18 06:35:32,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=144300.0, ans=0.125
2023-06-18 06:35:48,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0
2023-06-18 06:35:56,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=144360.0, ans=0.2
2023-06-18 06:36:24,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=144480.0, ans=0.2
2023-06-18 06:36:48,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144540.0, ans=0.125
2023-06-18 06:36:50,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=144540.0, ans=0.125
2023-06-18 06:36:51,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=144540.0, ans=0.0
2023-06-18 06:37:06,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=144600.0, ans=0.2
2023-06-18 06:37:07,451 INFO [train.py:996] (3/4) Epoch 1, batch 24100, loss[loss=0.2522, simple_loss=0.3155, pruned_loss=0.0944, over 16538.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3977, pruned_loss=0.1525, over 4271697.87 frames. ], batch size: 61, lr: 2.45e-02, grad_scale: 32.0
2023-06-18 06:37:46,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=144720.0, ans=0.2
2023-06-18 06:38:12,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.321e+02 4.048e+02 5.410e+02 1.299e+03, threshold=8.096e+02, percent-clipped=1.0
2023-06-18 06:38:33,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=144840.0, ans=0.125
2023-06-18 06:38:49,121 INFO [train.py:996] (3/4) Epoch 1, batch 24150, loss[loss=0.3346, simple_loss=0.3697, pruned_loss=0.1498, over 21486.00 frames. ], tot_loss[loss=0.3546, simple_loss=0.398, pruned_loss=0.1556, over 4281501.67 frames. ], batch size: 211, lr: 2.45e-02, grad_scale: 32.0
2023-06-18 06:39:17,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=144960.0, ans=0.0
2023-06-18 06:39:22,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=144960.0, ans=0.125
2023-06-18 06:40:31,830 INFO [train.py:996] (3/4) Epoch 1, batch 24200, loss[loss=0.296, simple_loss=0.3635, pruned_loss=0.1143, over 21456.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.4011, pruned_loss=0.1581, over 4280598.59 frames. ], batch size: 211, lr: 2.45e-02, grad_scale: 32.0
2023-06-18 06:40:57,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.52 vs. limit=22.5
2023-06-18 06:40:58,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145260.0, ans=0.125
2023-06-18 06:41:35,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0
2023-06-18 06:41:49,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 3.664e+02 4.494e+02 5.781e+02 1.168e+03, threshold=8.988e+02, percent-clipped=4.0
2023-06-18 06:42:21,370 INFO [train.py:996] (3/4) Epoch 1, batch 24250, loss[loss=0.3493, simple_loss=0.4178, pruned_loss=0.1404, over 21507.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3948, pruned_loss=0.147, over 4286250.25 frames. ], batch size: 471, lr: 2.45e-02, grad_scale: 32.0
2023-06-18 06:42:53,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=145560.0, ans=0.125
2023-06-18 06:43:13,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5
2023-06-18 06:43:14,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145620.0, ans=0.1
2023-06-18 06:43:35,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0
2023-06-18 06:43:36,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=145680.0, ans=0.125
2023-06-18 06:44:01,971 INFO [train.py:996] (3/4) Epoch 1, batch 24300, loss[loss=0.1781, simple_loss=0.2513, pruned_loss=0.05248, over 21085.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3835, pruned_loss=0.1376, over 4286523.55 frames. ], batch size: 143, lr: 2.44e-02, grad_scale: 16.0
2023-06-18 06:44:19,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=145800.0, ans=0.125
2023-06-18 06:44:24,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=145860.0, ans=0.125
2023-06-18 06:45:13,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 3.046e+02 3.863e+02 5.440e+02 1.504e+03, threshold=7.726e+02, percent-clipped=4.0
2023-06-18 06:45:19,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=145980.0, ans=0.0
2023-06-18 06:45:24,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146040.0, ans=0.125
2023-06-18 06:45:32,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=146040.0, ans=0.125
2023-06-18 06:45:34,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=146040.0, ans=0.95
2023-06-18 06:45:43,151 INFO [train.py:996] (3/4) Epoch 1, batch 24350, loss[loss=0.4029, simple_loss=0.4385, pruned_loss=0.1837, over 21893.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3803, pruned_loss=0.1386, over 4286977.09 frames. ], batch size: 118, lr: 2.44e-02, grad_scale: 16.0
2023-06-18 06:46:43,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=146220.0, ans=0.125
2023-06-18 06:47:08,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0
2023-06-18 06:47:32,112 INFO [train.py:996] (3/4) Epoch 1, batch 24400, loss[loss=0.3398, simple_loss=0.376, pruned_loss=0.1519, over 21801.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.389, pruned_loss=0.1457, over 4285657.82 frames. ], batch size: 107, lr: 2.44e-02, grad_scale: 32.0
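
Each train.py:996 line reports the pruned-transducer objective twice: loss[...] for the current batch and tot_loss[...] averaged over recent batches, with simple_loss (from the cheap linear joiner used to prune the lattice) and pruned_loss (from the full joiner evaluated only inside the pruned region) broken out separately. A hedged sketch of how such a combined objective is typically formed in k2-style recipes; the 0.5 weight and the warm-up ramp shape here are assumptions, not the recipe's exact schedule.

    def combined_loss(simple_loss, pruned_loss, batch_idx,
                      simple_loss_scale=0.5, warm_step=2000):
        # Early in training, lean mostly on the stable simple loss;
        # after warm-up, keep a fixed mix of the two terms.
        if batch_idx < warm_step:
            s = simple_loss_scale + (1.0 - simple_loss_scale) * (
                1.0 - batch_idx / warm_step)
        else:
            s = simple_loss_scale
        return s * simple_loss + (1.0 - s) * pruned_loss
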
2023-06-18 06:47:47,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=146460.0, ans=0.09899494936611666
2023-06-18 06:48:25,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=146520.0, ans=0.0
2023-06-18 06:48:41,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=146580.0, ans=0.0
2023-06-18 06:48:43,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146580.0, ans=0.1
2023-06-18 06:48:44,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 4.218e+02 5.437e+02 7.202e+02 1.402e+03, threshold=1.087e+03, percent-clipped=21.0
2023-06-18 06:48:46,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=146580.0, ans=0.2
2023-06-18 06:48:53,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=146640.0, ans=0.125
2023-06-18 06:49:08,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=146640.0, ans=0.07
2023-06-18 06:49:14,790 INFO [train.py:996] (3/4) Epoch 1, batch 24450, loss[loss=0.3499, simple_loss=0.4151, pruned_loss=0.1424, over 21713.00 frames. ], tot_loss[loss=0.3447, simple_loss=0.3927, pruned_loss=0.1483, over 4287674.58 frames. ], batch size: 389, lr: 2.44e-02, grad_scale: 32.0
2023-06-18 06:49:18,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=146700.0, ans=0.0
2023-06-18 06:49:48,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146760.0, ans=0.1
2023-06-18 06:50:40,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146940.0, ans=0.1
2023-06-18 06:50:46,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=146940.0, ans=0.07
2023-06-18 06:50:50,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0
2023-06-18 06:50:51,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=146940.0, ans=0.125
2023-06-18 06:50:56,332 INFO [train.py:996] (3/4) Epoch 1, batch 24500, loss[loss=0.3295, simple_loss=0.3898, pruned_loss=0.1346, over 21401.00 frames. ], tot_loss[loss=0.3413, simple_loss=0.3903, pruned_loss=0.1462, over 4286042.86 frames. ], batch size: 144, lr: 2.43e-02, grad_scale: 32.0
2023-06-18 06:51:22,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147060.0, ans=0.1
2023-06-18 06:51:29,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0
2023-06-18 06:51:40,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=147120.0, ans=0.5
2023-06-18 06:52:14,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.850e+02 5.028e+02 6.051e+02 9.604e+02, threshold=1.006e+03, percent-clipped=0.0
2023-06-18 06:52:44,361 INFO [train.py:996] (3/4) Epoch 1, batch 24550, loss[loss=0.3901, simple_loss=0.435, pruned_loss=0.1726, over 21547.00 frames. ], tot_loss[loss=0.3445, simple_loss=0.3924, pruned_loss=0.1483, over 4284823.33 frames. ], batch size: 389, lr: 2.43e-02, grad_scale: 32.0
2023-06-18 06:52:51,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-18 06:53:18,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=147360.0, ans=0.0
2023-06-18 06:53:35,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147420.0, ans=0.125
2023-06-18 06:54:26,335 INFO [train.py:996] (3/4) Epoch 1, batch 24600, loss[loss=0.3502, simple_loss=0.3715, pruned_loss=0.1644, over 19970.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3885, pruned_loss=0.1495, over 4278513.80 frames. ], batch size: 703, lr: 2.43e-02, grad_scale: 32.0
2023-06-18 06:54:33,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147600.0, ans=0.125
2023-06-18 06:55:38,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.603e+02 4.230e+02 5.450e+02 1.074e+03, threshold=8.460e+02, percent-clipped=1.0
2023-06-18 06:56:08,473 INFO [train.py:996] (3/4) Epoch 1, batch 24650, loss[loss=0.3141, simple_loss=0.3476, pruned_loss=0.1403, over 21511.00 frames. ], tot_loss[loss=0.3371, simple_loss=0.3796, pruned_loss=0.1473, over 4277037.81 frames. ], batch size: 441, lr: 2.43e-02, grad_scale: 32.0
2023-06-18 06:56:21,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.67 vs. limit=22.5
2023-06-18 06:57:41,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=12.0
2023-06-18 06:57:50,915 INFO [train.py:996] (3/4) Epoch 1, batch 24700, loss[loss=0.3489, simple_loss=0.3804, pruned_loss=0.1587, over 21244.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3773, pruned_loss=0.1449, over 4269864.78 frames. ], batch size: 471, lr: 2.43e-02, grad_scale: 32.0
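
The optim.py:471 lines summarize the distribution of gradient norms seen since the previous report: the five numbers read as min / 25% / 50% / 75% / max, and across every entry in this log the threshold equals Clipping_scale (2.0) times the reported median, with percent-clipped giving how often a norm exceeded it. A small sketch that reproduces the same summary; deriving the threshold from the median is inferred from the numbers above, not taken from the optimizer's source.

    import torch

    def grad_norm_report(norms, clipping_scale=2.0):
        # norms: 1-D float tensor of gradient norms since the last report.
        q = torch.quantile(
            norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = (clipping_scale * q[2]).item()   # scale * median (assumed)
        pct = (100.0 * (norms > threshold).float().mean()).item()
        print("grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")

    grad_norm_report(torch.abs(torch.randn(200)) * 400.0)
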
2023-06-18 06:58:02,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=148200.0, ans=0.0
2023-06-18 06:58:21,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=148260.0, ans=0.125
2023-06-18 06:58:33,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=148320.0, ans=0.0
2023-06-18 06:58:35,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=148320.0, ans=0.0
2023-06-18 06:58:53,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=148380.0, ans=10.0
2023-06-18 06:59:03,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.227e+02 3.816e+02 4.904e+02 7.765e+02, threshold=7.633e+02, percent-clipped=0.0
2023-06-18 06:59:11,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=148440.0, ans=0.2
2023-06-18 06:59:16,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=148440.0, ans=0.125
2023-06-18 06:59:24,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=148440.0, ans=0.125
2023-06-18 06:59:28,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=148440.0, ans=0.0
2023-06-18 06:59:29,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0
2023-06-18 06:59:32,553 INFO [train.py:996] (3/4) Epoch 1, batch 24750, loss[loss=0.2903, simple_loss=0.3223, pruned_loss=0.1292, over 21453.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3702, pruned_loss=0.1421, over 4260955.81 frames. ], batch size: 212, lr: 2.42e-02, grad_scale: 32.0
2023-06-18 07:00:37,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5
2023-06-18 07:00:54,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=148740.0, ans=0.125
2023-06-18 07:00:54,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=148740.0, ans=0.2
2023-06-18 07:01:14,213 INFO [train.py:996] (3/4) Epoch 1, batch 24800, loss[loss=0.3767, simple_loss=0.3927, pruned_loss=0.1804, over 21626.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3659, pruned_loss=0.1409, over 4260276.27 frames. ], batch size: 508, lr: 2.42e-02, grad_scale: 32.0
2023-06-18 07:01:44,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=148860.0, ans=0.125
2023-06-18 07:02:10,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=148920.0, ans=0.125
2023-06-18 07:02:27,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.589e+02 4.591e+02 5.888e+02 8.855e+02, threshold=9.183e+02, percent-clipped=11.0
2023-06-18 07:02:40,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0
2023-06-18 07:02:42,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=149040.0, ans=0.0
2023-06-18 07:02:56,553 INFO [train.py:996] (3/4) Epoch 1, batch 24850, loss[loss=0.2853, simple_loss=0.3371, pruned_loss=0.1167, over 21647.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3669, pruned_loss=0.1424, over 4262919.98 frames. ], batch size: 263, lr: 2.42e-02, grad_scale: 32.0
2023-06-18 07:03:20,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=149160.0, ans=0.05
2023-06-18 07:03:30,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149160.0, ans=0.1
2023-06-18 07:04:16,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149340.0, ans=0.125
2023-06-18 07:04:16,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149340.0, ans=0.1
2023-06-18 07:04:39,611 INFO [train.py:996] (3/4) Epoch 1, batch 24900, loss[loss=0.3949, simple_loss=0.4313, pruned_loss=0.1793, over 21784.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3694, pruned_loss=0.1433, over 4263303.22 frames. ], batch size: 124, lr: 2.42e-02, grad_scale: 32.0
2023-06-18 07:05:07,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149460.0, ans=0.1
2023-06-18 07:05:28,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149520.0, ans=0.125
2023-06-18 07:05:53,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.811e+02 4.758e+02 6.118e+02 1.056e+03, threshold=9.515e+02, percent-clipped=2.0
2023-06-18 07:05:54,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0
2023-06-18 07:06:04,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0
2023-06-18 07:06:09,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=149640.0, ans=0.125
2023-06-18 07:06:23,749 INFO [train.py:996] (3/4) Epoch 1, batch 24950, loss[loss=0.4625, simple_loss=0.4768, pruned_loss=0.2242, over 21792.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.38, pruned_loss=0.1499, over 4270466.94 frames. ], batch size: 441, lr: 2.41e-02, grad_scale: 32.0
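
The scaling.py:962 Whitening lines compare a "whiteness" statistic of a module's activations against a limit; when the metric exceeds the limit (as in several entries above), the module's output covariance is being pushed back toward a whiter, less degenerate shape. One plausible reading of the metric, sketched under the assumption that it measures the spread of covariance eigenvalues within each channel group (a perfectly white signal scores 1.0):

    import torch

    def whitening_metric(x, num_groups=1):
        # x: (num_frames, num_channels) activations; split channels into
        # groups and measure eigenvalue spread of each group's covariance.
        n, c = x.shape
        gsize = c // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * gsize:(g + 1) * gsize]
            xg = xg - xg.mean(dim=0)
            cov = (xg.t() @ xg) / n
            eigs = torch.linalg.eigvalsh(cov)
            # mean squared eigenvalue over squared mean eigenvalue:
            # 1.0 iff all eigenvalues equal (white), larger when lopsided.
            metrics.append((eigs ** 2).mean() / eigs.mean() ** 2)
        return torch.stack(metrics).mean()

    x = torch.randn(1000, 256)
    print(f"metric={whitening_metric(x).item():.2f} vs. limit=15.0")
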
2023-06-18 07:06:28,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=12.0
2023-06-18 07:07:08,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=149820.0, ans=0.125
2023-06-18 07:07:19,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=149820.0, ans=0.125
2023-06-18 07:07:50,419 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.189e-01
2023-06-18 07:08:07,182 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:08:08,487 INFO [train.py:996] (3/4) Epoch 1, batch 25000, loss[loss=0.3575, simple_loss=0.4132, pruned_loss=0.1509, over 21612.00 frames. ], tot_loss[loss=0.3461, simple_loss=0.3872, pruned_loss=0.1525, over 4273109.58 frames. ], batch size: 263, lr: 2.41e-02, grad_scale: 32.0
2023-06-18 07:08:10,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150000.0, ans=0.125
2023-06-18 07:08:40,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150060.0, ans=0.125
2023-06-18 07:08:42,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=150060.0, ans=0.125
2023-06-18 07:08:44,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=150060.0, ans=0.125
2023-06-18 07:09:19,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150180.0, ans=0.1
2023-06-18 07:09:27,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.657e+02 3.412e+02 4.030e+02 5.230e+02 1.013e+03, threshold=8.059e+02, percent-clipped=2.0
2023-06-18 07:09:41,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150240.0, ans=0.1
2023-06-18 07:09:57,180 INFO [train.py:996] (3/4) Epoch 1, batch 25050, loss[loss=0.319, simple_loss=0.3522, pruned_loss=0.1429, over 21442.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3798, pruned_loss=0.1497, over 4271967.13 frames. ], batch size: 441, lr: 2.41e-02, grad_scale: 32.0
2023-06-18 07:10:20,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=150360.0, ans=0.07
2023-06-18 07:10:26,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=150360.0, ans=0.125
2023-06-18 07:11:12,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=12.0
2023-06-18 07:11:18,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=150540.0, ans=0.04949747468305833
2023-06-18 07:11:40,493 INFO [train.py:996] (3/4) Epoch 1, batch 25100, loss[loss=0.2821, simple_loss=0.3175, pruned_loss=0.1233, over 20746.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3727, pruned_loss=0.1472, over 4267471.86 frames. ], batch size: 608, lr: 2.41e-02, grad_scale: 32.0
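
The scaling.py:1052 WithLoss lines report the summed value of an auxiliary penalty attached to the self-attention weights (often exactly 0.000e+00 when the weights stay inside their allowed range). A generic way to attach such a penalty without disturbing the forward value is an identity autograd function that routes gradient into the penalty term; the following is a hypothetical reconstruction of the pattern, not the library's implementation.

    import torch

    class WithAuxLoss(torch.autograd.Function):
        """Forward is identity on x; backward passes grad_output through
        and feeds a unit gradient into aux_loss so the penalty trains
        whatever produced it."""
        @staticmethod
        def forward(ctx, x, aux_loss):
            ctx.aux_shape = aux_loss.shape
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, torch.ones(ctx.aux_shape,
                                           device=grad_output.device)

    attn = torch.softmax(torch.randn(4, 10, 10, requires_grad=True), dim=-1)
    penalty = (attn - attn.clamp(max=0.5)).sum()   # illustrative penalty
    attn = WithAuxLoss.apply(attn, penalty)
    print(f"WithLoss: loss-sum={penalty.detach().item():.3e}")
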
2023-06-18 07:11:49,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=150600.0, ans=0.125
2023-06-18 07:12:21,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150720.0, ans=0.1
2023-06-18 07:12:44,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=150780.0, ans=0.5
2023-06-18 07:12:52,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.624e+02 4.936e+02 6.636e+02 1.221e+03, threshold=9.872e+02, percent-clipped=16.0
2023-06-18 07:13:16,003 INFO [train.py:996] (3/4) Epoch 1, batch 25150, loss[loss=0.3001, simple_loss=0.3817, pruned_loss=0.1093, over 21811.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3735, pruned_loss=0.1429, over 4259817.85 frames. ], batch size: 282, lr: 2.41e-02, grad_scale: 32.0
2023-06-18 07:13:18,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=150900.0, ans=0.125
2023-06-18 07:13:42,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=150960.0, ans=0.2
2023-06-18 07:14:19,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=151080.0, ans=0.2
2023-06-18 07:14:56,919 INFO [train.py:996] (3/4) Epoch 1, batch 25200, loss[loss=0.287, simple_loss=0.3557, pruned_loss=0.1091, over 21667.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3715, pruned_loss=0.1392, over 4259447.66 frames. ], batch size: 230, lr: 2.40e-02, grad_scale: 32.0
2023-06-18 07:15:26,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=151260.0, ans=0.07
2023-06-18 07:16:14,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 3.238e+02 4.132e+02 5.215e+02 8.390e+02, threshold=8.263e+02, percent-clipped=0.0
2023-06-18 07:16:29,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151440.0, ans=0.125
2023-06-18 07:16:36,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=151440.0, ans=10.0
2023-06-18 07:16:38,437 INFO [train.py:996] (3/4) Epoch 1, batch 25250, loss[loss=0.2772, simple_loss=0.3247, pruned_loss=0.1148, over 21219.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3691, pruned_loss=0.1368, over 4255220.11 frames. ], batch size: 176, lr: 2.40e-02, grad_scale: 32.0
2023-06-18 07:16:52,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.34 vs. limit=22.5
2023-06-18 07:16:58,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151560.0, ans=0.125
2023-06-18 07:17:59,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0
2023-06-18 07:18:02,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=151680.0, ans=0.0
2023-06-18 07:18:21,295 INFO [train.py:996] (3/4) Epoch 1, batch 25300, loss[loss=0.34, simple_loss=0.3455, pruned_loss=0.1673, over 20296.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3664, pruned_loss=0.1373, over 4252640.19 frames. ], batch size: 703, lr: 2.40e-02, grad_scale: 32.0
2023-06-18 07:18:35,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=151800.0, ans=0.2
2023-06-18 07:18:38,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=151800.0, ans=0.125
2023-06-18 07:18:58,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=151860.0, ans=0.2
2023-06-18 07:19:23,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=151920.0, ans=0.125
2023-06-18 07:19:26,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=151980.0, ans=0.5
2023-06-18 07:19:39,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.569e+02 4.461e+02 5.778e+02 9.355e+02, threshold=8.922e+02, percent-clipped=5.0
2023-06-18 07:19:42,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0
2023-06-18 07:19:51,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=152040.0, ans=0.0
2023-06-18 07:20:03,966 INFO [train.py:996] (3/4) Epoch 1, batch 25350, loss[loss=0.3707, simple_loss=0.4032, pruned_loss=0.1691, over 21393.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3715, pruned_loss=0.1381, over 4256861.17 frames. ], batch size: 507, lr: 2.40e-02, grad_scale: 32.0
2023-06-18 07:20:14,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0
2023-06-18 07:20:31,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152160.0, ans=0.125
2023-06-18 07:20:32,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=152160.0, ans=0.0
2023-06-18 07:20:35,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152160.0, ans=0.1
2023-06-18 07:20:37,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152160.0, ans=0.1
2023-06-18 07:20:37,414 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.879e-03
2023-06-18 07:20:43,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=152160.0, ans=0.2
2023-06-18 07:20:52,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152220.0, ans=0.1
2023-06-18 07:21:39,297 INFO [train.py:996] (3/4) Epoch 1, batch 25400, loss[loss=0.2993, simple_loss=0.3424, pruned_loss=0.1281, over 21272.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3671, pruned_loss=0.1359, over 4248275.76 frames. ], batch size: 159, lr: 2.39e-02, grad_scale: 32.0
2023-06-18 07:22:56,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 3.535e+02 4.232e+02 5.710e+02 1.225e+03, threshold=8.465e+02, percent-clipped=5.0
2023-06-18 07:23:09,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152640.0, ans=0.125
2023-06-18 07:23:11,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=152640.0, ans=0.1
2023-06-18 07:23:11,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=152640.0, ans=0.2
2023-06-18 07:23:20,763 INFO [train.py:996] (3/4) Epoch 1, batch 25450, loss[loss=0.3448, simple_loss=0.3801, pruned_loss=0.1547, over 21862.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3675, pruned_loss=0.138, over 4243286.67 frames. ], batch size: 107, lr: 2.39e-02, grad_scale: 32.0
2023-06-18 07:23:23,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0
2023-06-18 07:23:41,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=152760.0, ans=0.2
2023-06-18 07:23:43,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0
2023-06-18 07:23:56,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=152760.0, ans=0.0
2023-06-18 07:24:13,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0
2023-06-18 07:24:55,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0
2023-06-18 07:25:04,164 INFO [train.py:996] (3/4) Epoch 1, batch 25500, loss[loss=0.3794, simple_loss=0.4235, pruned_loss=0.1677, over 21888.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3649, pruned_loss=0.1336, over 4237847.49 frames. ], batch size: 372, lr: 2.39e-02, grad_scale: 32.0
2023-06-18 07:25:48,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=153120.0, ans=0.09899494936611666
2023-06-18 07:25:49,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0
2023-06-18 07:26:12,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0
2023-06-18 07:26:22,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.487e+02 4.551e+02 5.429e+02 1.003e+03, threshold=9.102e+02, percent-clipped=2.0
2023-06-18 07:26:52,227 INFO [train.py:996] (3/4) Epoch 1, batch 25550, loss[loss=0.2931, simple_loss=0.3465, pruned_loss=0.1199, over 21433.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.372, pruned_loss=0.1346, over 4248114.53 frames. ], batch size: 131, lr: 2.39e-02, grad_scale: 32.0
2023-06-18 07:27:14,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=153360.0, ans=0.0
2023-06-18 07:27:40,629 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:27:45,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153420.0, ans=0.1
2023-06-18 07:28:34,436 INFO [train.py:996] (3/4) Epoch 1, batch 25600, loss[loss=0.3761, simple_loss=0.415, pruned_loss=0.1686, over 21367.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3789, pruned_loss=0.1367, over 4258967.30 frames. ], batch size: 548, lr: 2.39e-02, grad_scale: 32.0
2023-06-18 07:28:49,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=153660.0, ans=0.2
2023-06-18 07:29:06,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=153660.0, ans=0.125
2023-06-18 07:29:14,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=153720.0, ans=0.0
2023-06-18 07:29:33,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=153780.0, ans=0.125
2023-06-18 07:29:36,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.502e+02 4.172e+02 4.983e+02 8.051e+02, threshold=8.344e+02, percent-clipped=0.0
2023-06-18 07:29:47,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0
2023-06-18 07:29:48,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153840.0, ans=0.125
2023-06-18 07:30:10,881 INFO [train.py:996] (3/4) Epoch 1, batch 25650, loss[loss=0.3133, simple_loss=0.346, pruned_loss=0.1404, over 21689.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3821, pruned_loss=0.1421, over 4260088.30 frames. ], batch size: 333, lr: 2.38e-02, grad_scale: 32.0
2023-06-18 07:31:00,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=154020.0, ans=0.125
2023-06-18 07:31:00,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154020.0, ans=0.1
2023-06-18 07:31:02,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154080.0, ans=0.125
2023-06-18 07:31:33,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=154140.0, ans=0.125
2023-06-18 07:31:46,140 INFO [train.py:996] (3/4) Epoch 1, batch 25700, loss[loss=0.3478, simple_loss=0.3879, pruned_loss=0.1539, over 21771.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.3818, pruned_loss=0.1454, over 4263012.08 frames. ], batch size: 112, lr: 2.38e-02, grad_scale: 32.0
2023-06-18 07:31:52,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0
2023-06-18 07:31:53,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0
2023-06-18 07:32:04,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154260.0, ans=0.125
2023-06-18 07:32:43,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=154320.0, ans=0.2
2023-06-18 07:32:53,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=154380.0, ans=0.125
2023-06-18 07:33:00,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.737e+02 4.412e+02 5.511e+02 6.649e+02 1.111e+03, threshold=1.102e+03, percent-clipped=12.0
2023-06-18 07:33:05,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=154380.0, ans=0.125
2023-06-18 07:33:30,715 INFO [train.py:996] (3/4) Epoch 1, batch 25750, loss[loss=0.4243, simple_loss=0.4637, pruned_loss=0.1924, over 21564.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.389, pruned_loss=0.1501, over 4261370.56 frames. ], batch size: 230, lr: 2.38e-02, grad_scale: 32.0
2023-06-18 07:33:52,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=154560.0, ans=0.125
2023-06-18 07:34:21,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=154620.0, ans=10.0
2023-06-18 07:34:28,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=154620.0, ans=0.0
2023-06-18 07:34:40,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=154680.0, ans=0.2
2023-06-18 07:34:44,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=154680.0, ans=0.035
2023-06-18 07:35:16,170 INFO [train.py:996] (3/4) Epoch 1, batch 25800, loss[loss=0.406, simple_loss=0.4466, pruned_loss=0.1827, over 21332.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.4007, pruned_loss=0.156, over 4266378.07 frames. ], batch size: 548, lr: 2.38e-02, grad_scale: 32.0
2023-06-18 07:36:17,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154980.0, ans=0.1
2023-06-18 07:36:27,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=154980.0, ans=0.125
2023-06-18 07:36:28,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.746e+02 4.301e+02 5.401e+02 1.441e+03, threshold=8.601e+02, percent-clipped=2.0
2023-06-18 07:36:42,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=155040.0, ans=0.125
2023-06-18 07:36:57,492 INFO [train.py:996] (3/4) Epoch 1, batch 25850, loss[loss=0.348, simple_loss=0.4017, pruned_loss=0.1472, over 19940.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.4031, pruned_loss=0.1542, over 4268601.76 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0
2023-06-18 07:37:16,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=155100.0, ans=0.0
2023-06-18 07:37:50,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0
2023-06-18 07:38:33,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=155340.0, ans=0.04949747468305833
2023-06-18 07:38:46,014 INFO [train.py:996] (3/4) Epoch 1, batch 25900, loss[loss=0.4987, simple_loss=0.5294, pruned_loss=0.234, over 21572.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.404, pruned_loss=0.1542, over 4275318.63 frames. ], batch size: 507, lr: 2.37e-02, grad_scale: 32.0
2023-06-18 07:38:53,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155400.0, ans=0.125
2023-06-18 07:39:10,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0
2023-06-18 07:39:22,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=155520.0, ans=0.125
2023-06-18 07:39:32,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0
2023-06-18 07:39:55,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=155580.0, ans=0.125
2023-06-18 07:39:58,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.678e+02 3.705e+02 4.419e+02 5.739e+02 1.257e+03, threshold=8.839e+02, percent-clipped=5.0
2023-06-18 07:40:02,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155580.0, ans=0.1
2023-06-18 07:40:28,218 INFO [train.py:996] (3/4) Epoch 1, batch 25950, loss[loss=0.3678, simple_loss=0.4049, pruned_loss=0.1653, over 21376.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.4076, pruned_loss=0.1565, over 4274910.01 frames. ], batch size: 159, lr: 2.37e-02, grad_scale: 32.0
2023-06-18 07:40:31,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.87 vs. limit=6.0
2023-06-18 07:41:24,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0
2023-06-18 07:41:28,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=155820.0, ans=0.125
2023-06-18 07:42:07,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155940.0, ans=0.1
2023-06-18 07:42:10,840 INFO [train.py:996] (3/4) Epoch 1, batch 26000, loss[loss=0.4032, simple_loss=0.4444, pruned_loss=0.181, over 21990.00 frames. ], tot_loss[loss=0.3595, simple_loss=0.4096, pruned_loss=0.1547, over 4274142.44 frames. ], batch size: 317, lr: 2.37e-02, grad_scale: 32.0
2023-06-18 07:42:39,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.17 vs. limit=22.5
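
The grad_scale field on the train.py:996 lines is the current loss scale of fp16 mixed-precision training; it is halved after a step whose gradients overflow and grows back after a run of good steps, which is why it moves between 16.0, 32.0 and 64.0 across this stretch of the log. The standard torch.cuda.amp pattern, with illustrative model and optimizer names:

    import torch

    model = torch.nn.Linear(80, 512).cuda()
    opt = torch.optim.Adam(model.parameters(), lr=2.4e-2)
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    def train_step(feats, targets):
        opt.zero_grad()
        with torch.cuda.amp.autocast():
            loss = ((model(feats) - targets) ** 2).mean()
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(opt)                # skips the update if grads overflowed
        scaler.update()                 # grows or shrinks the scale
        return loss.detach(), scaler.get_scale()
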
2023-06-18 07:42:54,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=156060.0, ans=0.125
2023-06-18 07:43:02,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0
2023-06-18 07:43:25,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=156180.0, ans=0.125
2023-06-18 07:43:27,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.511e+02 4.125e+02 5.678e+02 8.372e+02, threshold=8.249e+02, percent-clipped=0.0
2023-06-18 07:43:32,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=156180.0, ans=0.125
2023-06-18 07:43:51,420 INFO [train.py:996] (3/4) Epoch 1, batch 26050, loss[loss=0.356, simple_loss=0.4156, pruned_loss=0.1482, over 17754.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.4088, pruned_loss=0.1557, over 4271079.58 frames. ], batch size: 60, lr: 2.37e-02, grad_scale: 32.0
2023-06-18 07:44:08,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=12.0
2023-06-18 07:44:37,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156420.0, ans=0.125
2023-06-18 07:44:55,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=156480.0, ans=0.0
2023-06-18 07:45:31,160 INFO [train.py:996] (3/4) Epoch 1, batch 26100, loss[loss=0.2972, simple_loss=0.3484, pruned_loss=0.123, over 20122.00 frames. ], tot_loss[loss=0.357, simple_loss=0.4033, pruned_loss=0.1553, over 4268372.45 frames. ], batch size: 702, lr: 2.36e-02, grad_scale: 32.0
2023-06-18 07:45:34,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=156600.0, ans=0.125
2023-06-18 07:45:53,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=156660.0, ans=0.125
2023-06-18 07:46:01,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0
2023-06-18 07:46:43,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=156780.0, ans=0.0
2023-06-18 07:46:48,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.762e+02 4.665e+02 5.349e+02 1.153e+03, threshold=9.330e+02, percent-clipped=6.0
2023-06-18 07:47:00,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.57 vs. limit=15.0
2023-06-18 07:47:12,610 INFO [train.py:996] (3/4) Epoch 1, batch 26150, loss[loss=0.3229, simple_loss=0.3577, pruned_loss=0.144, over 19991.00 frames. ], tot_loss[loss=0.3538, simple_loss=0.3983, pruned_loss=0.1547, over 4266055.50 frames. ], batch size: 702, lr: 2.36e-02, grad_scale: 32.0
2023-06-18 07:47:13,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156900.0, ans=0.125
2023-06-18 07:47:28,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=156900.0, ans=0.125
2023-06-18 07:47:55,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157020.0, ans=0.1
2023-06-18 07:48:19,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157080.0, ans=0.1
2023-06-18 07:48:30,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=157080.0, ans=0.125
2023-06-18 07:48:56,347 INFO [train.py:996] (3/4) Epoch 1, batch 26200, loss[loss=0.3085, simple_loss=0.368, pruned_loss=0.1245, over 21848.00 frames. ], tot_loss[loss=0.35, simple_loss=0.3978, pruned_loss=0.1511, over 4271572.29 frames. ], batch size: 118, lr: 2.36e-02, grad_scale: 32.0
2023-06-18 07:49:23,186 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:50:09,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.383e+02 4.279e+02 5.483e+02 1.348e+03, threshold=8.558e+02, percent-clipped=4.0
2023-06-18 07:50:23,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=157440.0, ans=0.0
2023-06-18 07:50:28,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=157440.0, ans=0.2
2023-06-18 07:50:50,258 INFO [train.py:996] (3/4) Epoch 1, batch 26250, loss[loss=0.3881, simple_loss=0.4227, pruned_loss=0.1768, over 21968.00 frames. ], tot_loss[loss=0.3513, simple_loss=0.403, pruned_loss=0.1498, over 4276481.21 frames. ], batch size: 124, lr: 2.36e-02, grad_scale: 32.0
2023-06-18 07:50:52,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=157500.0, ans=0.0
2023-06-18 07:50:59,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0
2023-06-18 07:51:09,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=157560.0, ans=0.0
2023-06-18 07:51:11,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=157560.0, ans=0.2
2023-06-18 07:51:21,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5
2023-06-18 07:51:32,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157620.0, ans=0.1
2023-06-18 07:51:45,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=157680.0, ans=0.0
2023-06-18 07:52:31,379 INFO [train.py:996] (3/4) Epoch 1, batch 26300, loss[loss=0.3143, simple_loss=0.359, pruned_loss=0.1348, over 21739.00 frames. ], tot_loss[loss=0.3508, simple_loss=0.3994, pruned_loss=0.1511, over 4282800.94 frames. ], batch size: 112, lr: 2.36e-02, grad_scale: 64.0
2023-06-18 07:53:38,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.639e+02 4.284e+02 5.347e+02 9.355e+02, threshold=8.568e+02, percent-clipped=1.0
2023-06-18 07:54:13,130 INFO [train.py:996] (3/4) Epoch 1, batch 26350, loss[loss=0.3631, simple_loss=0.4167, pruned_loss=0.1548, over 21822.00 frames. ], tot_loss[loss=0.3493, simple_loss=0.3964, pruned_loss=0.1511, over 4290633.57 frames. ], batch size: 118, lr: 2.35e-02, grad_scale: 64.0
2023-06-18 07:54:30,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0
2023-06-18 07:54:34,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=158160.0, ans=0.125
2023-06-18 07:54:36,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0
2023-06-18 07:54:41,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0
2023-06-18 07:54:41,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0
2023-06-18 07:54:58,277 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 07:55:03,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158220.0, ans=0.1
2023-06-18 07:55:54,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158400.0, ans=0.0
2023-06-18 07:55:55,489 INFO [train.py:996] (3/4) Epoch 1, batch 26400, loss[loss=0.3691, simple_loss=0.3748, pruned_loss=0.1817, over 21521.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3906, pruned_loss=0.151, over 4274667.71 frames. ], batch size: 441, lr: 2.35e-02, grad_scale: 64.0
2023-06-18 07:57:16,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.627e+02 4.358e+02 5.298e+02 1.261e+03, threshold=8.716e+02, percent-clipped=4.0
2023-06-18 07:57:19,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=158580.0, ans=0.0
2023-06-18 07:57:44,272 INFO [train.py:996] (3/4) Epoch 1, batch 26450, loss[loss=0.2935, simple_loss=0.3247, pruned_loss=0.1312, over 21833.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3899, pruned_loss=0.1513, over 4268205.30 frames. ], batch size: 107, lr: 2.35e-02, grad_scale: 32.0
2023-06-18 07:57:46,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=158700.0, ans=0.0
2023-06-18 07:59:02,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0
2023-06-18 07:59:25,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=158940.0, ans=0.125
2023-06-18 07:59:28,068 INFO [train.py:996] (3/4) Epoch 1, batch 26500, loss[loss=0.2717, simple_loss=0.3286, pruned_loss=0.1074, over 21782.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3924, pruned_loss=0.1493, over 4268465.82 frames. ], batch size: 247, lr: 2.35e-02, grad_scale: 32.0
2023-06-18 07:59:41,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=159000.0, ans=0.0
2023-06-18 08:00:18,240 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 08:00:49,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.803e+02 4.749e+02 6.034e+02 1.314e+03, threshold=9.498e+02, percent-clipped=6.0
2023-06-18 08:01:13,483 INFO [train.py:996] (3/4) Epoch 1, batch 26550, loss[loss=0.2426, simple_loss=0.3076, pruned_loss=0.08876, over 21520.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3863, pruned_loss=0.1434, over 4266170.79 frames. ], batch size: 212, lr: 2.35e-02, grad_scale: 32.0
2023-06-18 08:01:43,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=159300.0, ans=0.2
2023-06-18 08:01:46,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=159360.0, ans=0.2
2023-06-18 08:02:10,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0
2023-06-18 08:02:11,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=159420.0, ans=0.2
2023-06-18 08:02:12,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=22.5
2023-06-18 08:02:49,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159540.0, ans=0.125
2023-06-18 08:03:00,174 INFO [train.py:996] (3/4) Epoch 1, batch 26600, loss[loss=0.4107, simple_loss=0.4298, pruned_loss=0.1958, over 21520.00 frames. ], tot_loss[loss=0.3315, simple_loss=0.3844, pruned_loss=0.1392, over 4266027.49 frames. ], batch size: 441, lr: 2.34e-02, grad_scale: 32.0
2023-06-18 08:03:12,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=159600.0, ans=0.0
2023-06-18 08:03:26,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=159660.0, ans=0.025
2023-06-18 08:03:30,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0
2023-06-18 08:04:08,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.521e+02 4.224e+02 5.242e+02 1.118e+03, threshold=8.449e+02, percent-clipped=1.0
2023-06-18 08:04:21,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.76 vs. limit=22.5
2023-06-18 08:04:30,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=159840.0, ans=0.125
2023-06-18 08:04:36,025 INFO [train.py:996] (3/4) Epoch 1, batch 26650, loss[loss=0.2128, simple_loss=0.2887, pruned_loss=0.06843, over 21530.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3771, pruned_loss=0.1379, over 4257189.01 frames. ], batch size: 230, lr: 2.34e-02, grad_scale: 32.0
2023-06-18 08:04:53,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=159900.0, ans=0.125
2023-06-18 08:05:05,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0
2023-06-18 08:05:29,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0
2023-06-18 08:05:32,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0
2023-06-18 08:05:41,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160080.0, ans=0.1
2023-06-18 08:05:54,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=160140.0, ans=0.2
2023-06-18 08:06:15,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=160200.0, ans=0.125
2023-06-18 08:06:16,342 INFO [train.py:996] (3/4) Epoch 1, batch 26700, loss[loss=0.3708, simple_loss=0.397, pruned_loss=0.1723, over 21721.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3697, pruned_loss=0.1345, over 4255228.66 frames. ], batch size: 473, lr: 2.34e-02, grad_scale: 32.0
2023-06-18 08:06:27,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160200.0, ans=0.1
2023-06-18 08:07:13,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0
2023-06-18 08:07:23,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.935e+02 3.510e+02 4.681e+02 9.206e+02, threshold=7.020e+02, percent-clipped=3.0
2023-06-18 08:07:30,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=160440.0, ans=0.125
2023-06-18 08:07:30,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=160440.0, ans=0.5
2023-06-18 08:08:03,391 INFO [train.py:996] (3/4) Epoch 1, batch 26750, loss[loss=0.2702, simple_loss=0.3544, pruned_loss=0.09297, over 21739.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3683, pruned_loss=0.1323, over 4259878.02 frames. ], batch size: 298, lr: 2.34e-02, grad_scale: 32.0
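
The batch size reported on the train.py:996 lines swings from about 60 to just over 700 utterances because batches are assembled by total audio duration rather than by a fixed count: a batch of long cuts holds few utterances, a batch of short ones holds many. The run uses lhotse's DynamicBucketingSampler for this; a minimal sketch, assuming lhotse's usual import layout, with an illustrative max_duration and a hypothetical manifest path:

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler

    cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # hypothetical path
    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=900,   # seconds of audio per batch (illustrative)
        num_buckets=30,     # group cuts of similar duration together
        shuffle=True,
    )
    for batch_cuts in sampler:
        print(len(batch_cuts))  # varies: few long cuts or many short ones
        break
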
2023-06-18 08:08:20,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=160500.0, ans=0.2
2023-06-18 08:08:36,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=160560.0, ans=0.0
2023-06-18 08:08:43,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=160620.0, ans=0.025
2023-06-18 08:09:08,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=160680.0, ans=0.125
2023-06-18 08:09:30,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160740.0, ans=0.1
2023-06-18 08:09:49,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0
2023-06-18 08:09:52,193 INFO [train.py:996] (3/4) Epoch 1, batch 26800, loss[loss=0.3503, simple_loss=0.3908, pruned_loss=0.1549, over 20676.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3785, pruned_loss=0.1399, over 4269971.86 frames. ], batch size: 607, lr: 2.34e-02, grad_scale: 32.0
2023-06-18 08:10:05,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160800.0, ans=0.125
2023-06-18 08:10:52,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=8.0
2023-06-18 08:11:01,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.549e+02 4.364e+02 5.200e+02 1.402e+03, threshold=8.728e+02, percent-clipped=9.0
2023-06-18 08:11:02,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=160980.0, ans=0.125
2023-06-18 08:11:02,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=160980.0, ans=0.04949747468305833
2023-06-18 08:11:03,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=160980.0, ans=0.2
2023-06-18 08:11:05,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0
2023-06-18 08:11:25,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=161040.0, ans=0.125
2023-06-18 08:11:27,757 INFO [train.py:996] (3/4) Epoch 1, batch 26850, loss[loss=0.3323, simple_loss=0.3571, pruned_loss=0.1537, over 21642.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3815, pruned_loss=0.1442, over 4273929.91 frames. ], batch size: 415, lr: 2.33e-02, grad_scale: 32.0
2023-06-18 08:11:50,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0
2023-06-18 08:12:43,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=161340.0, ans=0.05
2023-06-18 08:13:02,116 INFO [train.py:996] (3/4) Epoch 1, batch 26900, loss[loss=0.2986, simple_loss=0.3327, pruned_loss=0.1322, over 21699.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3722, pruned_loss=0.1423, over 4278488.38 frames. ], batch size: 417, lr: 2.33e-02, grad_scale: 32.0
], tot_loss[loss=0.3284, simple_loss=0.3722, pruned_loss=0.1423, over 4278488.38 frames. ], batch size: 417, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:14:03,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=161580.0, ans=0.125 2023-06-18 08:14:05,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=161580.0, ans=0.0 2023-06-18 08:14:06,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 3.385e+02 4.142e+02 4.911e+02 9.199e+02, threshold=8.284e+02, percent-clipped=1.0 2023-06-18 08:14:37,080 INFO [train.py:996] (3/4) Epoch 1, batch 26950, loss[loss=0.4033, simple_loss=0.4557, pruned_loss=0.1754, over 21614.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3734, pruned_loss=0.142, over 4281050.60 frames. ], batch size: 441, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:14:53,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161760.0, ans=0.125 2023-06-18 08:15:10,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161820.0, ans=0.125 2023-06-18 08:15:24,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-18 08:16:08,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=161940.0, ans=0.0 2023-06-18 08:16:13,276 INFO [train.py:996] (3/4) Epoch 1, batch 27000, loss[loss=0.3264, simple_loss=0.3922, pruned_loss=0.1303, over 21661.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3723, pruned_loss=0.1379, over 4276944.45 frames. ], batch size: 414, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:16:13,276 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 08:16:29,107 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.2828, simple_loss=0.3784, pruned_loss=0.09358, over 1796401.00 frames. 2023-06-18 08:16:29,107 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 08:17:12,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=22.5 2023-06-18 08:17:20,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=162120.0, ans=0.025 2023-06-18 08:17:39,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.208e+02 3.737e+02 4.814e+02 7.556e+02, threshold=7.473e+02, percent-clipped=0.0 2023-06-18 08:17:59,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=162300.0, ans=0.125 2023-06-18 08:18:01,096 INFO [train.py:996] (3/4) Epoch 1, batch 27050, loss[loss=0.4017, simple_loss=0.4361, pruned_loss=0.1836, over 21718.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3726, pruned_loss=0.1331, over 4274693.13 frames. 
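The validation block at batch 27000 above reports its loss "over 1796401.00 frames", and the identical frame count reappears at the batch-30000 validation further down, so each pass runs over the complete dev set. A rough sketch of such a pass, with model(batch) as a placeholder interface that returns a frame-weighted loss and the number of frames it covers (both names are assumptions, not the recipe's actual API):

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device):
    # Frame-weighted average over the whole dev set, mirroring the
    # 'validation: loss=..., over 1796401.00 frames.' records.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = model(batch)  # placeholder interface
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4g}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is "
          f"{torch.cuda.max_memory_allocated(device) // 2**20}MB")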
], batch size: 389, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:18:03,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=162300.0, ans=0.0 2023-06-18 08:18:45,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=162420.0, ans=0.0 2023-06-18 08:18:55,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=162420.0, ans=0.125 2023-06-18 08:19:20,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=162540.0, ans=10.0 2023-06-18 08:19:35,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-18 08:19:37,654 INFO [train.py:996] (3/4) Epoch 1, batch 27100, loss[loss=0.3, simple_loss=0.3748, pruned_loss=0.1126, over 21490.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3756, pruned_loss=0.1364, over 4283724.69 frames. ], batch size: 211, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:19:56,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=162660.0, ans=0.125 2023-06-18 08:20:26,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162720.0, ans=0.1 2023-06-18 08:20:43,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=162780.0, ans=0.0 2023-06-18 08:20:52,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.734e+02 4.835e+02 6.632e+02 1.398e+03, threshold=9.671e+02, percent-clipped=18.0 2023-06-18 08:21:14,195 INFO [train.py:996] (3/4) Epoch 1, batch 27150, loss[loss=0.3664, simple_loss=0.4226, pruned_loss=0.1551, over 21668.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3877, pruned_loss=0.1411, over 4285245.12 frames. ], batch size: 263, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:22:12,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=163020.0, ans=0.0 2023-06-18 08:22:28,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=22.5 2023-06-18 08:22:30,917 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:22:55,442 INFO [train.py:996] (3/4) Epoch 1, batch 27200, loss[loss=0.3576, simple_loss=0.4014, pruned_loss=0.1569, over 21414.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3967, pruned_loss=0.1448, over 4279777.27 frames. ], batch size: 131, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:23:24,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163260.0, ans=0.0 2023-06-18 08:24:05,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.595e+02 4.705e+02 6.129e+02 1.080e+03, threshold=9.409e+02, percent-clipped=7.0 2023-06-18 08:24:41,935 INFO [train.py:996] (3/4) Epoch 1, batch 27250, loss[loss=0.3393, simple_loss=0.3706, pruned_loss=0.154, over 19986.00 frames. ], tot_loss[loss=0.3507, simple_loss=0.4007, pruned_loss=0.1504, over 4282308.95 frames. 
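In every optim.py:471 record the threshold equals Clipping_scale times the third of the five grad-norm statistics, which read naturally as the min, 25%, 50%, 75% and max of recent per-batch gradient norms: for the batch-27100 stretch above, 2.0 x 4.835e+02 = 9.671e+02. A sketch that reproduces the summary line from such a window (the window length and the exact definition of percent-clipped are assumptions):

import torch

def clipping_summary(grad_norms, clipping_scale=2.0):
    # grad_norms: gradient norms from a window of recent batches.
    g = torch.as_tensor(grad_norms, dtype=torch.float32)
    q = torch.quantile(g, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # scale times the median
    percent_clipped = 100.0 * (g > threshold).float().mean().item()
    print("grad-norm quartiles "
          + " ".join(f"{v:.3e}" for v in q.tolist())
          + f", threshold={threshold:.3e}, percent-clipped={percent_clipped}")

clipping_summary([300.0] * 97 + [1000.0] * 3)
# -> grad-norm quartiles 3.000e+02 ... 1.000e+03, threshold=6.000e+02, percent-clipped=3.0

Tying the clip threshold to a multiple of the recent median adapts it to the current gradient scale rather than relying on a fixed constant.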
], batch size: 703, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:25:23,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=163620.0, ans=0.125 2023-06-18 08:25:46,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=163680.0, ans=0.125 2023-06-18 08:26:20,975 INFO [train.py:996] (3/4) Epoch 1, batch 27300, loss[loss=0.3615, simple_loss=0.4143, pruned_loss=0.1544, over 21802.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.4031, pruned_loss=0.1515, over 4287273.17 frames. ], batch size: 282, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:26:38,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=163800.0, ans=0.125 2023-06-18 08:27:15,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=163920.0, ans=0.125 2023-06-18 08:27:16,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=163920.0, ans=0.125 2023-06-18 08:27:18,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=163920.0, ans=0.125 2023-06-18 08:27:33,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163980.0, ans=0.125 2023-06-18 08:27:36,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.630e+02 3.615e+02 4.138e+02 5.244e+02 1.044e+03, threshold=8.277e+02, percent-clipped=1.0 2023-06-18 08:28:01,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164100.0, ans=0.125 2023-06-18 08:28:02,465 INFO [train.py:996] (3/4) Epoch 1, batch 27350, loss[loss=0.3154, simple_loss=0.3872, pruned_loss=0.1218, over 21581.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.4069, pruned_loss=0.1542, over 4283192.50 frames. ], batch size: 230, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:28:13,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=164100.0, ans=0.2 2023-06-18 08:29:08,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=164280.0, ans=0.125 2023-06-18 08:29:14,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=164280.0, ans=0.0 2023-06-18 08:29:37,991 INFO [train.py:996] (3/4) Epoch 1, batch 27400, loss[loss=0.3168, simple_loss=0.3535, pruned_loss=0.1401, over 21719.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.4008, pruned_loss=0.1528, over 4289760.19 frames. ], batch size: 230, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:30:13,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=164460.0, ans=0.2 2023-06-18 08:30:47,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.591e+02 4.552e+02 5.428e+02 9.216e+02, threshold=9.104e+02, percent-clipped=2.0 2023-06-18 08:31:13,686 INFO [train.py:996] (3/4) Epoch 1, batch 27450, loss[loss=0.2893, simple_loss=0.3396, pruned_loss=0.1195, over 21768.00 frames. ], tot_loss[loss=0.3464, simple_loss=0.3934, pruned_loss=0.1497, over 4291301.79 frames. 
], batch size: 124, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:31:33,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=164760.0, ans=0.125 2023-06-18 08:31:56,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=22.5 2023-06-18 08:32:20,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=164880.0, ans=0.0 2023-06-18 08:32:26,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=164880.0, ans=0.0 2023-06-18 08:32:45,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.0 2023-06-18 08:32:49,996 INFO [train.py:996] (3/4) Epoch 1, batch 27500, loss[loss=0.3156, simple_loss=0.3602, pruned_loss=0.1355, over 21315.00 frames. ], tot_loss[loss=0.3449, simple_loss=0.3908, pruned_loss=0.1495, over 4292752.84 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:33:08,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165060.0, ans=0.1 2023-06-18 08:33:10,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=165060.0, ans=0.95 2023-06-18 08:34:03,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.309e+02 3.875e+02 5.024e+02 1.518e+03, threshold=7.749e+02, percent-clipped=3.0 2023-06-18 08:34:16,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=165240.0, ans=0.125 2023-06-18 08:34:22,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=165240.0, ans=0.125 2023-06-18 08:34:24,889 INFO [train.py:996] (3/4) Epoch 1, batch 27550, loss[loss=0.3349, simple_loss=0.3641, pruned_loss=0.1528, over 21381.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.387, pruned_loss=0.1461, over 4290784.33 frames. ], batch size: 131, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:34:29,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=165300.0, ans=0.125 2023-06-18 08:35:12,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=165420.0, ans=0.125 2023-06-18 08:35:23,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165480.0, ans=0.1 2023-06-18 08:35:46,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5 2023-06-18 08:35:48,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165540.0, ans=0.125 2023-06-18 08:35:59,072 INFO [train.py:996] (3/4) Epoch 1, batch 27600, loss[loss=0.2925, simple_loss=0.3266, pruned_loss=0.1291, over 21543.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3797, pruned_loss=0.1447, over 4283639.77 frames. 
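The scaling.py:962 "Whitening" records compare a per-module statistic of the activation covariance against a limit. One plausible definition of such a metric, assumed here rather than read from the source, is the ratio of the mean squared eigenvalue of the feature covariance to the squared mean eigenvalue: it is 1.0 when the covariance is isotropic (perfectly "white") and grows as the activations become more anisotropic, which would explain why the logged values are all at least 1 and are judged against per-module limits:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations for one whitening group.
    # Assumed metric: E[lambda^2] / (E[lambda])^2 over the eigenvalues
    # lambda of the feature covariance; 1.0 for isotropic activations.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eig = torch.linalg.eigvalsh(cov)
    return (eig.pow(2).mean() / eig.mean().pow(2)).item()

x = torch.randn(10000, 256)
print(whitening_metric(x))  # close to 1: already white
print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))  # anisotropic: much larger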
], batch size: 247, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:36:03,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=165600.0, ans=0.2 2023-06-18 08:36:21,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=165660.0, ans=0.125 2023-06-18 08:37:05,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=165780.0, ans=0.125 2023-06-18 08:37:06,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.205e+02 4.118e+02 5.523e+02 1.130e+03, threshold=8.236e+02, percent-clipped=6.0 2023-06-18 08:37:32,570 INFO [train.py:996] (3/4) Epoch 1, batch 27650, loss[loss=0.2983, simple_loss=0.3623, pruned_loss=0.1171, over 21210.00 frames. ], tot_loss[loss=0.3292, simple_loss=0.3724, pruned_loss=0.143, over 4277766.22 frames. ], batch size: 159, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:38:52,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-18 08:38:59,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166140.0, ans=0.125 2023-06-18 08:39:03,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=166140.0, ans=0.2 2023-06-18 08:39:06,501 INFO [train.py:996] (3/4) Epoch 1, batch 27700, loss[loss=0.3898, simple_loss=0.4352, pruned_loss=0.1721, over 21712.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3703, pruned_loss=0.1392, over 4281189.58 frames. ], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:39:30,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-18 08:39:33,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=166260.0, ans=0.0 2023-06-18 08:39:42,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.86 vs. limit=10.0 2023-06-18 08:40:08,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=166380.0, ans=0.125 2023-06-18 08:40:08,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=166380.0, ans=0.125 2023-06-18 08:40:20,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.610e+02 4.452e+02 5.999e+02 1.124e+03, threshold=8.903e+02, percent-clipped=7.0 2023-06-18 08:40:41,452 INFO [train.py:996] (3/4) Epoch 1, batch 27750, loss[loss=0.3448, simple_loss=0.3946, pruned_loss=0.1475, over 21745.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3745, pruned_loss=0.1386, over 4281006.33 frames. ], batch size: 441, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:41:36,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-18 08:41:46,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.96 vs. 
limit=22.5 2023-06-18 08:41:50,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=166680.0, ans=0.0 2023-06-18 08:42:00,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166740.0, ans=0.1 2023-06-18 08:42:16,140 INFO [train.py:996] (3/4) Epoch 1, batch 27800, loss[loss=0.3812, simple_loss=0.4025, pruned_loss=0.1799, over 21634.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3727, pruned_loss=0.1394, over 4274103.57 frames. ], batch size: 471, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:42:25,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=166800.0, ans=0.125 2023-06-18 08:42:34,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:42:39,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:42:39,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:43:20,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.396e+02 4.185e+02 5.590e+02 8.815e+02, threshold=8.371e+02, percent-clipped=0.0 2023-06-18 08:43:46,588 INFO [train.py:996] (3/4) Epoch 1, batch 27850, loss[loss=0.323, simple_loss=0.3762, pruned_loss=0.1349, over 21499.00 frames. ], tot_loss[loss=0.3272, simple_loss=0.3727, pruned_loss=0.1409, over 4281127.20 frames. ], batch size: 131, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:44:09,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-18 08:44:19,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=167160.0, ans=0.0 2023-06-18 08:44:19,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=167160.0, ans=0.0 2023-06-18 08:44:41,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167220.0, ans=0.0 2023-06-18 08:45:03,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=167340.0, ans=0.04949747468305833 2023-06-18 08:45:15,241 INFO [train.py:996] (3/4) Epoch 1, batch 27900, loss[loss=0.3235, simple_loss=0.4011, pruned_loss=0.123, over 21862.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3803, pruned_loss=0.1418, over 4279723.62 frames. ], batch size: 372, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:45:43,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=167460.0, ans=0.125 2023-06-18 08:45:55,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=167520.0, ans=0.2 2023-06-18 08:46:09,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167520.0, ans=0.125 2023-06-18 08:46:16,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. 
limit=15.0 2023-06-18 08:46:26,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.916e+02 4.626e+02 5.836e+02 1.013e+03, threshold=9.252e+02, percent-clipped=5.0 2023-06-18 08:46:50,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167640.0, ans=0.1 2023-06-18 08:46:58,158 INFO [train.py:996] (3/4) Epoch 1, batch 27950, loss[loss=0.321, simple_loss=0.3843, pruned_loss=0.1289, over 21646.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3797, pruned_loss=0.1368, over 4279047.10 frames. ], batch size: 263, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:47:10,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=167700.0, ans=0.125 2023-06-18 08:47:59,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167880.0, ans=0.125 2023-06-18 08:48:31,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=167940.0, ans=0.04949747468305833 2023-06-18 08:48:35,919 INFO [train.py:996] (3/4) Epoch 1, batch 28000, loss[loss=0.2426, simple_loss=0.3093, pruned_loss=0.08794, over 21853.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3775, pruned_loss=0.1331, over 4283746.70 frames. ], batch size: 98, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:48:45,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-18 08:49:20,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=168120.0, ans=0.0 2023-06-18 08:49:24,811 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:49:28,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=12.0 2023-06-18 08:49:35,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.441e+02 4.640e+02 5.582e+02 1.043e+03, threshold=9.281e+02, percent-clipped=2.0 2023-06-18 08:49:40,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=168240.0, ans=0.125 2023-06-18 08:50:05,687 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:50:11,360 INFO [train.py:996] (3/4) Epoch 1, batch 28050, loss[loss=0.3534, simple_loss=0.4025, pruned_loss=0.1522, over 21651.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3763, pruned_loss=0.1359, over 4287065.59 frames. ], batch size: 441, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:50:16,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=168300.0, ans=0.125 2023-06-18 08:50:38,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2023-06-18 08:50:50,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.72 vs. 
limit=6.0 2023-06-18 08:50:53,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=168420.0, ans=0.05 2023-06-18 08:50:56,141 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.557e-02 2023-06-18 08:51:30,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=168540.0, ans=0.07 2023-06-18 08:51:41,514 INFO [train.py:996] (3/4) Epoch 1, batch 28100, loss[loss=0.3005, simple_loss=0.3338, pruned_loss=0.1336, over 21528.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3742, pruned_loss=0.1363, over 4281034.03 frames. ], batch size: 263, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:51:54,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=168600.0, ans=0.0 2023-06-18 08:51:55,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=168600.0, ans=0.125 2023-06-18 08:52:01,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168660.0, ans=0.1 2023-06-18 08:52:01,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=168660.0, ans=0.125 2023-06-18 08:52:10,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-18 08:52:51,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.651e+02 4.541e+02 5.753e+02 9.912e+02, threshold=9.083e+02, percent-clipped=1.0 2023-06-18 08:53:16,346 INFO [train.py:996] (3/4) Epoch 1, batch 28150, loss[loss=0.2723, simple_loss=0.2987, pruned_loss=0.1229, over 20696.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.367, pruned_loss=0.1358, over 4274300.08 frames. ], batch size: 608, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:53:26,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168900.0, ans=0.125 2023-06-18 08:53:43,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=168960.0, ans=0.125 2023-06-18 08:54:26,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-18 08:54:45,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169140.0, ans=0.1 2023-06-18 08:54:53,286 INFO [train.py:996] (3/4) Epoch 1, batch 28200, loss[loss=0.2842, simple_loss=0.32, pruned_loss=0.1242, over 21550.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3673, pruned_loss=0.1385, over 4270474.07 frames. 
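The grad_scale field is the loss scale used for half-precision training: it drops from 32.0 to 16.0 between batches 28100 and 28150 above, the standard reaction to an overflowing fp16 gradient, and climbs back to 32.0 and later 64.0 further down once enough consecutive steps succeed. PyTorch's stock GradScaler exhibits the same dynamics; the snippet below is a generic illustration of that mechanism, not a claim about how this recipe wires it up (the growth_interval in particular is assumed from how quickly the scale recovers here):

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,     # the grad_scale logged through most of this stretch
    backoff_factor=0.5,  # halve on an inf/nan gradient (32.0 -> 16.0)
    growth_factor=2.0,   # double after growth_interval clean steps
    growth_interval=200, # assumed: the log regains 32.0 within a few hundred batches
)

def training_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()  # backward through the scaled loss
    scaler.step(optimizer)         # skips the update if gradients overflowed
    scaler.update()                # adjusts the scale, as tracked by grad_scale
    return loss.detach()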
], batch size: 263, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:54:56,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169200.0, ans=0.1 2023-06-18 08:55:12,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=169260.0, ans=0.125 2023-06-18 08:55:22,930 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:56:00,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=169380.0, ans=0.125 2023-06-18 08:56:04,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.823e+02 5.073e+02 6.497e+02 1.031e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-18 08:56:15,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169440.0, ans=0.1 2023-06-18 08:56:28,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-18 08:56:28,461 INFO [train.py:996] (3/4) Epoch 1, batch 28250, loss[loss=0.3629, simple_loss=0.3878, pruned_loss=0.169, over 21641.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3729, pruned_loss=0.1438, over 4271826.25 frames. ], batch size: 298, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:56:33,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=169500.0, ans=0.2 2023-06-18 08:56:59,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=169620.0, ans=0.125 2023-06-18 08:57:22,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169620.0, ans=0.125 2023-06-18 08:57:51,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=169740.0, ans=0.2 2023-06-18 08:58:00,536 INFO [train.py:996] (3/4) Epoch 1, batch 28300, loss[loss=0.227, simple_loss=0.2963, pruned_loss=0.07891, over 21364.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3687, pruned_loss=0.1395, over 4272555.75 frames. ], batch size: 194, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:58:08,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=169800.0, ans=0.125 2023-06-18 08:58:15,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169860.0, ans=0.125 2023-06-18 08:58:16,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=169860.0, ans=0.125 2023-06-18 08:58:18,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. 
limit=15.0 2023-06-18 08:58:20,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=169860.0, ans=0.125 2023-06-18 08:58:35,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=169920.0, ans=0.125 2023-06-18 08:58:37,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=169920.0, ans=0.2 2023-06-18 08:58:45,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-18 08:58:57,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-18 08:59:10,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=169980.0, ans=15.0 2023-06-18 08:59:11,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.420e+02 4.323e+02 5.538e+02 1.121e+03, threshold=8.647e+02, percent-clipped=1.0 2023-06-18 08:59:31,050 INFO [train.py:996] (3/4) Epoch 1, batch 28350, loss[loss=0.2936, simple_loss=0.3405, pruned_loss=0.1233, over 21623.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3652, pruned_loss=0.132, over 4270407.01 frames. ], batch size: 282, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 09:00:50,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=170280.0, ans=0.125 2023-06-18 09:01:09,031 INFO [train.py:996] (3/4) Epoch 1, batch 28400, loss[loss=0.3073, simple_loss=0.3446, pruned_loss=0.135, over 21537.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3608, pruned_loss=0.1321, over 4267968.70 frames. ], batch size: 263, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:01:15,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170400.0, ans=0.1 2023-06-18 09:01:18,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=170400.0, ans=0.125 2023-06-18 09:02:24,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.627e+02 4.521e+02 5.478e+02 1.024e+03, threshold=9.042e+02, percent-clipped=4.0 2023-06-18 09:02:28,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=170640.0, ans=0.0 2023-06-18 09:02:44,591 INFO [train.py:996] (3/4) Epoch 1, batch 28450, loss[loss=0.2992, simple_loss=0.3234, pruned_loss=0.1375, over 20010.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3669, pruned_loss=0.1368, over 4266057.96 frames. ], batch size: 703, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:03:17,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=170760.0, ans=0.0 2023-06-18 09:03:51,715 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:03:56,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.78 vs. 
limit=22.5 2023-06-18 09:04:20,712 INFO [train.py:996] (3/4) Epoch 1, batch 28500, loss[loss=0.363, simple_loss=0.3972, pruned_loss=0.1644, over 21757.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3696, pruned_loss=0.1398, over 4274657.16 frames. ], batch size: 298, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:04:22,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=171000.0, ans=0.5 2023-06-18 09:04:57,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171060.0, ans=0.125 2023-06-18 09:05:00,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-18 09:05:15,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=171120.0, ans=0.2 2023-06-18 09:05:17,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171120.0, ans=0.125 2023-06-18 09:05:17,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=171120.0, ans=0.0 2023-06-18 09:05:37,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.668e+02 4.799e+02 6.213e+02 1.260e+03, threshold=9.598e+02, percent-clipped=4.0 2023-06-18 09:06:07,520 INFO [train.py:996] (3/4) Epoch 1, batch 28550, loss[loss=0.4281, simple_loss=0.4756, pruned_loss=0.1903, over 21727.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3808, pruned_loss=0.1453, over 4276851.21 frames. ], batch size: 441, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:06:27,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=171360.0, ans=0.0 2023-06-18 09:07:01,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=171480.0, ans=0.125 2023-06-18 09:07:36,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=171540.0, ans=0.125 2023-06-18 09:07:47,164 INFO [train.py:996] (3/4) Epoch 1, batch 28600, loss[loss=0.4072, simple_loss=0.4403, pruned_loss=0.1871, over 21780.00 frames. ], tot_loss[loss=0.3412, simple_loss=0.3883, pruned_loss=0.147, over 4275677.02 frames. ], batch size: 441, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:08:02,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=171600.0, ans=0.125 2023-06-18 09:08:45,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=171780.0, ans=0.0 2023-06-18 09:08:47,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.320e+02 4.208e+02 5.237e+02 8.981e+02, threshold=8.415e+02, percent-clipped=0.0 2023-06-18 09:09:02,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=171840.0, ans=0.125 2023-06-18 09:09:05,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2023-06-18 09:09:22,052 INFO [train.py:996] (3/4) Epoch 1, batch 28650, loss[loss=0.2977, simple_loss=0.333, pruned_loss=0.1312, over 21704.00 frames. 
], tot_loss[loss=0.338, simple_loss=0.3826, pruned_loss=0.1467, over 4274072.12 frames. ], batch size: 334, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:09:41,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=171960.0, ans=0.125 2023-06-18 09:09:50,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=171960.0, ans=0.2 2023-06-18 09:10:37,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=172140.0, ans=15.0 2023-06-18 09:10:56,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=172200.0, ans=0.125 2023-06-18 09:10:58,006 INFO [train.py:996] (3/4) Epoch 1, batch 28700, loss[loss=0.3265, simple_loss=0.3496, pruned_loss=0.1517, over 20207.00 frames. ], tot_loss[loss=0.3397, simple_loss=0.3825, pruned_loss=0.1484, over 4265745.81 frames. ], batch size: 707, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:11:09,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-18 09:11:25,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=172260.0, ans=0.0 2023-06-18 09:11:30,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172260.0, ans=0.1 2023-06-18 09:11:31,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=172320.0, ans=0.2 2023-06-18 09:12:00,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-18 09:12:09,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.359e+02 4.224e+02 5.589e+02 9.530e+02, threshold=8.447e+02, percent-clipped=4.0 2023-06-18 09:12:25,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=172440.0, ans=0.2 2023-06-18 09:12:38,133 INFO [train.py:996] (3/4) Epoch 1, batch 28750, loss[loss=0.4174, simple_loss=0.454, pruned_loss=0.1904, over 21545.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.3809, pruned_loss=0.1488, over 4273057.22 frames. ], batch size: 507, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:12:38,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172500.0, ans=0.1 2023-06-18 09:13:13,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=172620.0, ans=0.125 2023-06-18 09:13:47,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=172680.0, ans=0.0 2023-06-18 09:14:00,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172740.0, ans=0.125 2023-06-18 09:14:11,452 INFO [train.py:996] (3/4) Epoch 1, batch 28800, loss[loss=0.4409, simple_loss=0.4718, pruned_loss=0.205, over 21834.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3855, pruned_loss=0.149, over 4269662.18 frames. 
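Across the loss[...] and tot_loss[...] records here, the printed loss is exactly 0.5 * simple_loss + pruned_loss; for batch 28700 above, 0.5 * 0.3496 + 0.1517 = 0.3265. In other words the pruned-RNN-T objective pairs a cheap "simple" loss, computed with a trivial joiner to obtain pruning bounds, with the exact loss evaluated only inside the pruned lattice, under a fixed 0.5 weight at this point in training. In sketch form:

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # Weighted sum matching every loss[...] record in this stretch of the log,
    # e.g. 0.5 * 0.3496 + 0.1517 = 0.3265 at batch 28700.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.3496, 0.1517) - 0.3265) < 1e-6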
], batch size: 124, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:14:37,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-18 09:15:25,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.326e+02 4.027e+02 5.437e+02 1.151e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-18 09:15:49,260 INFO [train.py:996] (3/4) Epoch 1, batch 28850, loss[loss=0.3036, simple_loss=0.3486, pruned_loss=0.1293, over 21809.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3869, pruned_loss=0.1507, over 4279636.09 frames. ], batch size: 247, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:17:25,740 INFO [train.py:996] (3/4) Epoch 1, batch 28900, loss[loss=0.3655, simple_loss=0.4125, pruned_loss=0.1593, over 21746.00 frames. ], tot_loss[loss=0.348, simple_loss=0.39, pruned_loss=0.153, over 4283007.45 frames. ], batch size: 332, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:18:04,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173520.0, ans=0.1 2023-06-18 09:18:33,877 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:18:38,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.786e+02 4.531e+02 6.034e+02 1.219e+03, threshold=9.062e+02, percent-clipped=7.0 2023-06-18 09:18:39,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-18 09:18:49,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=173640.0, ans=0.07 2023-06-18 09:18:58,992 INFO [train.py:996] (3/4) Epoch 1, batch 28950, loss[loss=0.5248, simple_loss=0.5719, pruned_loss=0.2389, over 19705.00 frames. ], tot_loss[loss=0.3488, simple_loss=0.3919, pruned_loss=0.1528, over 4278031.45 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:19:02,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=173700.0, ans=0.95 2023-06-18 09:19:28,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-18 09:19:43,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=173820.0, ans=0.2 2023-06-18 09:20:30,542 INFO [train.py:996] (3/4) Epoch 1, batch 29000, loss[loss=0.3647, simple_loss=0.3994, pruned_loss=0.165, over 19889.00 frames. ], tot_loss[loss=0.3509, simple_loss=0.3974, pruned_loss=0.1521, over 4275963.26 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:20:42,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=174000.0, ans=0.125 2023-06-18 09:20:52,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.37 vs. 
limit=15.0 2023-06-18 09:21:15,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=174060.0, ans=0.07 2023-06-18 09:21:43,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=174180.0, ans=0.2 2023-06-18 09:21:45,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.309e+02 4.233e+02 5.463e+02 9.741e+02, threshold=8.465e+02, percent-clipped=3.0 2023-06-18 09:22:05,671 INFO [train.py:996] (3/4) Epoch 1, batch 29050, loss[loss=0.3252, simple_loss=0.3628, pruned_loss=0.1438, over 21862.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3949, pruned_loss=0.1529, over 4278324.75 frames. ], batch size: 298, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:23:03,097 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:23:07,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-18 09:23:38,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=174540.0, ans=0.0 2023-06-18 09:23:40,547 INFO [train.py:996] (3/4) Epoch 1, batch 29100, loss[loss=0.3052, simple_loss=0.3394, pruned_loss=0.1355, over 21751.00 frames. ], tot_loss[loss=0.341, simple_loss=0.384, pruned_loss=0.149, over 4279850.52 frames. ], batch size: 351, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:23:41,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.64 vs. limit=10.0 2023-06-18 09:24:12,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=174660.0, ans=0.0 2023-06-18 09:24:48,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.440e+02 4.060e+02 5.417e+02 8.880e+02, threshold=8.120e+02, percent-clipped=2.0 2023-06-18 09:25:17,842 INFO [train.py:996] (3/4) Epoch 1, batch 29150, loss[loss=0.3036, simple_loss=0.3446, pruned_loss=0.1312, over 21954.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3798, pruned_loss=0.1452, over 4275290.53 frames. ], batch size: 103, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:25:22,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=174900.0, ans=0.125 2023-06-18 09:25:39,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-18 09:26:00,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-18 09:26:48,390 INFO [train.py:996] (3/4) Epoch 1, batch 29200, loss[loss=0.3066, simple_loss=0.3491, pruned_loss=0.132, over 21767.00 frames. ], tot_loss[loss=0.3304, simple_loss=0.374, pruned_loss=0.1434, over 4267017.04 frames. 
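Note that tot_loss is not a plain epoch-wide average: its frame count hovers between roughly 4.25M and 4.29M in every record here instead of growing, which points to a decayed, frame-weighted running average over recent batches. One scheme with that behaviour, with the decay constant chosen only to match the observed numbers (at about 21.5k frames per batch, decay 0.995 saturates near 21500 / 0.005 = 4.3M frames):

class RunningLoss:
    # Exponentially decayed, frame-weighted running average: old statistics
    # shrink each batch, so the effective frame count saturates rather than
    # growing without bound.
    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / self.frames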
], batch size: 371, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:27:24,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175320.0, ans=0.1 2023-06-18 09:27:25,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-18 09:27:29,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175320.0, ans=0.1 2023-06-18 09:27:29,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175320.0, ans=0.1 2023-06-18 09:27:45,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=175380.0, ans=0.0 2023-06-18 09:27:50,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.297e+02 4.275e+02 5.517e+02 1.101e+03, threshold=8.550e+02, percent-clipped=8.0 2023-06-18 09:27:51,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=175380.0, ans=0.125 2023-06-18 09:27:51,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-18 09:28:04,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=10.0 2023-06-18 09:28:25,596 INFO [train.py:996] (3/4) Epoch 1, batch 29250, loss[loss=0.2582, simple_loss=0.3132, pruned_loss=0.1016, over 21700.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3713, pruned_loss=0.1395, over 4255239.04 frames. ], batch size: 112, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:29:00,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=175620.0, ans=0.125 2023-06-18 09:29:24,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=175680.0, ans=0.125 2023-06-18 09:30:05,175 INFO [train.py:996] (3/4) Epoch 1, batch 29300, loss[loss=0.3283, simple_loss=0.3759, pruned_loss=0.1403, over 21584.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3719, pruned_loss=0.1374, over 4258186.58 frames. ], batch size: 441, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:30:11,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175800.0, ans=0.1 2023-06-18 09:30:18,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=12.0 2023-06-18 09:30:32,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=175860.0, ans=0.125 2023-06-18 09:30:51,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.57 vs. 
limit=12.0 2023-06-18 09:31:07,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.858e+02 5.248e+02 6.398e+02 1.119e+03, threshold=1.050e+03, percent-clipped=2.0 2023-06-18 09:31:37,303 INFO [train.py:996] (3/4) Epoch 1, batch 29350, loss[loss=0.3352, simple_loss=0.3929, pruned_loss=0.1388, over 21530.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3687, pruned_loss=0.1374, over 4261355.93 frames. ], batch size: 441, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:32:01,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-18 09:32:04,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=176160.0, ans=0.0 2023-06-18 09:32:18,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-18 09:32:53,385 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:33:05,634 INFO [train.py:996] (3/4) Epoch 1, batch 29400, loss[loss=0.261, simple_loss=0.3193, pruned_loss=0.1013, over 21704.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3665, pruned_loss=0.1333, over 4262888.23 frames. ], batch size: 247, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:33:07,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176400.0, ans=0.0 2023-06-18 09:33:33,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-18 09:34:22,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.521e+02 4.187e+02 5.229e+02 1.148e+03, threshold=8.373e+02, percent-clipped=2.0 2023-06-18 09:34:41,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=176700.0, ans=0.2 2023-06-18 09:34:42,248 INFO [train.py:996] (3/4) Epoch 1, batch 29450, loss[loss=0.3424, simple_loss=0.386, pruned_loss=0.1494, over 21593.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3618, pruned_loss=0.1301, over 4266692.71 frames. ], batch size: 263, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:34:47,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=176700.0, ans=0.125 2023-06-18 09:34:59,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-18 09:35:11,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=176820.0, ans=0.0 2023-06-18 09:35:45,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176880.0, ans=0.1 2023-06-18 09:36:18,585 INFO [train.py:996] (3/4) Epoch 1, batch 29500, loss[loss=0.412, simple_loss=0.4289, pruned_loss=0.1976, over 21830.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3682, pruned_loss=0.1352, over 4272906.83 frames. 
], batch size: 441, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:36:37,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=177060.0, ans=0.2 2023-06-18 09:37:29,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.303e+02 3.922e+02 4.990e+02 9.245e+02, threshold=7.844e+02, percent-clipped=1.0 2023-06-18 09:37:54,374 INFO [train.py:996] (3/4) Epoch 1, batch 29550, loss[loss=0.3141, simple_loss=0.358, pruned_loss=0.1351, over 21857.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3704, pruned_loss=0.139, over 4283246.73 frames. ], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:38:04,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177300.0, ans=0.1 2023-06-18 09:38:11,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177360.0, ans=0.125 2023-06-18 09:38:34,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177420.0, ans=0.125 2023-06-18 09:39:01,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-18 09:39:06,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=177480.0, ans=0.0 2023-06-18 09:39:30,927 INFO [train.py:996] (3/4) Epoch 1, batch 29600, loss[loss=0.3985, simple_loss=0.4408, pruned_loss=0.1781, over 21629.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3793, pruned_loss=0.1432, over 4285692.43 frames. ], batch size: 263, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:39:45,150 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:40:02,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177660.0, ans=0.1 2023-06-18 09:40:03,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=177660.0, ans=0.0 2023-06-18 09:40:03,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177660.0, ans=0.1 2023-06-18 09:40:41,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.582e+02 4.234e+02 5.701e+02 1.045e+03, threshold=8.469e+02, percent-clipped=5.0 2023-06-18 09:40:58,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177840.0, ans=0.0 2023-06-18 09:41:01,396 INFO [train.py:996] (3/4) Epoch 1, batch 29650, loss[loss=0.2676, simple_loss=0.3185, pruned_loss=0.1084, over 21139.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3755, pruned_loss=0.1377, over 4276147.41 frames. 
], batch size: 159, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:41:09,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=177900.0, ans=0.0 2023-06-18 09:41:10,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=177900.0, ans=15.0 2023-06-18 09:41:17,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-18 09:42:37,429 INFO [train.py:996] (3/4) Epoch 1, batch 29700, loss[loss=0.2735, simple_loss=0.3347, pruned_loss=0.1061, over 21775.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3758, pruned_loss=0.137, over 4278036.02 frames. ], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:43:43,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=178380.0, ans=0.5 2023-06-18 09:43:51,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.547e+02 3.946e+02 4.847e+02 6.771e+02 1.201e+03, threshold=9.693e+02, percent-clipped=9.0 2023-06-18 09:44:04,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.11 vs. limit=22.5 2023-06-18 09:44:07,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=178440.0, ans=15.0 2023-06-18 09:44:11,026 INFO [train.py:996] (3/4) Epoch 1, batch 29750, loss[loss=0.3763, simple_loss=0.4219, pruned_loss=0.1654, over 21570.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3796, pruned_loss=0.1367, over 4276470.97 frames. ], batch size: 471, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:45:24,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=178740.0, ans=0.125 2023-06-18 09:45:27,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:45:31,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-18 09:45:41,154 INFO [train.py:996] (3/4) Epoch 1, batch 29800, loss[loss=0.3256, simple_loss=0.4123, pruned_loss=0.1195, over 20873.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.382, pruned_loss=0.1385, over 4284503.78 frames. ], batch size: 608, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:45:43,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=178800.0, ans=0.125 2023-06-18 09:46:57,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-18 09:46:57,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.287e+02 3.786e+02 4.602e+02 9.209e+02, threshold=7.572e+02, percent-clipped=0.0 2023-06-18 09:47:01,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. 
limit=8.0 2023-06-18 09:47:07,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179040.0, ans=0.1 2023-06-18 09:47:17,280 INFO [train.py:996] (3/4) Epoch 1, batch 29850, loss[loss=0.3324, simple_loss=0.3756, pruned_loss=0.1445, over 21854.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3779, pruned_loss=0.1357, over 4278977.64 frames. ], batch size: 414, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:48:26,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2023-06-18 09:48:52,472 INFO [train.py:996] (3/4) Epoch 1, batch 29900, loss[loss=0.3435, simple_loss=0.381, pruned_loss=0.153, over 21251.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3748, pruned_loss=0.1363, over 4288644.09 frames. ], batch size: 143, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:50:04,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-18 09:50:09,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.414e+02 4.153e+02 5.110e+02 1.047e+03, threshold=8.306e+02, percent-clipped=3.0 2023-06-18 09:50:34,861 INFO [train.py:996] (3/4) Epoch 1, batch 29950, loss[loss=0.3351, simple_loss=0.3844, pruned_loss=0.1429, over 21267.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.38, pruned_loss=0.1417, over 4285362.35 frames. ], batch size: 143, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:50:49,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=179700.0, ans=0.125 2023-06-18 09:51:44,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=179880.0, ans=0.2 2023-06-18 09:52:17,249 INFO [train.py:996] (3/4) Epoch 1, batch 30000, loss[loss=0.2835, simple_loss=0.3523, pruned_loss=0.1073, over 21448.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3843, pruned_loss=0.1435, over 4287484.78 frames. ], batch size: 131, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:52:17,249 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 09:52:31,517 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.4144, 1.8641, 2.1033, 2.2386, 1.1989, 2.5546, 2.5059, 1.1830], device='cuda:3') 2023-06-18 09:52:35,096 INFO [train.py:1028] (3/4) Epoch 1, validation: loss=0.2819, simple_loss=0.3813, pruned_loss=0.09129, over 1796401.00 frames. 2023-06-18 09:52:35,097 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 09:52:35,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=180000.0, ans=0.0 2023-06-18 09:52:59,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180060.0, ans=0.1 2023-06-18 09:53:00,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-18 09:53:26,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. 
limit=15.0 2023-06-18 09:53:54,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.360e+02 4.099e+02 5.189e+02 8.987e+02, threshold=8.197e+02, percent-clipped=1.0 2023-06-18 09:54:19,747 INFO [train.py:996] (3/4) Epoch 1, batch 30050, loss[loss=0.4797, simple_loss=0.5349, pruned_loss=0.2122, over 21543.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3888, pruned_loss=0.1409, over 4288416.01 frames. ], batch size: 471, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:55:23,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=180480.0, ans=0.125 2023-06-18 09:55:28,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-18 09:55:37,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=180480.0, ans=0.125 2023-06-18 09:55:43,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=180540.0, ans=0.125 2023-06-18 09:55:56,182 INFO [train.py:996] (3/4) Epoch 1, batch 30100, loss[loss=0.3049, simple_loss=0.3376, pruned_loss=0.1361, over 21166.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3869, pruned_loss=0.1402, over 4278752.39 frames. ], batch size: 159, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:56:14,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=180660.0, ans=0.125 2023-06-18 09:56:52,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.99 vs. limit=6.0 2023-06-18 09:56:56,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=180780.0, ans=0.09899494936611666 2023-06-18 09:56:56,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=180780.0, ans=0.0 2023-06-18 09:57:11,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.434e+02 4.118e+02 5.111e+02 9.252e+02, threshold=8.235e+02, percent-clipped=1.0 2023-06-18 09:57:24,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=180840.0, ans=0.125 2023-06-18 09:57:31,604 INFO [train.py:996] (3/4) Epoch 1, batch 30150, loss[loss=0.315, simple_loss=0.3649, pruned_loss=0.1325, over 16248.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3835, pruned_loss=0.1425, over 4270766.95 frames. ], batch size: 62, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:58:03,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=22.5 2023-06-18 09:58:33,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=181080.0, ans=0.0 2023-06-18 09:58:43,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=181080.0, ans=0.0 2023-06-18 09:58:53,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-18 09:59:05,008 INFO [train.py:996] (3/4) Epoch 1, batch 30200, loss[loss=0.2788, simple_loss=0.3214, pruned_loss=0.1181, over 21795.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3849, pruned_loss=0.1398, over 4267053.02 frames. ], batch size: 102, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:59:08,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=181200.0, ans=0.0 2023-06-18 09:59:39,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=181260.0, ans=0.0 2023-06-18 10:00:03,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=181320.0, ans=10.0 2023-06-18 10:00:17,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=181380.0, ans=0.0 2023-06-18 10:00:23,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.548e+02 4.753e+02 6.643e+02 1.324e+03, threshold=9.506e+02, percent-clipped=12.0 2023-06-18 10:00:31,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181440.0, ans=0.1 2023-06-18 10:00:51,968 INFO [train.py:996] (3/4) Epoch 1, batch 30250, loss[loss=0.436, simple_loss=0.5092, pruned_loss=0.1814, over 20798.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.3947, pruned_loss=0.1448, over 4264348.68 frames. ], batch size: 607, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:01:44,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=181620.0, ans=0.2 2023-06-18 10:01:44,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-18 10:02:00,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=181680.0, ans=0.1 2023-06-18 10:02:32,619 INFO [train.py:996] (3/4) Epoch 1, batch 30300, loss[loss=0.285, simple_loss=0.3215, pruned_loss=0.1243, over 21163.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3904, pruned_loss=0.1442, over 4260371.97 frames. ], batch size: 176, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:02:39,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181800.0, ans=0.125 2023-06-18 10:03:10,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-18 10:03:14,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=181920.0, ans=0.0 2023-06-18 10:03:36,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=181980.0, ans=0.125 2023-06-18 10:03:47,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.818e+02 4.498e+02 5.757e+02 1.296e+03, threshold=8.996e+02, percent-clipped=5.0 2023-06-18 10:04:11,439 INFO [train.py:996] (3/4) Epoch 1, batch 30350, loss[loss=0.3293, simple_loss=0.3837, pruned_loss=0.1375, over 21709.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3908, pruned_loss=0.1456, over 4262282.35 frames. 
], batch size: 298, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:05:26,611 INFO [train.py:996] (3/4) Epoch 1, batch 30400, loss[loss=0.3175, simple_loss=0.3393, pruned_loss=0.1479, over 20207.00 frames. ], tot_loss[loss=0.3324, simple_loss=0.3809, pruned_loss=0.142, over 4243333.85 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:05:44,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=182460.0, ans=0.05 2023-06-18 10:06:25,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=182580.0, ans=10.0 2023-06-18 10:06:31,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 4.537e+02 5.674e+02 8.183e+02 2.727e+03, threshold=1.135e+03, percent-clipped=13.0 2023-06-18 10:06:34,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=182640.0, ans=0.2 2023-06-18 10:06:35,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=182640.0, ans=0.125 2023-06-18 10:06:47,825 INFO [train.py:996] (3/4) Epoch 1, batch 30450, loss[loss=0.4393, simple_loss=0.5328, pruned_loss=0.1729, over 19796.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.387, pruned_loss=0.1453, over 4187314.68 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:06:53,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=182700.0, ans=0.125 2023-06-18 10:07:18,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=182820.0, ans=0.1 2023-06-18 10:07:21,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182820.0, ans=0.1 2023-06-18 10:07:27,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.08 vs. limit=15.0 2023-06-18 10:07:31,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=182820.0, ans=0.2 2023-06-18 10:07:46,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=182940.0, ans=0.125 2023-06-18 10:09:26,528 INFO [train.py:996] (3/4) Epoch 2, batch 0, loss[loss=0.3639, simple_loss=0.3897, pruned_loss=0.1691, over 21742.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.3897, pruned_loss=0.1691, over 21742.00 frames. ], batch size: 112, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 10:09:26,528 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 10:09:43,669 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.3124, simple_loss=0.4068, pruned_loss=0.109, over 1796401.00 frames. 2023-06-18 10:09:43,670 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 10:10:18,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=183090.0, ans=0.125 2023-06-18 10:10:31,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. 
limit=15.0 2023-06-18 10:10:56,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183210.0, ans=0.1 2023-06-18 10:11:01,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=183210.0, ans=0.0 2023-06-18 10:11:04,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 4.740e+02 6.519e+02 1.031e+03 2.172e+03, threshold=1.304e+03, percent-clipped=18.0 2023-06-18 10:11:13,441 INFO [train.py:996] (3/4) Epoch 2, batch 50, loss[loss=0.3675, simple_loss=0.4176, pruned_loss=0.1587, over 19928.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.3902, pruned_loss=0.1471, over 966496.83 frames. ], batch size: 702, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:11:40,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=183330.0, ans=0.0 2023-06-18 10:11:40,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=183330.0, ans=0.2 2023-06-18 10:12:02,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=183390.0, ans=0.2 2023-06-18 10:12:08,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=183450.0, ans=0.125 2023-06-18 10:12:32,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=183450.0, ans=0.2 2023-06-18 10:12:49,799 INFO [train.py:996] (3/4) Epoch 2, batch 100, loss[loss=0.4854, simple_loss=0.5197, pruned_loss=0.2255, over 21426.00 frames. ], tot_loss[loss=0.3497, simple_loss=0.4055, pruned_loss=0.1469, over 1696418.80 frames. ], batch size: 507, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:13:07,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=183570.0, ans=0.125 2023-06-18 10:13:24,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=183630.0, ans=0.125 2023-06-18 10:13:29,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183690.0, ans=0.1 2023-06-18 10:14:16,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.494e+02 4.383e+02 5.480e+02 8.773e+02, threshold=8.766e+02, percent-clipped=0.0 2023-06-18 10:14:22,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=183810.0, ans=0.125 2023-06-18 10:14:29,956 INFO [train.py:996] (3/4) Epoch 2, batch 150, loss[loss=0.2944, simple_loss=0.3692, pruned_loss=0.1098, over 21609.00 frames. ], tot_loss[loss=0.3503, simple_loss=0.4074, pruned_loss=0.1466, over 2270367.00 frames. ], batch size: 230, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:14:37,797 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:14:40,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=183870.0, ans=0.125 2023-06-18 10:15:59,064 INFO [train.py:996] (3/4) Epoch 2, batch 200, loss[loss=0.3689, simple_loss=0.4184, pruned_loss=0.1597, over 21573.00 frames. 
], tot_loss[loss=0.3425, simple_loss=0.4005, pruned_loss=0.1423, over 2714655.83 frames. ], batch size: 389, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:16:23,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=184230.0, ans=0.125 2023-06-18 10:16:26,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=184230.0, ans=0.0 2023-06-18 10:16:31,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=184230.0, ans=0.125 2023-06-18 10:17:03,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=184350.0, ans=0.0 2023-06-18 10:17:24,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.721e+02 4.278e+02 5.715e+02 9.625e+02, threshold=8.556e+02, percent-clipped=2.0 2023-06-18 10:17:33,199 INFO [train.py:996] (3/4) Epoch 2, batch 250, loss[loss=0.3596, simple_loss=0.4001, pruned_loss=0.1596, over 21246.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3946, pruned_loss=0.1414, over 3064748.78 frames. ], batch size: 176, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:17:52,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=184530.0, ans=0.125 2023-06-18 10:17:53,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=184530.0, ans=0.0 2023-06-18 10:18:15,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=184590.0, ans=0.0 2023-06-18 10:18:17,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=184590.0, ans=0.05 2023-06-18 10:18:52,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=184650.0, ans=0.2 2023-06-18 10:19:15,563 INFO [train.py:996] (3/4) Epoch 2, batch 300, loss[loss=0.3557, simple_loss=0.3936, pruned_loss=0.1589, over 21418.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3908, pruned_loss=0.1422, over 3336645.07 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:19:25,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=22.5 2023-06-18 10:20:05,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-18 10:20:39,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.502e+02 4.825e+02 6.381e+02 1.072e+03, threshold=9.650e+02, percent-clipped=6.0 2023-06-18 10:20:48,537 INFO [train.py:996] (3/4) Epoch 2, batch 350, loss[loss=0.3333, simple_loss=0.3605, pruned_loss=0.153, over 21595.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.3825, pruned_loss=0.1396, over 3550088.36 frames. 
], batch size: 415, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:21:35,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185190.0, ans=0.125 2023-06-18 10:21:42,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=185190.0, ans=0.0 2023-06-18 10:22:05,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-18 10:22:06,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=185250.0, ans=0.125 2023-06-18 10:22:12,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-18 10:22:20,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-18 10:22:26,773 INFO [train.py:996] (3/4) Epoch 2, batch 400, loss[loss=0.3301, simple_loss=0.3607, pruned_loss=0.1498, over 21624.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3738, pruned_loss=0.1368, over 3714429.13 frames. ], batch size: 298, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:23:09,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=185490.0, ans=0.0 2023-06-18 10:23:49,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.316e+02 4.407e+02 6.179e+02 1.311e+03, threshold=8.814e+02, percent-clipped=2.0 2023-06-18 10:23:58,109 INFO [train.py:996] (3/4) Epoch 2, batch 450, loss[loss=0.3212, simple_loss=0.3529, pruned_loss=0.1448, over 21546.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3679, pruned_loss=0.1335, over 3840646.10 frames. ], batch size: 442, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:24:35,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=185730.0, ans=0.125 2023-06-18 10:24:38,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185790.0, ans=0.1 2023-06-18 10:24:53,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185850.0, ans=0.1 2023-06-18 10:25:21,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=185910.0, ans=0.125 2023-06-18 10:25:33,374 INFO [train.py:996] (3/4) Epoch 2, batch 500, loss[loss=0.3386, simple_loss=0.372, pruned_loss=0.1526, over 21296.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3752, pruned_loss=0.1354, over 3939692.32 frames. ], batch size: 143, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:26:12,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-18 10:26:41,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=186150.0, ans=0.125 2023-06-18 10:26:45,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.02 vs. 
limit=22.5 2023-06-18 10:26:55,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.837e+02 4.927e+02 6.704e+02 1.422e+03, threshold=9.853e+02, percent-clipped=11.0 2023-06-18 10:27:04,391 INFO [train.py:996] (3/4) Epoch 2, batch 550, loss[loss=0.3191, simple_loss=0.3603, pruned_loss=0.139, over 21737.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3772, pruned_loss=0.1342, over 4013588.63 frames. ], batch size: 112, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:27:37,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=186330.0, ans=0.125 2023-06-18 10:27:42,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-18 10:27:59,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=186450.0, ans=0.015 2023-06-18 10:28:46,180 INFO [train.py:996] (3/4) Epoch 2, batch 600, loss[loss=0.336, simple_loss=0.3687, pruned_loss=0.1516, over 21523.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3794, pruned_loss=0.1339, over 4078298.98 frames. ], batch size: 195, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:28:56,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 10:29:04,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=186630.0, ans=0.07 2023-06-18 10:29:06,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=186630.0, ans=0.1 2023-06-18 10:29:17,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=186630.0, ans=0.2 2023-06-18 10:29:35,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=186750.0, ans=0.0 2023-06-18 10:29:56,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-18 10:30:00,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=186810.0, ans=0.0 2023-06-18 10:30:07,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.420e+02 4.275e+02 5.622e+02 1.549e+03, threshold=8.550e+02, percent-clipped=4.0 2023-06-18 10:30:18,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=186810.0, ans=0.125 2023-06-18 10:30:21,292 INFO [train.py:996] (3/4) Epoch 2, batch 650, loss[loss=0.2908, simple_loss=0.3373, pruned_loss=0.1221, over 21860.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3784, pruned_loss=0.1331, over 4120548.03 frames. ], batch size: 98, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:31:09,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=186990.0, ans=0.09899494936611666 2023-06-18 10:31:56,864 INFO [train.py:996] (3/4) Epoch 2, batch 700, loss[loss=0.2517, simple_loss=0.3017, pruned_loss=0.1008, over 21436.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3787, pruned_loss=0.1341, over 4163426.60 frames. 
], batch size: 212, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:32:13,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=187230.0, ans=0.125 2023-06-18 10:32:29,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=187230.0, ans=0.125 2023-06-18 10:32:36,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.44 vs. limit=22.5 2023-06-18 10:33:12,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=187410.0, ans=0.0 2023-06-18 10:33:18,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.924e+02 4.620e+02 5.995e+02 1.020e+03, threshold=9.239e+02, percent-clipped=3.0 2023-06-18 10:33:25,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=187410.0, ans=0.125 2023-06-18 10:33:32,496 INFO [train.py:996] (3/4) Epoch 2, batch 750, loss[loss=0.3561, simple_loss=0.3899, pruned_loss=0.1612, over 21660.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3804, pruned_loss=0.1363, over 4195362.96 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:33:43,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=187470.0, ans=0.0 2023-06-18 10:34:48,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=187710.0, ans=0.125 2023-06-18 10:35:07,274 INFO [train.py:996] (3/4) Epoch 2, batch 800, loss[loss=0.3157, simple_loss=0.3642, pruned_loss=0.1336, over 21523.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3778, pruned_loss=0.1357, over 4217810.39 frames. ], batch size: 441, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:35:20,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-06-18 10:35:39,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=187830.0, ans=0.125 2023-06-18 10:36:20,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188010.0, ans=0.1 2023-06-18 10:36:28,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.568e+02 4.374e+02 5.699e+02 1.207e+03, threshold=8.749e+02, percent-clipped=3.0 2023-06-18 10:36:31,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=188010.0, ans=0.0 2023-06-18 10:36:41,774 INFO [train.py:996] (3/4) Epoch 2, batch 850, loss[loss=0.3771, simple_loss=0.3898, pruned_loss=0.1822, over 21668.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3749, pruned_loss=0.1346, over 4231693.71 frames. ], batch size: 508, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:36:44,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-18 10:37:03,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=188130.0, ans=0.125 2023-06-18 10:38:17,629 INFO [train.py:996] (3/4) Epoch 2, batch 900, loss[loss=0.3186, simple_loss=0.4379, pruned_loss=0.09962, over 19759.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.371, pruned_loss=0.1336, over 4245987.40 frames. ], batch size: 703, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:38:19,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=188370.0, ans=0.0 2023-06-18 10:39:29,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188550.0, ans=0.125 2023-06-18 10:39:29,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=188550.0, ans=0.125 2023-06-18 10:39:45,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.152e+02 3.796e+02 5.190e+02 9.493e+02, threshold=7.592e+02, percent-clipped=1.0 2023-06-18 10:39:55,093 INFO [train.py:996] (3/4) Epoch 2, batch 950, loss[loss=0.2861, simple_loss=0.3586, pruned_loss=0.1068, over 21793.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3683, pruned_loss=0.1316, over 4258028.75 frames. ], batch size: 371, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:40:07,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=188670.0, ans=0.125 2023-06-18 10:40:20,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-18 10:40:24,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=188730.0, ans=0.125 2023-06-18 10:40:50,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=188790.0, ans=0.025 2023-06-18 10:41:24,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=22.5 2023-06-18 10:41:30,567 INFO [train.py:996] (3/4) Epoch 2, batch 1000, loss[loss=0.3515, simple_loss=0.3891, pruned_loss=0.1569, over 21601.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3713, pruned_loss=0.1338, over 4262797.60 frames. ], batch size: 471, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:41:32,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=188970.0, ans=0.125 2023-06-18 10:42:42,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-06-18 10:42:43,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=189150.0, ans=0.125 2023-06-18 10:42:49,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.55 vs. limit=22.5 2023-06-18 10:42:51,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-06-18 10:43:00,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.236e+02 4.022e+02 4.696e+02 7.726e+02, threshold=8.043e+02, percent-clipped=1.0 2023-06-18 10:43:09,839 INFO [train.py:996] (3/4) Epoch 2, batch 1050, loss[loss=0.3495, simple_loss=0.3868, pruned_loss=0.1561, over 21291.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3707, pruned_loss=0.1342, over 4269480.65 frames. ], batch size: 159, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:43:50,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-18 10:43:58,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189390.0, ans=0.125 2023-06-18 10:44:43,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2023-06-18 10:44:45,535 INFO [train.py:996] (3/4) Epoch 2, batch 1100, loss[loss=0.3295, simple_loss=0.3723, pruned_loss=0.1433, over 21835.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.3724, pruned_loss=0.135, over 4269543.45 frames. ], batch size: 298, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:45:14,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=189630.0, ans=0.0 2023-06-18 10:45:25,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-18 10:46:13,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 4.116e+02 5.178e+02 8.115e+02 1.294e+03, threshold=1.036e+03, percent-clipped=24.0 2023-06-18 10:46:27,135 INFO [train.py:996] (3/4) Epoch 2, batch 1150, loss[loss=0.2579, simple_loss=0.3342, pruned_loss=0.09078, over 21728.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3719, pruned_loss=0.1336, over 4272267.52 frames. ], batch size: 247, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:46:36,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=189870.0, ans=0.0 2023-06-18 10:47:14,025 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:47:17,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=189990.0, ans=0.0 2023-06-18 10:48:04,097 INFO [train.py:996] (3/4) Epoch 2, batch 1200, loss[loss=0.3538, simple_loss=0.4, pruned_loss=0.1538, over 21901.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3734, pruned_loss=0.1344, over 4278134.00 frames. 
], batch size: 316, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:48:04,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190170.0, ans=0.1 2023-06-18 10:48:31,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=190230.0, ans=0.125 2023-06-18 10:49:29,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190410.0, ans=0.1 2023-06-18 10:49:32,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.499e+02 6.126e+02 1.054e+03, threshold=8.999e+02, percent-clipped=1.0 2023-06-18 10:49:41,265 INFO [train.py:996] (3/4) Epoch 2, batch 1250, loss[loss=0.3337, simple_loss=0.363, pruned_loss=0.1522, over 21493.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3743, pruned_loss=0.1355, over 4270473.31 frames. ], batch size: 194, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:49:41,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=190470.0, ans=0.125 2023-06-18 10:49:48,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=190470.0, ans=0.07 2023-06-18 10:50:05,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=190530.0, ans=0.04949747468305833 2023-06-18 10:50:21,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190530.0, ans=0.1 2023-06-18 10:51:23,586 INFO [train.py:996] (3/4) Epoch 2, batch 1300, loss[loss=0.2945, simple_loss=0.3681, pruned_loss=0.1104, over 21829.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3754, pruned_loss=0.1358, over 4279495.18 frames. ], batch size: 332, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:51:39,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=12.0 2023-06-18 10:51:46,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-18 10:52:20,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190950.0, ans=0.1 2023-06-18 10:52:27,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=190950.0, ans=12.0 2023-06-18 10:52:39,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=191010.0, ans=0.5 2023-06-18 10:52:45,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.558e+02 4.499e+02 5.832e+02 1.027e+03, threshold=8.998e+02, percent-clipped=2.0 2023-06-18 10:53:00,044 INFO [train.py:996] (3/4) Epoch 2, batch 1350, loss[loss=0.2891, simple_loss=0.3644, pruned_loss=0.1069, over 21670.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3766, pruned_loss=0.1368, over 4285391.98 frames. 
], batch size: 247, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:53:31,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=191130.0, ans=0.2 2023-06-18 10:54:14,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-18 10:54:15,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191310.0, ans=0.125 2023-06-18 10:54:18,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191310.0, ans=0.1 2023-06-18 10:54:20,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=191310.0, ans=0.125 2023-06-18 10:54:21,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-18 10:54:41,370 INFO [train.py:996] (3/4) Epoch 2, batch 1400, loss[loss=0.3855, simple_loss=0.4282, pruned_loss=0.1714, over 21349.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3739, pruned_loss=0.1356, over 4289102.28 frames. ], batch size: 548, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:54:51,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=191370.0, ans=0.125 2023-06-18 10:55:11,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=191430.0, ans=0.2 2023-06-18 10:55:40,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=191550.0, ans=0.0 2023-06-18 10:55:51,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=191610.0, ans=0.0 2023-06-18 10:56:04,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.616e+02 4.167e+02 4.895e+02 9.301e+02, threshold=8.333e+02, percent-clipped=3.0 2023-06-18 10:56:18,027 INFO [train.py:996] (3/4) Epoch 2, batch 1450, loss[loss=0.2958, simple_loss=0.351, pruned_loss=0.1203, over 21647.00 frames. ], tot_loss[loss=0.3229, simple_loss=0.3737, pruned_loss=0.1361, over 4288894.79 frames. ], batch size: 112, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:56:39,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=191730.0, ans=0.0 2023-06-18 10:57:03,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=191790.0, ans=0.0 2023-06-18 10:57:08,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191850.0, ans=0.125 2023-06-18 10:57:54,113 INFO [train.py:996] (3/4) Epoch 2, batch 1500, loss[loss=0.2964, simple_loss=0.3858, pruned_loss=0.1035, over 19797.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3777, pruned_loss=0.139, over 4294199.39 frames. 
], batch size: 702, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:58:13,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=192030.0, ans=0.0 2023-06-18 10:59:16,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=192210.0, ans=0.0 2023-06-18 10:59:22,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.361e+02 4.007e+02 4.888e+02 8.078e+02, threshold=8.013e+02, percent-clipped=0.0 2023-06-18 10:59:32,391 INFO [train.py:996] (3/4) Epoch 2, batch 1550, loss[loss=0.2993, simple_loss=0.3567, pruned_loss=0.121, over 21011.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3751, pruned_loss=0.137, over 4292998.72 frames. ], batch size: 608, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:59:32,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=192270.0, ans=0.125 2023-06-18 10:59:50,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=192270.0, ans=0.0 2023-06-18 11:01:02,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=192510.0, ans=0.95 2023-06-18 11:01:16,372 INFO [train.py:996] (3/4) Epoch 2, batch 1600, loss[loss=0.2883, simple_loss=0.3599, pruned_loss=0.1084, over 21724.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3729, pruned_loss=0.1358, over 4284466.99 frames. ], batch size: 391, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:01:21,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=192570.0, ans=0.125 2023-06-18 11:02:05,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-18 11:02:37,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=192810.0, ans=0.125 2023-06-18 11:02:44,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.757e+02 4.591e+02 6.473e+02 1.240e+03, threshold=9.183e+02, percent-clipped=13.0 2023-06-18 11:02:54,024 INFO [train.py:996] (3/4) Epoch 2, batch 1650, loss[loss=0.3346, simple_loss=0.3818, pruned_loss=0.1436, over 21122.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3723, pruned_loss=0.1351, over 4283788.51 frames. ], batch size: 608, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:03:13,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0 2023-06-18 11:04:11,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=193110.0, ans=0.125 2023-06-18 11:04:31,330 INFO [train.py:996] (3/4) Epoch 2, batch 1700, loss[loss=0.3364, simple_loss=0.4042, pruned_loss=0.1344, over 21854.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3757, pruned_loss=0.1365, over 4282858.16 frames. 
], batch size: 316, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:04:57,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=193230.0, ans=0.125 2023-06-18 11:05:44,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-18 11:05:53,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=193410.0, ans=0.0 2023-06-18 11:05:56,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.866e+02 3.867e+02 4.593e+02 5.670e+02 8.844e+02, threshold=9.185e+02, percent-clipped=0.0 2023-06-18 11:06:05,791 INFO [train.py:996] (3/4) Epoch 2, batch 1750, loss[loss=0.2365, simple_loss=0.3096, pruned_loss=0.08175, over 21400.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3753, pruned_loss=0.1348, over 4266601.60 frames. ], batch size: 194, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:06:09,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=193470.0, ans=0.09899494936611666 2023-06-18 11:06:37,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.29 vs. limit=10.0 2023-06-18 11:06:46,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=193590.0, ans=0.035 2023-06-18 11:06:50,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=193590.0, ans=0.0 2023-06-18 11:06:52,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=193590.0, ans=0.125 2023-06-18 11:07:31,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=193710.0, ans=0.125 2023-06-18 11:07:39,036 INFO [train.py:996] (3/4) Epoch 2, batch 1800, loss[loss=0.2061, simple_loss=0.2774, pruned_loss=0.06743, over 21429.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3714, pruned_loss=0.1301, over 4267271.20 frames. ], batch size: 211, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:07:41,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193770.0, ans=0.1 2023-06-18 11:07:47,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=193770.0, ans=0.125 2023-06-18 11:08:20,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=15.0 2023-06-18 11:08:21,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=193830.0, ans=0.0 2023-06-18 11:08:38,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193890.0, ans=0.125 2023-06-18 11:08:52,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193950.0, ans=0.1 2023-06-18 11:09:00,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=194010.0, ans=0.125 2023-06-18 11:09:02,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-18 11:09:07,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.185e+02 3.782e+02 4.585e+02 7.556e+02, threshold=7.564e+02, percent-clipped=0.0 2023-06-18 11:09:17,312 INFO [train.py:996] (3/4) Epoch 2, batch 1850, loss[loss=0.355, simple_loss=0.4215, pruned_loss=0.1442, over 21453.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3706, pruned_loss=0.1271, over 4262232.55 frames. ], batch size: 507, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:09:33,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=194070.0, ans=0.125 2023-06-18 11:09:44,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=194130.0, ans=0.125 2023-06-18 11:09:54,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-18 11:09:55,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=194130.0, ans=0.2 2023-06-18 11:10:07,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=194190.0, ans=0.125 2023-06-18 11:10:11,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=194190.0, ans=0.0 2023-06-18 11:10:18,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-18 11:10:21,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-18 11:10:21,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-18 11:10:22,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=194250.0, ans=0.0 2023-06-18 11:10:48,407 INFO [train.py:996] (3/4) Epoch 2, batch 1900, loss[loss=0.3042, simple_loss=0.3634, pruned_loss=0.1225, over 21838.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3715, pruned_loss=0.1296, over 4273976.22 frames. 
], batch size: 371, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:11:15,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=194430.0, ans=10.0 2023-06-18 11:11:17,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=194430.0, ans=0.2 2023-06-18 11:11:56,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.68 vs. limit=22.5 2023-06-18 11:12:15,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.697e+02 4.739e+02 6.641e+02 1.232e+03, threshold=9.479e+02, percent-clipped=18.0 2023-06-18 11:12:24,851 INFO [train.py:996] (3/4) Epoch 2, batch 1950, loss[loss=0.4048, simple_loss=0.4423, pruned_loss=0.1836, over 21742.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3673, pruned_loss=0.1299, over 4281478.60 frames. ], batch size: 441, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:12:48,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=194730.0, ans=0.125 2023-06-18 11:12:50,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194730.0, ans=0.1 2023-06-18 11:12:55,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-18 11:13:26,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=194790.0, ans=0.05 2023-06-18 11:13:39,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=194850.0, ans=10.0 2023-06-18 11:14:14,860 INFO [train.py:996] (3/4) Epoch 2, batch 2000, loss[loss=0.2811, simple_loss=0.3428, pruned_loss=0.1096, over 21820.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.364, pruned_loss=0.1288, over 4279771.55 frames. ], batch size: 333, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:14:53,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195090.0, ans=0.1 2023-06-18 11:15:31,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.372e+02 4.347e+02 5.379e+02 1.010e+03, threshold=8.694e+02, percent-clipped=3.0 2023-06-18 11:15:36,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=195210.0, ans=0.125 2023-06-18 11:15:38,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=195210.0, ans=0.0 2023-06-18 11:15:45,806 INFO [train.py:996] (3/4) Epoch 2, batch 2050, loss[loss=0.3287, simple_loss=0.3751, pruned_loss=0.1412, over 21881.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.3667, pruned_loss=0.1293, over 4284683.62 frames. ], batch size: 351, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:16:01,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. 
limit=15.0 2023-06-18 11:16:28,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=195390.0, ans=0.2 2023-06-18 11:17:22,769 INFO [train.py:996] (3/4) Epoch 2, batch 2100, loss[loss=0.3022, simple_loss=0.3657, pruned_loss=0.1193, over 21823.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3703, pruned_loss=0.1309, over 4286266.54 frames. ], batch size: 102, lr: 1.94e-02, grad_scale: 64.0 2023-06-18 11:17:51,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=195630.0, ans=0.2 2023-06-18 11:17:59,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=195630.0, ans=0.125 2023-06-18 11:18:04,810 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:18:17,832 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:18:32,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-18 11:18:36,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=195810.0, ans=0.2 2023-06-18 11:18:52,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.854e+02 4.642e+02 6.317e+02 1.235e+03, threshold=9.284e+02, percent-clipped=5.0 2023-06-18 11:18:59,861 INFO [train.py:996] (3/4) Epoch 2, batch 2150, loss[loss=0.2876, simple_loss=0.3401, pruned_loss=0.1175, over 21608.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3727, pruned_loss=0.134, over 4292074.60 frames. ], batch size: 298, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:19:05,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=195870.0, ans=0.2 2023-06-18 11:19:11,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-18 11:20:32,185 INFO [train.py:996] (3/4) Epoch 2, batch 2200, loss[loss=0.2986, simple_loss=0.3691, pruned_loss=0.1141, over 21727.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3753, pruned_loss=0.1346, over 4288968.46 frames. ], batch size: 298, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:21:37,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=196410.0, ans=0.0 2023-06-18 11:21:51,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.406e+02 4.174e+02 5.326e+02 1.037e+03, threshold=8.349e+02, percent-clipped=3.0 2023-06-18 11:21:53,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=196410.0, ans=0.0 2023-06-18 11:21:59,030 INFO [train.py:996] (3/4) Epoch 2, batch 2250, loss[loss=0.2937, simple_loss=0.3415, pruned_loss=0.1229, over 21580.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3689, pruned_loss=0.1299, over 4281562.52 frames. 
2023-06-18 11:21:59,030 INFO [train.py:996] (3/4) Epoch 2, batch 2250, loss[loss=0.2937, simple_loss=0.3415, pruned_loss=0.1229, over 21580.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3689, pruned_loss=0.1299, over 4281562.52 frames. ], batch size: 263, lr: 1.94e-02, grad_scale: 32.0
2023-06-18 11:22:25,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=196530.0, ans=0.125
2023-06-18 11:22:42,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=196590.0, ans=0.0
2023-06-18 11:22:59,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=196650.0, ans=0.0
2023-06-18 11:23:00,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=196650.0, ans=0.2
2023-06-18 11:23:33,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196770.0, ans=0.1
2023-06-18 11:23:34,867 INFO [train.py:996] (3/4) Epoch 2, batch 2300, loss[loss=0.2804, simple_loss=0.3233, pruned_loss=0.1188, over 21759.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3645, pruned_loss=0.1302, over 4286306.60 frames. ], batch size: 124, lr: 1.94e-02, grad_scale: 32.0
2023-06-18 11:24:29,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5
2023-06-18 11:24:35,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.39 vs. limit=15.0
2023-06-18 11:24:55,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197010.0, ans=0.1
2023-06-18 11:24:58,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.491e+02 4.298e+02 5.253e+02 1.181e+03, threshold=8.597e+02, percent-clipped=4.0
2023-06-18 11:25:06,172 INFO [train.py:996] (3/4) Epoch 2, batch 2350, loss[loss=0.2799, simple_loss=0.3229, pruned_loss=0.1185, over 21613.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3593, pruned_loss=0.1308, over 4293806.16 frames. ], batch size: 298, lr: 1.94e-02, grad_scale: 32.0
2023-06-18 11:25:39,706 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 11:25:47,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0
2023-06-18 11:25:55,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=197190.0, ans=0.125
2023-06-18 11:26:17,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=197250.0, ans=0.125
2023-06-18 11:26:43,924 INFO [train.py:996] (3/4) Epoch 2, batch 2400, loss[loss=0.3679, simple_loss=0.414, pruned_loss=0.1609, over 21577.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3662, pruned_loss=0.1343, over 4290329.45 frames. ], batch size: 415, lr: 1.94e-02, grad_scale: 32.0
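The [scaling.py:962] Whitening records compare a feature-decorrelation statistic ("metric") against a limit for a named activation. One plausible reading, shown here as a hedged sketch rather than the exact definition used in scaling.py, is the ratio of the mean squared eigenvalue of the per-group feature covariance to its squared mean eigenvalue: it equals 1.0 when features are perfectly white and grows as energy concentrates in a few directions.

```python
import torch

# Illustrative whitening statistic (an assumption, not necessarily the exact
# formula behind the log lines). Uses trace identities: trace(C^2)/c gives the
# mean squared eigenvalue, trace(C)/c the mean eigenvalue.
def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels)
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups).transpose(0, 1)
    covar = x.transpose(1, 2) @ x / num_frames                    # (groups, c, c)
    mean_sq_eig = (covar ** 2).sum(dim=(1, 2)) / covar.shape[-1]  # trace(C^2)/c
    mean_eig = covar.diagonal(dim1=1, dim2=2).mean(dim=1)         # trace(C)/c
    return (mean_sq_eig / mean_eig ** 2).mean()

x = torch.randn(1000, 192)
print(whitening_metric(x))  # close to 1.0 for white noise
```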
2023-06-18 11:26:58,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=22.5
2023-06-18 11:27:07,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=197430.0, ans=0.0
2023-06-18 11:27:39,164 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 11:27:49,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197550.0, ans=0.1
2023-06-18 11:28:08,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=197610.0, ans=0.0
2023-06-18 11:28:18,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 3.750e+02 4.331e+02 6.076e+02 1.202e+03, threshold=8.663e+02, percent-clipped=8.0
2023-06-18 11:28:31,373 INFO [train.py:996] (3/4) Epoch 2, batch 2450, loss[loss=0.2881, simple_loss=0.34, pruned_loss=0.1181, over 21337.00 frames. ], tot_loss[loss=0.3227, simple_loss=0.372, pruned_loss=0.1367, over 4292465.34 frames. ], batch size: 194, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:28:31,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197670.0, ans=0.1
2023-06-18 11:28:45,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=197730.0, ans=0.125
2023-06-18 11:29:12,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0
2023-06-18 11:29:13,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=197790.0, ans=0.0
2023-06-18 11:30:04,079 INFO [train.py:996] (3/4) Epoch 2, batch 2500, loss[loss=0.3406, simple_loss=0.4045, pruned_loss=0.1383, over 21706.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3693, pruned_loss=0.1361, over 4283562.84 frames. ], batch size: 247, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:30:48,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=198090.0, ans=0.125
2023-06-18 11:31:14,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=198150.0, ans=0.2
2023-06-18 11:31:34,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.317e+02 4.381e+02 5.204e+02 7.754e+02, threshold=8.763e+02, percent-clipped=1.0
2023-06-18 11:31:46,955 INFO [train.py:996] (3/4) Epoch 2, batch 2550, loss[loss=0.3121, simple_loss=0.3504, pruned_loss=0.1369, over 21309.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.368, pruned_loss=0.1347, over 4274407.65 frames. ], batch size: 471, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:31:53,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198270.0, ans=0.125
2023-06-18 11:31:54,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=12.0
2023-06-18 11:31:59,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=198270.0, ans=0.125
2023-06-18 11:32:22,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=198390.0, ans=0.0
2023-06-18 11:32:32,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=198390.0, ans=0.125
2023-06-18 11:33:09,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=198510.0, ans=0.125
2023-06-18 11:33:18,191 INFO [train.py:996] (3/4) Epoch 2, batch 2600, loss[loss=0.3516, simple_loss=0.3962, pruned_loss=0.1535, over 21729.00 frames. ], tot_loss[loss=0.3197, simple_loss=0.3681, pruned_loss=0.1356, over 4265605.04 frames. ], batch size: 415, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:33:42,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198630.0, ans=0.125
2023-06-18 11:34:29,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198750.0, ans=0.1
2023-06-18 11:34:43,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 11:34:47,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.436e+02 4.244e+02 5.240e+02 1.197e+03, threshold=8.488e+02, percent-clipped=2.0
2023-06-18 11:34:55,295 INFO [train.py:996] (3/4) Epoch 2, batch 2650, loss[loss=0.3167, simple_loss=0.3601, pruned_loss=0.1366, over 19996.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3717, pruned_loss=0.1382, over 4276330.79 frames. ], batch size: 702, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:36:39,073 INFO [train.py:996] (3/4) Epoch 2, batch 2700, loss[loss=0.324, simple_loss=0.4349, pruned_loss=0.1065, over 19845.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3707, pruned_loss=0.1357, over 4270736.31 frames. ], batch size: 703, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:36:48,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=199170.0, ans=0.2
2023-06-18 11:37:00,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=199230.0, ans=0.0
2023-06-18 11:37:13,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0
2023-06-18 11:37:34,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199350.0, ans=0.125
2023-06-18 11:37:40,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0
2023-06-18 11:37:52,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199350.0, ans=0.0
2023-06-18 11:38:02,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 4.256e+02 5.020e+02 6.245e+02 1.096e+03, threshold=1.004e+03, percent-clipped=9.0
2023-06-18 11:38:14,782 INFO [train.py:996] (3/4) Epoch 2, batch 2750, loss[loss=0.355, simple_loss=0.3884, pruned_loss=0.1608, over 21878.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3715, pruned_loss=0.1359, over 4270842.13 frames. ], batch size: 351, lr: 1.93e-02, grad_scale: 32.0
2023-06-18 11:38:15,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0
2023-06-18 11:38:26,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=199470.0, ans=0.125
2023-06-18 11:38:27,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199470.0, ans=0.125
2023-06-18 11:39:11,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=199650.0, ans=0.0
2023-06-18 11:39:17,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.05 vs. limit=22.5
2023-06-18 11:39:55,066 INFO [train.py:996] (3/4) Epoch 2, batch 2800, loss[loss=0.4192, simple_loss=0.4694, pruned_loss=0.1845, over 21673.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3754, pruned_loss=0.1361, over 4272991.51 frames. ], batch size: 389, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:40:12,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0
2023-06-18 11:41:25,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 3.620e+02 4.325e+02 5.387e+02 9.118e+02, threshold=8.651e+02, percent-clipped=0.0
2023-06-18 11:41:27,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=200010.0, ans=0.125
2023-06-18 11:41:33,771 INFO [train.py:996] (3/4) Epoch 2, batch 2850, loss[loss=0.3373, simple_loss=0.3903, pruned_loss=0.1422, over 21652.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3759, pruned_loss=0.1364, over 4273267.27 frames. ], batch size: 414, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:41:35,664 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.964e-01
2023-06-18 11:41:37,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=200070.0, ans=0.125
2023-06-18 11:42:21,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0
2023-06-18 11:42:25,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0
2023-06-18 11:43:10,715 INFO [train.py:996] (3/4) Epoch 2, batch 2900, loss[loss=0.3858, simple_loss=0.4085, pruned_loss=0.1815, over 21853.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3731, pruned_loss=0.135, over 4275888.29 frames. ], batch size: 371, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:43:22,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.0
2023-06-18 11:43:25,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0
2023-06-18 11:43:52,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=200490.0, ans=0.0
2023-06-18 11:43:54,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=200490.0, ans=0.125
2023-06-18 11:44:39,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.023e+02 4.917e+02 6.862e+02 1.107e+03, threshold=9.834e+02, percent-clipped=8.0
2023-06-18 11:44:47,197 INFO [train.py:996] (3/4) Epoch 2, batch 2950, loss[loss=0.2891, simple_loss=0.3385, pruned_loss=0.1199, over 21613.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3714, pruned_loss=0.1343, over 4279300.73 frames. ], batch size: 263, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:45:03,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=200730.0, ans=0.125
2023-06-18 11:45:15,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=200730.0, ans=0.5
2023-06-18 11:46:03,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=200910.0, ans=0.125
2023-06-18 11:46:09,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0
2023-06-18 11:46:10,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0
2023-06-18 11:46:17,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=200910.0, ans=0.125
2023-06-18 11:46:20,631 INFO [train.py:996] (3/4) Epoch 2, batch 3000, loss[loss=0.3071, simple_loss=0.3778, pruned_loss=0.1182, over 21653.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3767, pruned_loss=0.1364, over 4282508.27 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:46:20,632 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 11:46:33,347 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.3476, 1.7051, 3.7781, 2.6928], device='cuda:3')
2023-06-18 11:46:34,348 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9903, 3.2270, 3.2414, 2.9034], device='cuda:3')
2023-06-18 11:46:36,277 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2851, simple_loss=0.377, pruned_loss=0.09657, over 1796401.00 frames.
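During the validation pass the [zipformer.py:1728] records print one entropy value per attention head for selected self-attention modules, a common diagnostic for heads that collapse onto a single key (entropy near 0) or stay uniform (entropy near log of the key count). A hedged sketch of such a statistic; the tensor shape here is an assumption:

```python
import torch

# Per-head entropy of attention weights, averaged over query positions.
def attn_weights_entropy(attn):
    # attn: (num_heads, num_queries, num_keys), rows softmax-normalized
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (num_heads, num_queries)
    return ent.mean(dim=-1)                           # (num_heads,)

attn = torch.softmax(torch.randn(4, 10, 50), dim=-1)
print(attn_weights_entropy(attn))  # 4 values, like the logged tensors above
```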
2023-06-18 11:46:36,278 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-18 11:47:03,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=201030.0, ans=0.125
2023-06-18 11:47:53,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=201150.0, ans=0.0
2023-06-18 11:48:07,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.149e+02 4.031e+02 5.261e+02 8.201e+02, threshold=8.061e+02, percent-clipped=0.0
2023-06-18 11:48:15,240 INFO [train.py:996] (3/4) Epoch 2, batch 3050, loss[loss=0.2564, simple_loss=0.3371, pruned_loss=0.08786, over 21745.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3765, pruned_loss=0.1354, over 4283511.28 frames. ], batch size: 298, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:48:22,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=201270.0, ans=0.125
2023-06-18 11:48:34,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0
2023-06-18 11:50:02,711 INFO [train.py:996] (3/4) Epoch 2, batch 3100, loss[loss=0.3081, simple_loss=0.3867, pruned_loss=0.1147, over 21680.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3756, pruned_loss=0.1337, over 4288057.51 frames. ], batch size: 441, lr: 1.92e-02, grad_scale: 32.0
2023-06-18 11:50:03,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=201570.0, ans=0.0
2023-06-18 11:50:09,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=201570.0, ans=0.035
2023-06-18 11:50:52,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=201690.0, ans=0.125
2023-06-18 11:51:10,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0
2023-06-18 11:51:31,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.312e+02 4.167e+02 4.991e+02 9.720e+02, threshold=8.334e+02, percent-clipped=2.0
2023-06-18 11:51:37,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=201870.0, ans=0.2
2023-06-18 11:51:39,060 INFO [train.py:996] (3/4) Epoch 2, batch 3150, loss[loss=0.3564, simple_loss=0.4129, pruned_loss=0.1499, over 21667.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3754, pruned_loss=0.1328, over 4286194.55 frames. ], batch size: 441, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:51:48,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5
2023-06-18 11:52:17,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201930.0, ans=0.0
2023-06-18 11:52:20,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=201990.0, ans=0.2
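The "Maximum memory allocated" line above looks like the standard CUDA peak-memory counter reported per device in MB; a sketch of how such a figure can be obtained (the helper name is illustrative):

```python
import torch

def max_memory_mb(device=0):
    # Peak bytes ever allocated on this device by the caching allocator
    return torch.cuda.max_memory_allocated(device) // (1024 * 1024)
```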
2023-06-18 11:52:33,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. limit=6.0
2023-06-18 11:53:01,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0
2023-06-18 11:53:02,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=202110.0, ans=0.05
2023-06-18 11:53:03,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=202110.0, ans=0.07
2023-06-18 11:53:05,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=202110.0, ans=0.125
2023-06-18 11:53:11,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202110.0, ans=0.1
2023-06-18 11:53:22,363 INFO [train.py:996] (3/4) Epoch 2, batch 3200, loss[loss=0.2638, simple_loss=0.333, pruned_loss=0.09726, over 21746.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3764, pruned_loss=0.1333, over 4287585.72 frames. ], batch size: 247, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:53:38,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=202170.0, ans=0.125
2023-06-18 11:54:32,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=202350.0, ans=0.0
2023-06-18 11:54:51,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.651e+02 4.650e+02 5.913e+02 1.032e+03, threshold=9.300e+02, percent-clipped=10.0
2023-06-18 11:55:04,313 INFO [train.py:996] (3/4) Epoch 2, batch 3250, loss[loss=0.3305, simple_loss=0.37, pruned_loss=0.1455, over 21491.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3796, pruned_loss=0.1361, over 4282273.11 frames. ], batch size: 389, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:55:20,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202530.0, ans=0.0
2023-06-18 11:55:23,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=202530.0, ans=0.2
2023-06-18 11:55:24,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202530.0, ans=0.1
2023-06-18 11:56:08,604 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.998e-02
2023-06-18 11:56:09,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0
2023-06-18 11:56:43,221 INFO [train.py:996] (3/4) Epoch 2, batch 3300, loss[loss=0.3825, simple_loss=0.3913, pruned_loss=0.1868, over 21315.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3757, pruned_loss=0.1368, over 4283377.56 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:56:52,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0
2023-06-18 11:57:07,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=202830.0, ans=0.95
2023-06-18 11:57:27,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=202890.0, ans=0.2
2023-06-18 11:57:30,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=202890.0, ans=0.125
2023-06-18 11:57:32,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0
2023-06-18 11:58:08,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.845e+02 4.643e+02 5.721e+02 1.092e+03, threshold=9.285e+02, percent-clipped=5.0
2023-06-18 11:58:16,403 INFO [train.py:996] (3/4) Epoch 2, batch 3350, loss[loss=0.3432, simple_loss=0.3793, pruned_loss=0.1536, over 21583.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3794, pruned_loss=0.1362, over 4276171.32 frames. ], batch size: 548, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:58:22,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203070.0, ans=0.1
2023-06-18 11:59:04,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=203190.0, ans=0.125
2023-06-18 11:59:19,991 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 11:59:35,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=203250.0, ans=0.2
2023-06-18 11:59:53,092 INFO [train.py:996] (3/4) Epoch 2, batch 3400, loss[loss=0.3175, simple_loss=0.3679, pruned_loss=0.1335, over 21508.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3796, pruned_loss=0.1375, over 4286267.29 frames. ], batch size: 230, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 11:59:53,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=203370.0, ans=0.0
2023-06-18 12:00:31,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=203490.0, ans=0.2
2023-06-18 12:00:50,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=203550.0, ans=15.0
2023-06-18 12:01:02,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=203550.0, ans=0.2
2023-06-18 12:01:18,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.307e+02 4.139e+02 5.241e+02 1.031e+03, threshold=8.278e+02, percent-clipped=1.0
2023-06-18 12:01:18,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=203610.0, ans=0.125
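The per-batch [train.py:996] records report three loss figures whose relationship can be read off the logged numbers themselves: loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3796 + 0.1375 = 0.3273 in the batch 3400 tot_loss above, and the same identity holds for the validation record earlier). A sketch of that combination; the function name is illustrative and the 0.5 weight is inferred from the records, not quoted from the training code:

```python
# Total loss as the logged numbers suggest:
# total = simple_loss_scale * simple_loss + pruned_loss, with scale 0.5.
def combine_losses(simple_loss, pruned_loss, simple_loss_scale=0.5):
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combine_losses(0.3796, 0.1375) - 0.3273) < 1e-4  # batch 3400 tot_loss
```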
2023-06-18 12:01:26,572 INFO [train.py:996] (3/4) Epoch 2, batch 3450, loss[loss=0.4865, simple_loss=0.5008, pruned_loss=0.2361, over 21383.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.3741, pruned_loss=0.1374, over 4281879.13 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 12:01:28,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=203670.0, ans=0.125
2023-06-18 12:01:42,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=203730.0, ans=0.125
2023-06-18 12:02:17,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=203790.0, ans=0.125
2023-06-18 12:02:32,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0
2023-06-18 12:02:41,754 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 12:02:57,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=203910.0, ans=0.125
2023-06-18 12:03:01,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=203910.0, ans=10.0
2023-06-18 12:03:03,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=203910.0, ans=0.125
2023-06-18 12:03:06,550 INFO [train.py:996] (3/4) Epoch 2, batch 3500, loss[loss=0.4523, simple_loss=0.4821, pruned_loss=0.2113, over 21427.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3832, pruned_loss=0.1422, over 4278977.45 frames. ], batch size: 471, lr: 1.91e-02, grad_scale: 32.0
2023-06-18 12:03:49,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=204030.0, ans=0.07
2023-06-18 12:04:35,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.628e+02 4.522e+02 5.964e+02 1.068e+03, threshold=9.044e+02, percent-clipped=7.0
2023-06-18 12:04:35,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=204210.0, ans=0.0
2023-06-18 12:04:47,804 INFO [train.py:996] (3/4) Epoch 2, batch 3550, loss[loss=0.2891, simple_loss=0.3322, pruned_loss=0.123, over 21615.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3861, pruned_loss=0.1438, over 4283899.74 frames. ], batch size: 298, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:05:14,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0
2023-06-18 12:05:32,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=204390.0, ans=0.0
2023-06-18 12:06:26,608 INFO [train.py:996] (3/4) Epoch 2, batch 3600, loss[loss=0.412, simple_loss=0.429, pruned_loss=0.1975, over 21365.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.38, pruned_loss=0.1425, over 4280319.11 frames. ], batch size: 471, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:06:41,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=204570.0, ans=0.0
2023-06-18 12:06:43,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0
2023-06-18 12:07:09,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=204630.0, ans=0.125
2023-06-18 12:07:31,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5
2023-06-18 12:07:45,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204810.0, ans=0.1
2023-06-18 12:07:57,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.619e+02 4.242e+02 5.545e+02 1.042e+03, threshold=8.484e+02, percent-clipped=1.0
2023-06-18 12:08:05,224 INFO [train.py:996] (3/4) Epoch 2, batch 3650, loss[loss=0.2923, simple_loss=0.3569, pruned_loss=0.1138, over 21772.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.38, pruned_loss=0.1422, over 4281430.23 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:08:06,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0
2023-06-18 12:08:19,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=204870.0, ans=0.0
2023-06-18 12:08:24,959 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0
2023-06-18 12:08:37,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.52 vs. limit=10.0
2023-06-18 12:08:38,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=204930.0, ans=0.0
2023-06-18 12:09:11,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=205050.0, ans=0.0
2023-06-18 12:09:38,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=205110.0, ans=0.125
2023-06-18 12:09:41,417 INFO [train.py:996] (3/4) Epoch 2, batch 3700, loss[loss=0.3189, simple_loss=0.3777, pruned_loss=0.1301, over 21783.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3781, pruned_loss=0.1411, over 4274939.31 frames. ], batch size: 389, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:10:39,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0
2023-06-18 12:11:09,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.219e+02 3.762e+02 4.567e+02 1.013e+03, threshold=7.524e+02, percent-clipped=2.0
2023-06-18 12:11:10,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=205410.0, ans=0.2
2023-06-18 12:11:22,718 INFO [train.py:996] (3/4) Epoch 2, batch 3750, loss[loss=0.3088, simple_loss=0.3695, pruned_loss=0.1241, over 21569.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3761, pruned_loss=0.14, over 4278136.90 frames. ], batch size: 471, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:11:29,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205470.0, ans=0.1
2023-06-18 12:11:29,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=205470.0, ans=0.2
2023-06-18 12:11:56,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.23 vs. limit=6.0
2023-06-18 12:12:00,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=205530.0, ans=0.1
2023-06-18 12:12:47,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=205710.0, ans=0.125
2023-06-18 12:12:59,779 INFO [train.py:996] (3/4) Epoch 2, batch 3800, loss[loss=0.3715, simple_loss=0.4098, pruned_loss=0.1666, over 21803.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3735, pruned_loss=0.1374, over 4278549.39 frames. ], batch size: 441, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:13:19,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5
2023-06-18 12:13:23,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=205830.0, ans=0.0
2023-06-18 12:13:27,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0
2023-06-18 12:13:30,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.99 vs. limit=15.0
2023-06-18 12:14:28,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.435e+02 4.330e+02 5.504e+02 8.212e+02, threshold=8.659e+02, percent-clipped=4.0
2023-06-18 12:14:29,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2023-06-18 12:14:37,044 INFO [train.py:996] (3/4) Epoch 2, batch 3850, loss[loss=0.2614, simple_loss=0.3041, pruned_loss=0.1093, over 21566.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3712, pruned_loss=0.1374, over 4276943.54 frames. ], batch size: 231, lr: 1.90e-02, grad_scale: 32.0
2023-06-18 12:15:06,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=206130.0, ans=0.125
2023-06-18 12:15:18,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=206190.0, ans=0.1
2023-06-18 12:15:24,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=206190.0, ans=0.0
2023-06-18 12:16:13,512 INFO [train.py:996] (3/4) Epoch 2, batch 3900, loss[loss=0.3003, simple_loss=0.3504, pruned_loss=0.1251, over 21903.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3648, pruned_loss=0.1357, over 4284267.28 frames. ], batch size: 351, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:16:16,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=206370.0, ans=0.125
2023-06-18 12:16:56,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=206490.0, ans=0.2
2023-06-18 12:17:30,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=206550.0, ans=0.0
2023-06-18 12:17:40,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206610.0, ans=0.125
2023-06-18 12:17:42,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.558e+02 3.735e+02 4.662e+02 6.230e+02 1.205e+03, threshold=9.323e+02, percent-clipped=9.0
2023-06-18 12:17:47,976 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 12:17:50,613 INFO [train.py:996] (3/4) Epoch 2, batch 3950, loss[loss=0.2692, simple_loss=0.3469, pruned_loss=0.09573, over 21647.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.368, pruned_loss=0.1352, over 4289415.73 frames. ], batch size: 414, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:17:57,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=206670.0, ans=0.125
2023-06-18 12:18:28,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=206730.0, ans=0.0
2023-06-18 12:18:29,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=206790.0, ans=0.0
2023-06-18 12:18:45,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0
2023-06-18 12:19:14,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=206910.0, ans=0.2
2023-06-18 12:19:27,480 INFO [train.py:996] (3/4) Epoch 2, batch 4000, loss[loss=0.2338, simple_loss=0.2882, pruned_loss=0.08967, over 21377.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.3644, pruned_loss=0.1325, over 4285168.87 frames. ], batch size: 131, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:19:46,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207030.0, ans=0.1
2023-06-18 12:19:51,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=207030.0, ans=0.125
2023-06-18 12:20:04,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=207090.0, ans=0.0
2023-06-18 12:20:50,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.512e+02 4.093e+02 5.285e+02 8.562e+02, threshold=8.187e+02, percent-clipped=0.0
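The "grad_scale" field in the train records behaves like an automatic mixed-precision loss scale (it doubles from 32.0 to 64.0 earlier in the log and drops back after overflows). A sketch of standard PyTorch dynamic loss scaling in that spirit; this assumes a CUDA device and is not a quote of the training script:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0)
model = torch.nn.Linear(80, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 80, device="cuda")
with autocast():
    loss = model(x).square().mean()
scaler.scale(loss).backward()  # scale up so fp16 grads do not underflow
scaler.step(opt)               # unscales; skips the step if grads overflowed
scaler.update()                # grows the scale again after stable steps
print(scaler.get_scale())      # the kind of value logged as grad_scale
```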
2023-06-18 12:21:03,152 INFO [train.py:996] (3/4) Epoch 2, batch 4050, loss[loss=0.3072, simple_loss=0.3825, pruned_loss=0.1159, over 21514.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3635, pruned_loss=0.1301, over 4280183.29 frames. ], batch size: 471, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:21:18,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=207270.0, ans=0.125
2023-06-18 12:21:45,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=207390.0, ans=0.5
2023-06-18 12:22:10,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0
2023-06-18 12:22:13,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=12.0
2023-06-18 12:22:14,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=207450.0, ans=0.125
2023-06-18 12:22:44,790 INFO [train.py:996] (3/4) Epoch 2, batch 4100, loss[loss=0.3717, simple_loss=0.3988, pruned_loss=0.1724, over 20087.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3636, pruned_loss=0.13, over 4281894.33 frames. ], batch size: 703, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:23:07,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0
2023-06-18 12:23:47,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=207750.0, ans=0.07
2023-06-18 12:24:06,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=22.5
2023-06-18 12:24:08,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.979e+02 3.498e+02 4.033e+02 7.129e+02, threshold=6.997e+02, percent-clipped=0.0
2023-06-18 12:24:20,656 INFO [train.py:996] (3/4) Epoch 2, batch 4150, loss[loss=0.3047, simple_loss=0.3467, pruned_loss=0.1314, over 16040.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3626, pruned_loss=0.1262, over 4266798.30 frames. ], batch size: 60, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:24:33,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=207870.0, ans=10.0
2023-06-18 12:24:36,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=207930.0, ans=0.0
2023-06-18 12:25:13,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=207990.0, ans=0.0
2023-06-18 12:25:58,994 INFO [train.py:996] (3/4) Epoch 2, batch 4200, loss[loss=0.296, simple_loss=0.337, pruned_loss=0.1275, over 21265.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3612, pruned_loss=0.1247, over 4270552.11 frames. ], batch size: 176, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:27:28,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=208410.0, ans=0.125
2023-06-18 12:27:31,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.591e+02 4.541e+02 5.629e+02 1.049e+03, threshold=9.081e+02, percent-clipped=10.0
2023-06-18 12:27:38,033 INFO [train.py:996] (3/4) Epoch 2, batch 4250, loss[loss=0.3125, simple_loss=0.3705, pruned_loss=0.1272, over 21740.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3709, pruned_loss=0.1294, over 4270962.43 frames. ], batch size: 247, lr: 1.89e-02, grad_scale: 32.0
2023-06-18 12:27:44,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=208470.0, ans=0.125
2023-06-18 12:28:12,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=208530.0, ans=0.125
2023-06-18 12:29:25,628 INFO [train.py:996] (3/4) Epoch 2, batch 4300, loss[loss=0.3325, simple_loss=0.3691, pruned_loss=0.148, over 20094.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3775, pruned_loss=0.1327, over 4273430.70 frames. ], batch size: 702, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:29:42,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0
2023-06-18 12:30:12,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=208890.0, ans=0.125
2023-06-18 12:31:04,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.305e+02 3.923e+02 4.922e+02 1.064e+03, threshold=7.846e+02, percent-clipped=2.0
2023-06-18 12:31:10,103 INFO [train.py:996] (3/4) Epoch 2, batch 4350, loss[loss=0.3393, simple_loss=0.3946, pruned_loss=0.142, over 21054.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3736, pruned_loss=0.1302, over 4271399.23 frames. ], batch size: 608, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:31:33,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0
2023-06-18 12:31:34,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=209130.0, ans=0.0
2023-06-18 12:32:05,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=209250.0, ans=0.0
2023-06-18 12:32:37,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=209310.0, ans=0.015
2023-06-18 12:32:47,835 INFO [train.py:996] (3/4) Epoch 2, batch 4400, loss[loss=0.3601, simple_loss=0.4296, pruned_loss=0.1453, over 21606.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3699, pruned_loss=0.1296, over 4261987.46 frames. ], batch size: 414, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:33:29,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0
2023-06-18 12:33:35,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209490.0, ans=0.1
2023-06-18 12:33:55,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=209550.0, ans=0.0
2023-06-18 12:34:08,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=209610.0, ans=0.0
2023-06-18 12:34:19,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.653e+02 4.609e+02 5.842e+02 1.096e+03, threshold=9.217e+02, percent-clipped=5.0
2023-06-18 12:34:29,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=209670.0, ans=10.0
2023-06-18 12:34:30,760 INFO [train.py:996] (3/4) Epoch 2, batch 4450, loss[loss=0.3274, simple_loss=0.3798, pruned_loss=0.1375, over 21199.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3774, pruned_loss=0.1317, over 4260010.43 frames. ], batch size: 159, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:34:47,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209670.0, ans=0.125
2023-06-18 12:35:01,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=209730.0, ans=0.125
2023-06-18 12:35:15,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=209790.0, ans=0.0
2023-06-18 12:35:40,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=209850.0, ans=0.125
2023-06-18 12:35:41,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0
2023-06-18 12:35:41,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=209850.0, ans=0.2
2023-06-18 12:35:58,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0
2023-06-18 12:36:06,126 INFO [train.py:996] (3/4) Epoch 2, batch 4500, loss[loss=0.3172, simple_loss=0.3902, pruned_loss=0.1221, over 21412.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3791, pruned_loss=0.1337, over 4264977.28 frames. ], batch size: 211, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:36:56,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=210090.0, ans=0.2
2023-06-18 12:37:05,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=210090.0, ans=0.0
2023-06-18 12:37:08,033 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 12:37:37,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.555e+02 4.065e+02 4.925e+02 8.814e+02, threshold=8.131e+02, percent-clipped=0.0
2023-06-18 12:37:48,819 INFO [train.py:996] (3/4) Epoch 2, batch 4550, loss[loss=0.2957, simple_loss=0.3708, pruned_loss=0.1103, over 21796.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3832, pruned_loss=0.1342, over 4268021.72 frames. ], batch size: 282, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:38:09,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210330.0, ans=0.125
2023-06-18 12:38:39,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=210390.0, ans=0.125
2023-06-18 12:38:50,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=12.0
2023-06-18 12:39:13,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=210510.0, ans=10.0
2023-06-18 12:39:14,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=210510.0, ans=0.125
2023-06-18 12:39:25,104 INFO [train.py:996] (3/4) Epoch 2, batch 4600, loss[loss=0.3046, simple_loss=0.3757, pruned_loss=0.1168, over 21685.00 frames. ], tot_loss[loss=0.3315, simple_loss=0.3884, pruned_loss=0.1373, over 4272976.21 frames. ], batch size: 389, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:39:44,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=210630.0, ans=0.125
2023-06-18 12:40:10,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=210690.0, ans=0.0
2023-06-18 12:40:12,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=210690.0, ans=0.125
2023-06-18 12:40:56,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 3.566e+02 4.300e+02 5.335e+02 1.700e+03, threshold=8.600e+02, percent-clipped=8.0
2023-06-18 12:41:02,494 INFO [train.py:996] (3/4) Epoch 2, batch 4650, loss[loss=0.3359, simple_loss=0.3863, pruned_loss=0.1428, over 19974.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3821, pruned_loss=0.1336, over 4276045.43 frames. ], batch size: 702, lr: 1.88e-02, grad_scale: 32.0
2023-06-18 12:41:05,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=210870.0, ans=0.0
2023-06-18 12:41:48,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=210990.0, ans=0.0
2023-06-18 12:42:32,547 INFO [train.py:996] (3/4) Epoch 2, batch 4700, loss[loss=0.2876, simple_loss=0.3268, pruned_loss=0.1242, over 21494.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3717, pruned_loss=0.1304, over 4269184.94 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:42:34,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=211170.0, ans=0.0
2023-06-18 12:42:43,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=211170.0, ans=0.0
2023-06-18 12:42:51,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=211230.0, ans=0.1
2023-06-18 12:43:12,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=211290.0, ans=0.07
2023-06-18 12:44:00,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.242e+02 4.193e+02 5.636e+02 1.011e+03, threshold=8.385e+02, percent-clipped=2.0
2023-06-18 12:44:06,697 INFO [train.py:996] (3/4) Epoch 2, batch 4750, loss[loss=0.3136, simple_loss=0.35, pruned_loss=0.1386, over 21862.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3667, pruned_loss=0.1295, over 4266127.01 frames. ], batch size: 373, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:44:19,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.90 vs. limit=6.0
2023-06-18 12:44:27,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=211530.0, ans=0.125
2023-06-18 12:44:52,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0
2023-06-18 12:45:02,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0
2023-06-18 12:45:34,280 INFO [train.py:996] (3/4) Epoch 2, batch 4800, loss[loss=0.2971, simple_loss=0.344, pruned_loss=0.125, over 21646.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3683, pruned_loss=0.1315, over 4271779.20 frames. ], batch size: 263, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:45:49,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0
2023-06-18 12:46:09,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0
2023-06-18 12:46:11,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=22.5
2023-06-18 12:46:30,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211950.0, ans=0.1
2023-06-18 12:46:42,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=211950.0, ans=0.2
2023-06-18 12:47:01,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.590e+02 4.523e+02 5.544e+02 1.095e+03, threshold=9.046e+02, percent-clipped=1.0
2023-06-18 12:47:07,535 INFO [train.py:996] (3/4) Epoch 2, batch 4850, loss[loss=0.2962, simple_loss=0.3424, pruned_loss=0.125, over 21469.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3674, pruned_loss=0.1309, over 4274439.22 frames. ], batch size: 212, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:47:11,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0
2023-06-18 12:47:21,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=212130.0, ans=0.125
2023-06-18 12:47:30,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=212130.0, ans=0.05
2023-06-18 12:47:38,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0
2023-06-18 12:48:19,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0
2023-06-18 12:48:30,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=212310.0, ans=0.02
2023-06-18 12:48:30,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=212310.0, ans=0.125
2023-06-18 12:48:35,059 INFO [train.py:996] (3/4) Epoch 2, batch 4900, loss[loss=0.3296, simple_loss=0.4004, pruned_loss=0.1294, over 21803.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3707, pruned_loss=0.1324, over 4286238.39 frames. ], batch size: 282, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:49:55,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=212610.0, ans=0.125
2023-06-18 12:50:03,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212610.0, ans=0.1
2023-06-18 12:50:06,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.496e+02 4.489e+02 5.539e+02 1.137e+03, threshold=8.978e+02, percent-clipped=3.0
2023-06-18 12:50:12,994 INFO [train.py:996] (3/4) Epoch 2, batch 4950, loss[loss=0.2723, simple_loss=0.3584, pruned_loss=0.09308, over 21592.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3726, pruned_loss=0.13, over 4283854.58 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:50:36,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=212730.0, ans=0.0
2023-06-18 12:50:45,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.28 vs. limit=15.0
2023-06-18 12:51:37,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=212910.0, ans=0.125
2023-06-18 12:51:42,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0
2023-06-18 12:51:46,441 INFO [train.py:996] (3/4) Epoch 2, batch 5000, loss[loss=0.249, simple_loss=0.3216, pruned_loss=0.08826, over 21218.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3681, pruned_loss=0.1243, over 4282666.91 frames. ], batch size: 176, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:52:48,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=213150.0, ans=0.125
2023-06-18 12:53:06,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.969e+02 3.666e+02 4.897e+02 8.510e+02, threshold=7.332e+02, percent-clipped=0.0
2023-06-18 12:53:12,738 INFO [train.py:996] (3/4) Epoch 2, batch 5050, loss[loss=0.3091, simple_loss=0.3514, pruned_loss=0.1334, over 21363.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3685, pruned_loss=0.1277, over 4282993.21 frames. ], batch size: 143, lr: 1.87e-02, grad_scale: 32.0
2023-06-18 12:54:43,408 INFO [train.py:996] (3/4) Epoch 2, batch 5100, loss[loss=0.3213, simple_loss=0.3701, pruned_loss=0.1363, over 21742.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3659, pruned_loss=0.1282, over 4291280.17 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 12:54:50,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-18 12:56:13,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.389e+02 4.046e+02 5.054e+02 9.083e+02, threshold=8.093e+02, percent-clipped=6.0
2023-06-18 12:56:19,859 INFO [train.py:996] (3/4) Epoch 2, batch 5150, loss[loss=0.4249, simple_loss=0.4531, pruned_loss=0.1983, over 21625.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3649, pruned_loss=0.129, over 4297437.43 frames. ], batch size: 508, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 12:56:26,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=213870.0, ans=0.0
2023-06-18 12:57:11,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=213990.0, ans=15.0
2023-06-18 12:57:48,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0
2023-06-18 12:57:56,077 INFO [train.py:996] (3/4) Epoch 2, batch 5200, loss[loss=0.3582, simple_loss=0.4362, pruned_loss=0.1401, over 21692.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3635, pruned_loss=0.1284, over 4293117.72 frames. ], batch size: 414, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 12:58:55,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=214290.0, ans=0.125
2023-06-18 12:59:25,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.610e+02 4.791e+02 6.505e+02 1.223e+03, threshold=9.582e+02, percent-clipped=11.0
2023-06-18 12:59:32,052 INFO [train.py:996] (3/4) Epoch 2, batch 5250, loss[loss=0.276, simple_loss=0.3428, pruned_loss=0.1045, over 21495.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3654, pruned_loss=0.1261, over 4284758.34 frames. ], batch size: 195, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 12:59:52,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214470.0, ans=0.1
2023-06-18 13:00:02,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=214530.0, ans=0.2
2023-06-18 13:00:12,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0
2023-06-18 13:01:12,301 INFO [train.py:996] (3/4) Epoch 2, batch 5300, loss[loss=0.3298, simple_loss=0.3748, pruned_loss=0.1424, over 21828.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3664, pruned_loss=0.1286, over 4290053.10 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 13:01:29,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=214770.0, ans=0.125
2023-06-18 13:01:37,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5
2023-06-18 13:01:51,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=214890.0, ans=0.0
2023-06-18 13:02:12,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=214950.0, ans=0.125
2023-06-18 13:02:17,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=214950.0, ans=0.2
2023-06-18 13:02:28,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215010.0, ans=0.1
2023-06-18 13:02:28,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=22.5
2023-06-18 13:02:35,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 3.046e+02 3.546e+02 4.539e+02 8.571e+02, threshold=7.092e+02, percent-clipped=0.0
2023-06-18 13:02:41,301 INFO [train.py:996] (3/4) Epoch 2, batch 5350, loss[loss=0.3912, simple_loss=0.407, pruned_loss=0.1877, over 21637.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3664, pruned_loss=0.1307, over 4298416.18 frames. ], batch size: 471, lr: 1.86e-02, grad_scale: 32.0
2023-06-18 13:03:01,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.19 vs.
limit=22.5 2023-06-18 13:03:09,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=215130.0, ans=0.125 2023-06-18 13:03:13,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215130.0, ans=0.125 2023-06-18 13:03:27,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=215190.0, ans=0.125 2023-06-18 13:03:50,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=215250.0, ans=0.125 2023-06-18 13:03:55,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=215250.0, ans=0.125 2023-06-18 13:04:03,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=215310.0, ans=0.0 2023-06-18 13:04:16,991 INFO [train.py:996] (3/4) Epoch 2, batch 5400, loss[loss=0.3806, simple_loss=0.4649, pruned_loss=0.1482, over 19856.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3664, pruned_loss=0.1328, over 4304948.07 frames. ], batch size: 702, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:04:52,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=215430.0, ans=0.0 2023-06-18 13:05:50,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=215610.0, ans=0.2 2023-06-18 13:05:58,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.179e+02 4.117e+02 5.254e+02 8.433e+02, threshold=8.234e+02, percent-clipped=2.0 2023-06-18 13:06:04,227 INFO [train.py:996] (3/4) Epoch 2, batch 5450, loss[loss=0.3114, simple_loss=0.402, pruned_loss=0.1104, over 20757.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3662, pruned_loss=0.1282, over 4298568.12 frames. ], batch size: 607, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:06:16,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-18 13:06:33,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215730.0, ans=0.125 2023-06-18 13:06:35,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=215730.0, ans=0.125 2023-06-18 13:06:44,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=215790.0, ans=0.0 2023-06-18 13:07:18,403 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:07:36,798 INFO [train.py:996] (3/4) Epoch 2, batch 5500, loss[loss=0.3787, simple_loss=0.4404, pruned_loss=0.1585, over 21453.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3729, pruned_loss=0.1246, over 4285341.41 frames. 
], batch size: 507, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:07:52,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=215970.0, ans=0.015 2023-06-18 13:08:09,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=216030.0, ans=0.125 2023-06-18 13:08:17,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216090.0, ans=0.1 2023-06-18 13:08:23,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216090.0, ans=0.0 2023-06-18 13:08:28,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216090.0, ans=0.1 2023-06-18 13:08:49,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216150.0, ans=0.125 2023-06-18 13:08:52,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=216150.0, ans=0.0 2023-06-18 13:09:13,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 3.111e+02 3.765e+02 4.593e+02 1.085e+03, threshold=7.530e+02, percent-clipped=3.0 2023-06-18 13:09:20,387 INFO [train.py:996] (3/4) Epoch 2, batch 5550, loss[loss=0.2072, simple_loss=0.2806, pruned_loss=0.06692, over 21174.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3666, pruned_loss=0.1192, over 4279063.17 frames. ], batch size: 159, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:09:40,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-18 13:10:12,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=216390.0, ans=0.125 2023-06-18 13:10:16,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216390.0, ans=0.1 2023-06-18 13:10:37,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=216450.0, ans=0.0 2023-06-18 13:10:58,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=216510.0, ans=0.125 2023-06-18 13:11:01,291 INFO [train.py:996] (3/4) Epoch 2, batch 5600, loss[loss=0.3527, simple_loss=0.4319, pruned_loss=0.1368, over 21651.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3624, pruned_loss=0.1157, over 4284369.29 frames. ], batch size: 414, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:11:10,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. 
limit=22.5 2023-06-18 13:11:24,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216630.0, ans=0.1 2023-06-18 13:12:08,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216750.0, ans=0.1 2023-06-18 13:12:19,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=216810.0, ans=0.125 2023-06-18 13:12:25,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.142e+02 3.780e+02 5.340e+02 1.337e+03, threshold=7.560e+02, percent-clipped=11.0 2023-06-18 13:12:36,214 INFO [train.py:996] (3/4) Epoch 2, batch 5650, loss[loss=0.3321, simple_loss=0.3717, pruned_loss=0.1463, over 21867.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3692, pruned_loss=0.1194, over 4282770.19 frames. ], batch size: 124, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:12:52,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=216930.0, ans=0.035 2023-06-18 13:13:44,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=217050.0, ans=10.0 2023-06-18 13:14:11,969 INFO [train.py:996] (3/4) Epoch 2, batch 5700, loss[loss=0.3154, simple_loss=0.3911, pruned_loss=0.1198, over 21782.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3702, pruned_loss=0.122, over 4286136.25 frames. ], batch size: 371, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:15:23,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=217350.0, ans=0.2 2023-06-18 13:15:43,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 3.148e+02 3.823e+02 4.974e+02 1.006e+03, threshold=7.646e+02, percent-clipped=5.0 2023-06-18 13:15:49,886 INFO [train.py:996] (3/4) Epoch 2, batch 5750, loss[loss=0.412, simple_loss=0.5123, pruned_loss=0.1559, over 19739.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3685, pruned_loss=0.1191, over 4275374.87 frames. ], batch size: 702, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:17:20,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=217710.0, ans=0.125 2023-06-18 13:17:40,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=217770.0, ans=0.125 2023-06-18 13:17:41,292 INFO [train.py:996] (3/4) Epoch 2, batch 5800, loss[loss=0.3774, simple_loss=0.4492, pruned_loss=0.1528, over 21656.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3674, pruned_loss=0.1175, over 4267856.77 frames. ], batch size: 414, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:18:08,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=217830.0, ans=0.2 2023-06-18 13:18:08,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217830.0, ans=0.125 2023-06-18 13:19:13,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 3.013e+02 4.071e+02 4.851e+02 8.760e+02, threshold=8.142e+02, percent-clipped=2.0 2023-06-18 13:19:19,674 INFO [train.py:996] (3/4) Epoch 2, batch 5850, loss[loss=0.2194, simple_loss=0.3124, pruned_loss=0.06324, over 21732.00 frames. 
], tot_loss[loss=0.2915, simple_loss=0.3611, pruned_loss=0.111, over 4268599.79 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:19:43,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=218130.0, ans=0.0 2023-06-18 13:19:52,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-18 13:19:57,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:20:13,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=218250.0, ans=0.05 2023-06-18 13:20:50,779 INFO [train.py:996] (3/4) Epoch 2, batch 5900, loss[loss=0.2813, simple_loss=0.3473, pruned_loss=0.1076, over 21878.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3501, pruned_loss=0.1025, over 4272191.82 frames. ], batch size: 371, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:21:09,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=218430.0, ans=6.0 2023-06-18 13:21:16,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=218430.0, ans=0.2 2023-06-18 13:21:33,496 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.627e-02 2023-06-18 13:22:15,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.08 vs. limit=15.0 2023-06-18 13:22:19,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 3.163e+02 4.084e+02 5.462e+02 1.507e+03, threshold=8.168e+02, percent-clipped=5.0 2023-06-18 13:22:25,383 INFO [train.py:996] (3/4) Epoch 2, batch 5950, loss[loss=0.3058, simple_loss=0.3436, pruned_loss=0.134, over 21845.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3518, pruned_loss=0.1079, over 4280080.64 frames. ], batch size: 351, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:22:33,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=218670.0, ans=0.035 2023-06-18 13:23:03,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=218790.0, ans=0.125 2023-06-18 13:23:23,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=218850.0, ans=0.0 2023-06-18 13:23:26,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=218850.0, ans=0.0 2023-06-18 13:23:32,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=218850.0, ans=0.2 2023-06-18 13:23:51,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=8.0 2023-06-18 13:24:04,241 INFO [train.py:996] (3/4) Epoch 2, batch 6000, loss[loss=0.2991, simple_loss=0.3401, pruned_loss=0.1291, over 21788.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3494, pruned_loss=0.1129, over 4282635.57 frames. 
], batch size: 102, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:24:04,241 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 13:24:20,117 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2916, simple_loss=0.3878, pruned_loss=0.09771, over 1796401.00 frames. 2023-06-18 13:24:20,118 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 13:24:56,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=219090.0, ans=10.0 2023-06-18 13:25:12,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=219090.0, ans=0.09899494936611666 2023-06-18 13:25:28,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219150.0, ans=0.1 2023-06-18 13:25:47,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=219210.0, ans=0.0 2023-06-18 13:25:51,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-18 13:25:51,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 3.962e+02 4.700e+02 6.169e+02 1.115e+03, threshold=9.400e+02, percent-clipped=12.0 2023-06-18 13:25:57,775 INFO [train.py:996] (3/4) Epoch 2, batch 6050, loss[loss=0.2266, simple_loss=0.2848, pruned_loss=0.08415, over 21459.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3448, pruned_loss=0.1153, over 4276435.08 frames. ], batch size: 132, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:26:14,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-06-18 13:26:49,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-18 13:26:55,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-18 13:27:21,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=219510.0, ans=0.125 2023-06-18 13:27:25,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=219510.0, ans=0.035 2023-06-18 13:27:32,991 INFO [train.py:996] (3/4) Epoch 2, batch 6100, loss[loss=0.2729, simple_loss=0.3302, pruned_loss=0.1078, over 21456.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3441, pruned_loss=0.1141, over 4278769.54 frames. ], batch size: 194, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:27:52,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-18 13:27:52,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-18 13:28:01,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-18 13:28:21,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-18 13:28:37,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=219750.0, ans=0.0 2023-06-18 13:28:41,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=219750.0, ans=0.125 2023-06-18 13:29:03,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.959e+02 3.646e+02 4.733e+02 1.048e+03, threshold=7.291e+02, percent-clipped=1.0 2023-06-18 13:29:09,573 INFO [train.py:996] (3/4) Epoch 2, batch 6150, loss[loss=0.2988, simple_loss=0.344, pruned_loss=0.1268, over 21519.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3485, pruned_loss=0.1183, over 4275512.38 frames. ], batch size: 212, lr: 1.84e-02, grad_scale: 64.0 2023-06-18 13:29:22,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219870.0, ans=0.1 2023-06-18 13:29:26,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219870.0, ans=0.1 2023-06-18 13:29:38,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-18 13:30:28,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=220110.0, ans=0.125 2023-06-18 13:30:48,828 INFO [train.py:996] (3/4) Epoch 2, batch 6200, loss[loss=0.38, simple_loss=0.4304, pruned_loss=0.1648, over 21894.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3526, pruned_loss=0.12, over 4273834.66 frames. ], batch size: 416, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:31:53,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=220350.0, ans=0.125 2023-06-18 13:32:20,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.171e+02 4.018e+02 5.849e+02 1.001e+03, threshold=8.035e+02, percent-clipped=11.0 2023-06-18 13:32:25,557 INFO [train.py:996] (3/4) Epoch 2, batch 6250, loss[loss=0.2707, simple_loss=0.3631, pruned_loss=0.08915, over 21632.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3561, pruned_loss=0.1185, over 4272561.50 frames. ], batch size: 230, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:33:14,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220590.0, ans=0.125 2023-06-18 13:33:39,992 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:33:47,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.05 vs. limit=10.0 2023-06-18 13:33:55,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-18 13:33:59,269 INFO [train.py:996] (3/4) Epoch 2, batch 6300, loss[loss=0.3085, simple_loss=0.3567, pruned_loss=0.1301, over 21852.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3604, pruned_loss=0.1184, over 4277547.40 frames. 
], batch size: 298, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:34:00,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=22.5 2023-06-18 13:34:11,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.27 vs. limit=15.0 2023-06-18 13:34:16,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=220830.0, ans=0.0 2023-06-18 13:34:31,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=220830.0, ans=0.0 2023-06-18 13:34:37,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220830.0, ans=0.1 2023-06-18 13:34:58,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-18 13:35:29,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=221010.0, ans=0.125 2023-06-18 13:35:30,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.155e+02 3.817e+02 5.474e+02 1.365e+03, threshold=7.634e+02, percent-clipped=9.0 2023-06-18 13:35:35,447 INFO [train.py:996] (3/4) Epoch 2, batch 6350, loss[loss=0.3319, simple_loss=0.385, pruned_loss=0.1395, over 21813.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3651, pruned_loss=0.1236, over 4277897.04 frames. ], batch size: 282, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:36:28,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=221190.0, ans=0.0 2023-06-18 13:36:46,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221250.0, ans=0.1 2023-06-18 13:37:08,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=221310.0, ans=0.0 2023-06-18 13:37:17,729 INFO [train.py:996] (3/4) Epoch 2, batch 6400, loss[loss=0.3538, simple_loss=0.3994, pruned_loss=0.1541, over 21784.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3726, pruned_loss=0.1302, over 4278976.20 frames. ], batch size: 441, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:37:30,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=221370.0, ans=0.125 2023-06-18 13:37:33,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=221370.0, ans=0.2 2023-06-18 13:38:11,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=221490.0, ans=0.125 2023-06-18 13:38:12,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=221490.0, ans=0.125 2023-06-18 13:38:18,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=221550.0, ans=0.07 2023-06-18 13:38:30,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. 
limit=10.0 2023-06-18 13:38:33,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-18 13:38:53,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.333e+02 3.952e+02 5.090e+02 9.873e+02, threshold=7.903e+02, percent-clipped=3.0 2023-06-18 13:38:58,446 INFO [train.py:996] (3/4) Epoch 2, batch 6450, loss[loss=0.2666, simple_loss=0.3361, pruned_loss=0.09857, over 21260.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3734, pruned_loss=0.1284, over 4282734.30 frames. ], batch size: 548, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:39:49,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221790.0, ans=0.0 2023-06-18 13:40:30,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221910.0, ans=0.1 2023-06-18 13:40:35,149 INFO [train.py:996] (3/4) Epoch 2, batch 6500, loss[loss=0.2553, simple_loss=0.3053, pruned_loss=0.1027, over 21564.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3638, pruned_loss=0.1265, over 4280890.53 frames. ], batch size: 263, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:40:55,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=222030.0, ans=0.2 2023-06-18 13:42:03,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=222210.0, ans=0.125 2023-06-18 13:42:06,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.085e+02 3.485e+02 4.361e+02 6.672e+02, threshold=6.971e+02, percent-clipped=0.0 2023-06-18 13:42:10,622 INFO [train.py:996] (3/4) Epoch 2, batch 6550, loss[loss=0.368, simple_loss=0.4125, pruned_loss=0.1617, over 21712.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3613, pruned_loss=0.1247, over 4289061.18 frames. ], batch size: 441, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:42:48,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-18 13:43:36,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=222510.0, ans=0.0 2023-06-18 13:43:48,121 INFO [train.py:996] (3/4) Epoch 2, batch 6600, loss[loss=0.2455, simple_loss=0.2911, pruned_loss=0.09999, over 21522.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3586, pruned_loss=0.1251, over 4278433.18 frames. 
], batch size: 230, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:44:23,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=222630.0, ans=0.0 2023-06-18 13:44:36,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=222690.0, ans=0.0 2023-06-18 13:44:56,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222750.0, ans=0.1 2023-06-18 13:45:04,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=222750.0, ans=0.2 2023-06-18 13:45:04,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=222750.0, ans=0.125 2023-06-18 13:45:19,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.099e+02 3.990e+02 5.465e+02 1.147e+03, threshold=7.980e+02, percent-clipped=13.0 2023-06-18 13:45:28,360 INFO [train.py:996] (3/4) Epoch 2, batch 6650, loss[loss=0.2596, simple_loss=0.3137, pruned_loss=0.1028, over 21974.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3508, pruned_loss=0.1217, over 4271260.25 frames. ], batch size: 103, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:45:59,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-18 13:46:01,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222930.0, ans=0.125 2023-06-18 13:46:01,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-18 13:46:04,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222990.0, ans=0.125 2023-06-18 13:46:06,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.48 vs. limit=22.5 2023-06-18 13:46:07,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=222990.0, ans=0.0 2023-06-18 13:46:38,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=223050.0, ans=0.04949747468305833 2023-06-18 13:46:38,864 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:46:53,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.42 vs. limit=22.5 2023-06-18 13:47:01,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=223110.0, ans=0.125 2023-06-18 13:47:05,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-18 13:47:06,250 INFO [train.py:996] (3/4) Epoch 2, batch 6700, loss[loss=0.3406, simple_loss=0.3837, pruned_loss=0.1488, over 21540.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3459, pruned_loss=0.121, over 4278514.95 frames. 
], batch size: 442, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:47:42,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223290.0, ans=0.1 2023-06-18 13:48:32,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.773e+02 4.498e+02 5.331e+02 9.291e+02, threshold=8.996e+02, percent-clipped=2.0 2023-06-18 13:48:41,535 INFO [train.py:996] (3/4) Epoch 2, batch 6750, loss[loss=0.3274, simple_loss=0.3541, pruned_loss=0.1503, over 21346.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3454, pruned_loss=0.1219, over 4277776.93 frames. ], batch size: 473, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:48:48,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223470.0, ans=0.1 2023-06-18 13:49:06,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=223530.0, ans=10.0 2023-06-18 13:49:26,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223590.0, ans=0.0 2023-06-18 13:49:40,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=223650.0, ans=0.0 2023-06-18 13:50:05,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=223710.0, ans=0.125 2023-06-18 13:50:14,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=223710.0, ans=0.0 2023-06-18 13:50:17,681 INFO [train.py:996] (3/4) Epoch 2, batch 6800, loss[loss=0.3004, simple_loss=0.3415, pruned_loss=0.1296, over 21790.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3481, pruned_loss=0.124, over 4273420.66 frames. ], batch size: 300, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:51:43,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 3.005e+02 3.766e+02 4.478e+02 7.220e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-18 13:51:45,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=224010.0, ans=0.2 2023-06-18 13:51:52,441 INFO [train.py:996] (3/4) Epoch 2, batch 6850, loss[loss=0.3191, simple_loss=0.3458, pruned_loss=0.1462, over 21276.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3457, pruned_loss=0.1251, over 4280655.05 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:52:20,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=224130.0, ans=0.0 2023-06-18 13:52:56,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-18 13:53:23,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=224370.0, ans=0.125 2023-06-18 13:53:28,606 INFO [train.py:996] (3/4) Epoch 2, batch 6900, loss[loss=0.2716, simple_loss=0.3233, pruned_loss=0.11, over 21428.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3473, pruned_loss=0.1257, over 4289309.21 frames. 
], batch size: 194, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:53:46,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=224430.0, ans=0.2 2023-06-18 13:53:55,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=224430.0, ans=0.0 2023-06-18 13:53:57,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=224430.0, ans=0.0 2023-06-18 13:54:38,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-18 13:54:59,754 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:55:00,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.118e+02 3.769e+02 5.122e+02 8.656e+02, threshold=7.539e+02, percent-clipped=2.0 2023-06-18 13:55:05,537 INFO [train.py:996] (3/4) Epoch 2, batch 6950, loss[loss=0.3425, simple_loss=0.3952, pruned_loss=0.1449, over 21691.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.349, pruned_loss=0.1223, over 4292544.96 frames. ], batch size: 298, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:56:09,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.83 vs. limit=12.0 2023-06-18 13:56:10,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-18 13:56:37,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=224910.0, ans=0.125 2023-06-18 13:56:39,907 INFO [train.py:996] (3/4) Epoch 2, batch 7000, loss[loss=0.3364, simple_loss=0.3684, pruned_loss=0.1522, over 21459.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3544, pruned_loss=0.1264, over 4291327.44 frames. ], batch size: 389, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:56:45,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.48 vs. limit=22.5 2023-06-18 13:57:56,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-18 13:58:06,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=225210.0, ans=0.0 2023-06-18 13:58:12,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.463e+02 4.256e+02 5.508e+02 8.252e+02, threshold=8.512e+02, percent-clipped=6.0 2023-06-18 13:58:16,987 INFO [train.py:996] (3/4) Epoch 2, batch 7050, loss[loss=0.3716, simple_loss=0.47, pruned_loss=0.1366, over 19726.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.351, pruned_loss=0.1231, over 4281073.63 frames. 
], batch size: 702, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:58:35,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=225330.0, ans=0.125 2023-06-18 13:58:56,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=225390.0, ans=0.025 2023-06-18 13:59:37,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.91 vs. limit=15.0 2023-06-18 13:59:41,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225510.0, ans=0.125 2023-06-18 13:59:53,608 INFO [train.py:996] (3/4) Epoch 2, batch 7100, loss[loss=0.3167, simple_loss=0.3739, pruned_loss=0.1298, over 21814.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3569, pruned_loss=0.1256, over 4284758.19 frames. ], batch size: 118, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 14:00:23,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-18 14:00:40,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=225690.0, ans=0.05 2023-06-18 14:00:40,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=225690.0, ans=0.125 2023-06-18 14:01:00,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=225750.0, ans=0.2 2023-06-18 14:01:17,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=225810.0, ans=0.0 2023-06-18 14:01:18,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=225810.0, ans=0.0 2023-06-18 14:01:28,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.245e+02 4.248e+02 6.112e+02 1.073e+03, threshold=8.497e+02, percent-clipped=3.0 2023-06-18 14:01:31,269 INFO [train.py:996] (3/4) Epoch 2, batch 7150, loss[loss=0.3469, simple_loss=0.3961, pruned_loss=0.1489, over 21594.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3537, pruned_loss=0.1225, over 4275585.24 frames. ], batch size: 389, lr: 1.81e-02, grad_scale: 16.0 2023-06-18 14:01:43,905 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:02:44,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=226050.0, ans=0.2 2023-06-18 14:02:48,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226050.0, ans=0.1 2023-06-18 14:02:59,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=226110.0, ans=0.125 2023-06-18 14:03:08,077 INFO [train.py:996] (3/4) Epoch 2, batch 7200, loss[loss=0.3018, simple_loss=0.338, pruned_loss=0.1328, over 21628.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3563, pruned_loss=0.1258, over 4283110.14 frames. 
], batch size: 247, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:04:40,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 3.455e+02 4.315e+02 5.205e+02 7.912e+02, threshold=8.629e+02, percent-clipped=0.0 2023-06-18 14:04:47,725 INFO [train.py:996] (3/4) Epoch 2, batch 7250, loss[loss=0.3691, simple_loss=0.38, pruned_loss=0.1791, over 21226.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3516, pruned_loss=0.1263, over 4285302.73 frames. ], batch size: 471, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:05:40,234 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:05:56,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=12.0 2023-06-18 14:06:03,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=226710.0, ans=0.125 2023-06-18 14:06:28,507 INFO [train.py:996] (3/4) Epoch 2, batch 7300, loss[loss=0.2423, simple_loss=0.2944, pruned_loss=0.09507, over 21764.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3441, pruned_loss=0.1242, over 4287020.56 frames. ], batch size: 317, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:06:45,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=226770.0, ans=0.07 2023-06-18 14:07:32,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-18 14:08:03,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.035e+02 3.518e+02 4.361e+02 7.798e+02, threshold=7.035e+02, percent-clipped=0.0 2023-06-18 14:08:05,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=227070.0, ans=0.125 2023-06-18 14:08:07,006 INFO [train.py:996] (3/4) Epoch 2, batch 7350, loss[loss=0.2662, simple_loss=0.3091, pruned_loss=0.1117, over 21243.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.343, pruned_loss=0.1247, over 4286198.25 frames. ], batch size: 608, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:08:49,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=227190.0, ans=0.125 2023-06-18 14:09:00,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227190.0, ans=0.1 2023-06-18 14:09:35,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-18 14:09:46,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=227310.0, ans=0.0 2023-06-18 14:09:49,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227370.0, ans=0.1 2023-06-18 14:09:50,802 INFO [train.py:996] (3/4) Epoch 2, batch 7400, loss[loss=0.3743, simple_loss=0.4307, pruned_loss=0.159, over 21515.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.351, pruned_loss=0.1267, over 4283482.02 frames. 
], batch size: 509, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:10:17,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=227430.0, ans=0.125 2023-06-18 14:10:36,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=227490.0, ans=0.0 2023-06-18 14:11:21,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=227610.0, ans=0.125 2023-06-18 14:11:22,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227610.0, ans=0.1 2023-06-18 14:11:24,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=227610.0, ans=0.0 2023-06-18 14:11:25,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.593e+02 4.529e+02 5.644e+02 1.003e+03, threshold=9.058e+02, percent-clipped=10.0 2023-06-18 14:11:29,109 INFO [train.py:996] (3/4) Epoch 2, batch 7450, loss[loss=0.2864, simple_loss=0.332, pruned_loss=0.1204, over 21609.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3505, pruned_loss=0.1247, over 4278135.66 frames. ], batch size: 231, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:11:38,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=227670.0, ans=0.0 2023-06-18 14:11:58,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=227730.0, ans=0.125 2023-06-18 14:12:37,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=227850.0, ans=0.125 2023-06-18 14:12:48,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227910.0, ans=0.1 2023-06-18 14:13:01,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=227910.0, ans=0.125 2023-06-18 14:13:07,153 INFO [train.py:996] (3/4) Epoch 2, batch 7500, loss[loss=0.2979, simple_loss=0.3907, pruned_loss=0.1025, over 21605.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3566, pruned_loss=0.126, over 4279450.78 frames. ], batch size: 230, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:13:33,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=228030.0, ans=0.125 2023-06-18 14:14:29,648 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:14:37,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=228210.0, ans=0.2 2023-06-18 14:14:42,783 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.252e+02 3.894e+02 4.815e+02 8.018e+02, threshold=7.787e+02, percent-clipped=0.0 2023-06-18 14:14:44,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228270.0, ans=0.1 2023-06-18 14:14:45,726 INFO [train.py:996] (3/4) Epoch 2, batch 7550, loss[loss=0.2658, simple_loss=0.3071, pruned_loss=0.1123, over 20247.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3631, pruned_loss=0.1238, over 4278784.70 frames. 
], batch size: 703, lr: 1.81e-02, grad_scale: 32.0
2023-06-18 14:16:22,912 INFO [train.py:996] (3/4) Epoch 2, batch 7600, loss[loss=0.3234, simple_loss=0.3661, pruned_loss=0.1404, over 21926.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3606, pruned_loss=0.1229, over 4276363.91 frames. ], batch size: 316, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:16:23,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228570.0, ans=0.125
2023-06-18 14:16:23,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228570.0, ans=0.125
2023-06-18 14:16:25,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0
2023-06-18 14:16:31,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0
2023-06-18 14:16:42,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=228630.0, ans=0.0
2023-06-18 14:16:45,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=228630.0, ans=0.2
2023-06-18 14:17:24,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0
2023-06-18 14:17:45,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=228810.0, ans=0.0
2023-06-18 14:17:48,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=228810.0, ans=0.125
2023-06-18 14:17:51,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.725e+02 4.604e+02 5.632e+02 9.928e+02, threshold=9.208e+02, percent-clipped=8.0
2023-06-18 14:17:54,578 INFO [train.py:996] (3/4) Epoch 2, batch 7650, loss[loss=0.3328, simple_loss=0.3664, pruned_loss=0.1496, over 21595.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3609, pruned_loss=0.1253, over 4286904.03 frames. ], batch size: 195, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:17:58,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0
2023-06-18 14:18:13,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=228930.0, ans=0.125
2023-06-18 14:18:19,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=228930.0, ans=0.125
2023-06-18 14:18:19,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=228930.0, ans=0.125
2023-06-18 14:18:45,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0
2023-06-18 14:19:00,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229050.0, ans=0.0
2023-06-18 14:19:05,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=229050.0, ans=0.125
2023-06-18 14:19:27,521 INFO [train.py:996] (3/4) Epoch 2, batch 7700, loss[loss=0.3236, simple_loss=0.3788, pruned_loss=0.1342, over 21763.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3649, pruned_loss=0.1306, over 4288381.05 frames. ], batch size: 332, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:19:36,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229170.0, ans=0.0
2023-06-18 14:19:37,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=229170.0, ans=0.2
2023-06-18 14:19:40,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=229170.0, ans=0.2
2023-06-18 14:19:44,110 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 14:19:47,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0
2023-06-18 14:20:52,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229410.0, ans=0.1
2023-06-18 14:20:53,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=229410.0, ans=0.02
2023-06-18 14:20:59,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.14 vs. limit=10.0
2023-06-18 14:20:59,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.706e+02 4.565e+02 6.512e+02 1.080e+03, threshold=9.129e+02, percent-clipped=5.0
2023-06-18 14:21:02,940 INFO [train.py:996] (3/4) Epoch 2, batch 7750, loss[loss=0.3346, simple_loss=0.4174, pruned_loss=0.1259, over 21690.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3716, pruned_loss=0.1318, over 4285659.05 frames. ], batch size: 298, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:21:43,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=229530.0, ans=0.125
2023-06-18 14:22:12,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229650.0, ans=0.1
2023-06-18 14:22:40,240 INFO [train.py:996] (3/4) Epoch 2, batch 7800, loss[loss=0.2878, simple_loss=0.3424, pruned_loss=0.1166, over 21815.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3702, pruned_loss=0.13, over 4277758.06 frames. ], batch size: 317, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:22:50,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0
2023-06-18 14:22:51,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=229770.0, ans=0.0
2023-06-18 14:23:57,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229950.0, ans=0.0
2023-06-18 14:24:10,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=230010.0, ans=0.04949747468305833
2023-06-18 14:24:13,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.566e+02 4.138e+02 5.286e+02 1.209e+03, threshold=8.275e+02, percent-clipped=5.0
2023-06-18 14:24:16,390 INFO [train.py:996] (3/4) Epoch 2, batch 7850, loss[loss=0.3176, simple_loss=0.3475, pruned_loss=0.1439, over 21373.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3634, pruned_loss=0.129, over 4267591.80 frames. ], batch size: 473, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:24:19,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=230070.0, ans=0.125
2023-06-18 14:24:19,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=230070.0, ans=0.0
2023-06-18 14:24:28,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=230070.0, ans=0.125
2023-06-18 14:24:32,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=230130.0, ans=0.125
2023-06-18 14:25:25,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=230250.0, ans=0.2
2023-06-18 14:25:55,464 INFO [train.py:996] (3/4) Epoch 2, batch 7900, loss[loss=0.2308, simple_loss=0.2933, pruned_loss=0.08419, over 21171.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3584, pruned_loss=0.1268, over 4267582.86 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:26:02,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230370.0, ans=0.1
2023-06-18 14:26:51,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=230490.0, ans=0.0
2023-06-18 14:26:54,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=230490.0, ans=0.0
2023-06-18 14:27:29,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.545e+02 4.486e+02 5.981e+02 1.155e+03, threshold=8.972e+02, percent-clipped=9.0
2023-06-18 14:27:32,831 INFO [train.py:996] (3/4) Epoch 2, batch 7950, loss[loss=0.3715, simple_loss=0.4276, pruned_loss=0.1577, over 21374.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3635, pruned_loss=0.1271, over 4268902.46 frames. ], batch size: 548, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:27:53,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=230670.0, ans=0.09899494936611666
2023-06-18 14:28:04,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=230670.0, ans=0.125
2023-06-18 14:29:26,754 INFO [train.py:996] (3/4) Epoch 2, batch 8000, loss[loss=0.301, simple_loss=0.3537, pruned_loss=0.1241, over 21593.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3714, pruned_loss=0.1314, over 4268739.16 frames. ], batch size: 112, lr: 1.80e-02, grad_scale: 32.0
2023-06-18 14:30:58,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.385e+02 3.226e+02 3.981e+02 5.095e+02 8.184e+02, threshold=7.963e+02, percent-clipped=0.0
2023-06-18 14:31:02,181 INFO [train.py:996] (3/4) Epoch 2, batch 8050, loss[loss=0.2812, simple_loss=0.319, pruned_loss=0.1217, over 21255.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3739, pruned_loss=0.1299, over 4259095.33 frames. ], batch size: 159, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:31:03,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5
2023-06-18 14:31:09,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.56 vs. limit=22.5
2023-06-18 14:31:21,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=231330.0, ans=0.125
2023-06-18 14:32:43,058 INFO [train.py:996] (3/4) Epoch 2, batch 8100, loss[loss=0.3077, simple_loss=0.3565, pruned_loss=0.1294, over 21308.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3717, pruned_loss=0.1307, over 4265101.85 frames. ], batch size: 143, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:32:48,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=231570.0, ans=0.125
2023-06-18 14:33:28,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=231690.0, ans=0.0
2023-06-18 14:34:02,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231750.0, ans=0.1
2023-06-18 14:34:21,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.933e+02 5.153e+02 6.580e+02 1.761e+03, threshold=1.031e+03, percent-clipped=12.0
2023-06-18 14:34:24,582 INFO [train.py:996] (3/4) Epoch 2, batch 8150, loss[loss=0.3466, simple_loss=0.4321, pruned_loss=0.1306, over 21671.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3794, pruned_loss=0.1325, over 4266786.51 frames. ], batch size: 414, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:35:02,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231930.0, ans=0.1
2023-06-18 14:35:26,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=232050.0, ans=0.125
2023-06-18 14:35:39,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=232110.0, ans=0.2
2023-06-18 14:35:56,590 INFO [train.py:996] (3/4) Epoch 2, batch 8200, loss[loss=0.2611, simple_loss=0.3123, pruned_loss=0.105, over 21629.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3723, pruned_loss=0.1298, over 4265120.11 frames. ], batch size: 298, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:36:40,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=232230.0, ans=0.07
2023-06-18 14:36:40,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=12.0
2023-06-18 14:36:42,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=232290.0, ans=0.0
2023-06-18 14:36:48,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.81 vs. limit=15.0
2023-06-18 14:37:02,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=232350.0, ans=0.125
2023-06-18 14:37:03,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0
2023-06-18 14:37:26,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.508e+02 4.438e+02 6.300e+02 1.246e+03, threshold=8.875e+02, percent-clipped=2.0
2023-06-18 14:37:29,920 INFO [train.py:996] (3/4) Epoch 2, batch 8250, loss[loss=0.2511, simple_loss=0.3252, pruned_loss=0.08855, over 21321.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3721, pruned_loss=0.1304, over 4267879.64 frames. ], batch size: 131, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:38:16,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232590.0, ans=0.1
2023-06-18 14:39:08,468 INFO [train.py:996] (3/4) Epoch 2, batch 8300, loss[loss=0.2755, simple_loss=0.3475, pruned_loss=0.1018, over 21682.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3685, pruned_loss=0.1267, over 4266701.47 frames. ], batch size: 263, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:40:38,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 3.024e+02 4.156e+02 5.477e+02 9.498e+02, threshold=8.312e+02, percent-clipped=2.0
2023-06-18 14:40:46,284 INFO [train.py:996] (3/4) Epoch 2, batch 8350, loss[loss=0.2747, simple_loss=0.3274, pruned_loss=0.111, over 21479.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.366, pruned_loss=0.1226, over 4257846.14 frames. ], batch size: 195, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:41:01,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=233070.0, ans=0.0
2023-06-18 14:41:04,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233070.0, ans=0.125
2023-06-18 14:41:23,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=15.0
2023-06-18 14:41:26,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=233190.0, ans=0.0
2023-06-18 14:41:46,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0
2023-06-18 14:42:18,638 INFO [train.py:996] (3/4) Epoch 2, batch 8400, loss[loss=0.2354, simple_loss=0.2793, pruned_loss=0.09578, over 21844.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3635, pruned_loss=0.1183, over 4257420.30 frames. ], batch size: 107, lr: 1.79e-02, grad_scale: 32.0
2023-06-18 14:42:48,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233430.0, ans=0.1
2023-06-18 14:43:37,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=233610.0, ans=0.2
2023-06-18 14:43:38,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0
2023-06-18 14:43:47,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 3.200e+02 3.844e+02 5.205e+02 8.692e+02, threshold=7.689e+02, percent-clipped=1.0
2023-06-18 14:43:55,594 INFO [train.py:996] (3/4) Epoch 2, batch 8450, loss[loss=0.3088, simple_loss=0.3513, pruned_loss=0.1332, over 21768.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3603, pruned_loss=0.1184, over 4255382.85 frames. ], batch size: 124, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:44:41,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=233850.0, ans=0.125
2023-06-18 14:44:57,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=233910.0, ans=0.2
2023-06-18 14:45:22,563 INFO [train.py:996] (3/4) Epoch 2, batch 8500, loss[loss=0.2728, simple_loss=0.3236, pruned_loss=0.111, over 21828.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3573, pruned_loss=0.121, over 4255873.47 frames. ], batch size: 98, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:45:55,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234030.0, ans=0.0
2023-06-18 14:46:30,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=234210.0, ans=0.125
2023-06-18 14:46:56,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.316e+02 3.972e+02 4.532e+02 9.950e+02, threshold=7.945e+02, percent-clipped=2.0
2023-06-18 14:46:57,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=12.0
2023-06-18 14:47:05,332 INFO [train.py:996] (3/4) Epoch 2, batch 8550, loss[loss=0.3824, simple_loss=0.4723, pruned_loss=0.1462, over 20715.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3648, pruned_loss=0.1254, over 4265854.25 frames. ], batch size: 607, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:47:10,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=234270.0, ans=0.125
2023-06-18 14:47:11,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234270.0, ans=0.1
2023-06-18 14:47:32,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=234330.0, ans=0.125
2023-06-18 14:47:44,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=234390.0, ans=0.125
2023-06-18 14:47:49,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=234390.0, ans=0.2
2023-06-18 14:48:40,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=12.0
2023-06-18 14:48:43,739 INFO [train.py:996] (3/4) Epoch 2, batch 8600, loss[loss=0.3143, simple_loss=0.3712, pruned_loss=0.1288, over 21786.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.37, pruned_loss=0.1284, over 4268698.83 frames. ], batch size: 332, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:49:12,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0
2023-06-18 14:49:28,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=234690.0, ans=0.125
2023-06-18 14:49:59,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=234810.0, ans=0.07
2023-06-18 14:50:16,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0
2023-06-18 14:50:17,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.454e+02 4.150e+02 5.051e+02 9.343e+02, threshold=8.300e+02, percent-clipped=1.0
2023-06-18 14:50:18,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234810.0, ans=0.1
2023-06-18 14:50:20,564 INFO [train.py:996] (3/4) Epoch 2, batch 8650, loss[loss=0.3437, simple_loss=0.3971, pruned_loss=0.1451, over 21469.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.378, pruned_loss=0.1298, over 4272747.68 frames. ], batch size: 211, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:50:26,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=234870.0, ans=0.0
2023-06-18 14:50:30,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=234870.0, ans=0.2
2023-06-18 14:50:42,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234930.0, ans=0.1
2023-06-18 14:51:44,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=235110.0, ans=0.2
2023-06-18 14:51:55,559 INFO [train.py:996] (3/4) Epoch 2, batch 8700, loss[loss=0.2727, simple_loss=0.3132, pruned_loss=0.116, over 21485.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3689, pruned_loss=0.1244, over 4266480.64 frames. ], batch size: 230, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:52:00,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=235170.0, ans=0.0
2023-06-18 14:52:14,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=235230.0, ans=0.05
2023-06-18 14:52:20,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=235230.0, ans=0.1
2023-06-18 14:52:37,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=235290.0, ans=0.125
2023-06-18 14:52:55,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=235350.0, ans=0.125
2023-06-18 14:53:29,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.324e+02 3.894e+02 5.284e+02 1.235e+03, threshold=7.788e+02, percent-clipped=5.0
2023-06-18 14:53:32,121 INFO [train.py:996] (3/4) Epoch 2, batch 8750, loss[loss=0.3034, simple_loss=0.3564, pruned_loss=0.1252, over 21827.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3649, pruned_loss=0.125, over 4274034.00 frames. ], batch size: 298, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:53:53,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235530.0, ans=0.0
2023-06-18 14:54:02,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=235590.0, ans=0.2
2023-06-18 14:54:11,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=235590.0, ans=0.1
2023-06-18 14:55:09,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0
2023-06-18 14:55:09,951 INFO [train.py:996] (3/4) Epoch 2, batch 8800, loss[loss=0.3326, simple_loss=0.4067, pruned_loss=0.1292, over 21767.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3753, pruned_loss=0.13, over 4280724.74 frames. ], batch size: 332, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:56:39,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=236010.0, ans=0.07
2023-06-18 14:56:45,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.886e+02 4.865e+02 6.860e+02 1.473e+03, threshold=9.729e+02, percent-clipped=14.0
2023-06-18 14:56:45,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=236010.0, ans=0.2
2023-06-18 14:56:48,265 INFO [train.py:996] (3/4) Epoch 2, batch 8850, loss[loss=0.2679, simple_loss=0.3431, pruned_loss=0.09635, over 21407.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3813, pruned_loss=0.1316, over 4277645.66 frames. ], batch size: 194, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:57:14,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=236130.0, ans=0.125
2023-06-18 14:57:26,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=236190.0, ans=0.05
2023-06-18 14:57:45,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=236190.0, ans=0.2
2023-06-18 14:57:51,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=236250.0, ans=0.0
2023-06-18 14:58:26,530 INFO [train.py:996] (3/4) Epoch 2, batch 8900, loss[loss=0.349, simple_loss=0.4064, pruned_loss=0.1458, over 21578.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3745, pruned_loss=0.1312, over 4272608.42 frames. ], batch size: 441, lr: 1.78e-02, grad_scale: 32.0
2023-06-18 14:58:47,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0
2023-06-18 14:58:55,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0
2023-06-18 14:59:21,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=236490.0, ans=0.125
2023-06-18 14:59:49,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=236610.0, ans=0.125
2023-06-18 14:59:53,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236610.0, ans=0.1
2023-06-18 14:59:55,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=236610.0, ans=0.0
2023-06-18 15:00:03,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.292e+02 4.166e+02 5.426e+02 1.146e+03, threshold=8.333e+02, percent-clipped=5.0
2023-06-18 15:00:05,990 INFO [train.py:996] (3/4) Epoch 2, batch 8950, loss[loss=0.2641, simple_loss=0.3332, pruned_loss=0.09749, over 21780.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3734, pruned_loss=0.1291, over 4275078.60 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:00:13,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=15.0
2023-06-18 15:00:20,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=236670.0, ans=0.0
2023-06-18 15:00:45,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=236730.0, ans=0.0
2023-06-18 15:01:02,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=236790.0, ans=0.0
2023-06-18 15:01:19,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=236850.0, ans=0.125
2023-06-18 15:01:36,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=236910.0, ans=0.125
2023-06-18 15:01:42,178 INFO [train.py:996] (3/4) Epoch 2, batch 9000, loss[loss=0.2822, simple_loss=0.3401, pruned_loss=0.1121, over 21685.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3663, pruned_loss=0.1282, over 4270215.16 frames. ], batch size: 333, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:01:42,179 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-18 15:01:54,617 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.3147, 3.5283, 1.9271, 2.1770], device='cuda:3')
2023-06-18 15:02:02,154 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2979, simple_loss=0.3967, pruned_loss=0.09958, over 1796401.00 frames.
2023-06-18 15:02:02,155 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-18 15:02:36,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0
2023-06-18 15:02:57,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0
2023-06-18 15:03:05,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237150.0, ans=0.1
2023-06-18 15:03:17,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0
2023-06-18 15:03:17,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=237210.0, ans=0.0
2023-06-18 15:03:36,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 3.310e+02 4.099e+02 5.036e+02 9.465e+02, threshold=8.198e+02, percent-clipped=3.0
2023-06-18 15:03:39,407 INFO [train.py:996] (3/4) Epoch 2, batch 9050, loss[loss=0.3249, simple_loss=0.3806, pruned_loss=0.1346, over 21719.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3617, pruned_loss=0.1242, over 4278638.98 frames. ], batch size: 332, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:04:35,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=237390.0, ans=0.2
2023-06-18 15:05:06,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=237510.0, ans=0.2
2023-06-18 15:05:23,356 INFO [train.py:996] (3/4) Epoch 2, batch 9100, loss[loss=0.2956, simple_loss=0.3644, pruned_loss=0.1134, over 21271.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3705, pruned_loss=0.1296, over 4280741.67 frames. ], batch size: 159, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:05:36,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0
2023-06-18 15:05:49,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=237630.0, ans=0.125
2023-06-18 15:05:53,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=237630.0, ans=0.2
2023-06-18 15:06:14,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237750.0, ans=0.1
2023-06-18 15:06:58,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 3.002e+02 3.899e+02 5.912e+02 1.285e+03, threshold=7.799e+02, percent-clipped=7.0
2023-06-18 15:07:05,470 INFO [train.py:996] (3/4) Epoch 2, batch 9150, loss[loss=0.3126, simple_loss=0.3839, pruned_loss=0.1207, over 21740.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3743, pruned_loss=0.1259, over 4280905.44 frames. ], batch size: 298, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:07:15,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=237870.0, ans=0.05
2023-06-18 15:07:35,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=237990.0, ans=0.5
2023-06-18 15:07:39,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5
2023-06-18 15:08:43,035 INFO [train.py:996] (3/4) Epoch 2, batch 9200, loss[loss=0.3461, simple_loss=0.4064, pruned_loss=0.1429, over 21630.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3762, pruned_loss=0.1244, over 4274842.48 frames. ], batch size: 389, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:08:49,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=238170.0, ans=0.0
2023-06-18 15:09:01,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5
2023-06-18 15:09:30,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=238290.0, ans=0.125
2023-06-18 15:10:17,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.265e+02 3.893e+02 4.706e+02 1.094e+03, threshold=7.786e+02, percent-clipped=2.0
2023-06-18 15:10:18,916 INFO [train.py:996] (3/4) Epoch 2, batch 9250, loss[loss=0.342, simple_loss=0.3661, pruned_loss=0.1589, over 21257.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3772, pruned_loss=0.1296, over 4276405.03 frames. ], batch size: 471, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:10:22,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=238470.0, ans=0.125
2023-06-18 15:10:49,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238590.0, ans=0.1
2023-06-18 15:11:59,718 INFO [train.py:996] (3/4) Epoch 2, batch 9300, loss[loss=0.2926, simple_loss=0.3662, pruned_loss=0.1095, over 21557.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3715, pruned_loss=0.1289, over 4274499.14 frames. ], batch size: 230, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:12:09,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238770.0, ans=0.1
2023-06-18 15:13:15,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=238950.0, ans=0.125
2023-06-18 15:13:32,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=239010.0, ans=0.125
2023-06-18 15:13:37,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.726e+02 4.567e+02 5.377e+02 1.117e+03, threshold=9.135e+02, percent-clipped=5.0
2023-06-18 15:13:38,728 INFO [train.py:996] (3/4) Epoch 2, batch 9350, loss[loss=0.405, simple_loss=0.4442, pruned_loss=0.1829, over 21483.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3782, pruned_loss=0.1303, over 4276840.40 frames. ], batch size: 471, lr: 1.77e-02, grad_scale: 32.0
2023-06-18 15:13:52,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=239070.0, ans=0.125
2023-06-18 15:14:25,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0
2023-06-18 15:14:48,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239250.0, ans=0.1
2023-06-18 15:15:17,460 INFO [train.py:996] (3/4) Epoch 2, batch 9400, loss[loss=0.2608, simple_loss=0.3169, pruned_loss=0.1024, over 21634.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3792, pruned_loss=0.131, over 4283298.92 frames. ], batch size: 298, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:15:20,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=239370.0, ans=0.125
2023-06-18 15:15:20,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=239370.0, ans=0.125
2023-06-18 15:15:34,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=239370.0, ans=0.125
2023-06-18 15:15:40,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239430.0, ans=0.1
2023-06-18 15:16:45,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239610.0, ans=0.125
2023-06-18 15:16:53,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.296e+02 4.208e+02 5.207e+02 1.060e+03, threshold=8.416e+02, percent-clipped=2.0
2023-06-18 15:16:54,627 INFO [train.py:996] (3/4) Epoch 2, batch 9450, loss[loss=0.2602, simple_loss=0.3039, pruned_loss=0.1082, over 21484.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3685, pruned_loss=0.1288, over 4277864.19 frames. ], batch size: 195, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:16:55,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=239670.0, ans=0.125
2023-06-18 15:17:09,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=239670.0, ans=0.0
2023-06-18 15:18:01,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=239850.0, ans=0.2
2023-06-18 15:18:10,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=239850.0, ans=0.0
2023-06-18 15:18:27,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0
2023-06-18 15:18:31,534 INFO [train.py:996] (3/4) Epoch 2, batch 9500, loss[loss=0.3001, simple_loss=0.3575, pruned_loss=0.1214, over 21842.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3633, pruned_loss=0.1273, over 4267505.02 frames. ], batch size: 316, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:18:34,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0
2023-06-18 15:19:38,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5
2023-06-18 15:19:44,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=240150.0, ans=0.125
2023-06-18 15:19:48,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=240210.0, ans=0.125
2023-06-18 15:20:02,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.477e+02 4.438e+02 5.411e+02 9.373e+02, threshold=8.876e+02, percent-clipped=3.0
2023-06-18 15:20:02,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=240270.0, ans=0.025
2023-06-18 15:20:02,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=240270.0, ans=0.2
2023-06-18 15:20:04,079 INFO [train.py:996] (3/4) Epoch 2, batch 9550, loss[loss=0.3309, simple_loss=0.383, pruned_loss=0.1395, over 21379.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.368, pruned_loss=0.1297, over 4264733.47 frames. ], batch size: 131, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:20:38,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240330.0, ans=0.125
2023-06-18 15:21:06,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=240390.0, ans=0.125
2023-06-18 15:21:40,117 INFO [train.py:996] (3/4) Epoch 2, batch 9600, loss[loss=0.3381, simple_loss=0.3766, pruned_loss=0.1498, over 21754.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3699, pruned_loss=0.1313, over 4276259.53 frames. ], batch size: 441, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:23:16,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.171e+02 3.689e+02 4.506e+02 8.293e+02, threshold=7.377e+02, percent-clipped=0.0
2023-06-18 15:23:18,129 INFO [train.py:996] (3/4) Epoch 2, batch 9650, loss[loss=0.3216, simple_loss=0.3788, pruned_loss=0.1322, over 21466.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3703, pruned_loss=0.1318, over 4277896.51 frames. ], batch size: 131, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:23:40,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240870.0, ans=0.125
2023-06-18 15:24:23,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.02 vs. limit=5.0
2023-06-18 15:24:41,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0
2023-06-18 15:24:47,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=241110.0, ans=0.2
2023-06-18 15:24:59,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241170.0, ans=0.1
2023-06-18 15:25:00,592 INFO [train.py:996] (3/4) Epoch 2, batch 9700, loss[loss=0.3337, simple_loss=0.3866, pruned_loss=0.1405, over 21731.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.3731, pruned_loss=0.1312, over 4267592.62 frames. ], batch size: 414, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:25:02,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=241170.0, ans=0.125
2023-06-18 15:26:03,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0
2023-06-18 15:26:07,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=241350.0, ans=0.0
2023-06-18 15:26:35,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.207e+02 3.701e+02 4.556e+02 8.027e+02, threshold=7.401e+02, percent-clipped=3.0
2023-06-18 15:26:37,173 INFO [train.py:996] (3/4) Epoch 2, batch 9750, loss[loss=0.2905, simple_loss=0.3317, pruned_loss=0.1246, over 21863.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3671, pruned_loss=0.1296, over 4261035.33 frames. ], batch size: 98, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:26:37,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=241470.0, ans=0.0
2023-06-18 15:27:15,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=241590.0, ans=0.0
2023-06-18 15:27:16,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5
2023-06-18 15:27:25,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=241590.0, ans=0.125
2023-06-18 15:28:01,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241710.0, ans=0.1
2023-06-18 15:28:08,542 INFO [train.py:996] (3/4) Epoch 2, batch 9800, loss[loss=0.2927, simple_loss=0.3526, pruned_loss=0.1164, over 21867.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3643, pruned_loss=0.1286, over 4253633.91 frames. ], batch size: 107, lr: 1.76e-02, grad_scale: 32.0
2023-06-18 15:29:02,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=22.5
2023-06-18 15:29:10,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5
2023-06-18 15:29:42,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=242010.0, ans=0.125
2023-06-18 15:29:43,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.313e+02 4.009e+02 5.228e+02 9.511e+02, threshold=8.018e+02, percent-clipped=4.0
2023-06-18 15:29:44,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=242070.0, ans=0.125
2023-06-18 15:29:45,199 INFO [train.py:996] (3/4) Epoch 2, batch 9850, loss[loss=0.2946, simple_loss=0.3356, pruned_loss=0.1268, over 21758.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3604, pruned_loss=0.1285, over 4252912.42 frames. ], batch size: 415, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:31:22,215 INFO [train.py:996] (3/4) Epoch 2, batch 9900, loss[loss=0.4391, simple_loss=0.5123, pruned_loss=0.183, over 19707.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3577, pruned_loss=0.1277, over 4257070.71 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:32:16,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5
2023-06-18 15:32:23,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5
2023-06-18 15:32:33,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=242550.0, ans=0.125
2023-06-18 15:32:33,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=242550.0, ans=0.0
2023-06-18 15:32:41,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=242550.0, ans=0.125
2023-06-18 15:33:02,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.487e+02 4.462e+02 5.702e+02 1.060e+03, threshold=8.923e+02, percent-clipped=2.0
2023-06-18 15:33:03,926 INFO [train.py:996] (3/4) Epoch 2, batch 9950, loss[loss=0.2964, simple_loss=0.3337, pruned_loss=0.1295, over 21380.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3602, pruned_loss=0.1305, over 4256517.55 frames. ], batch size: 194, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:33:32,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=242730.0, ans=0.2
2023-06-18 15:33:59,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=242850.0, ans=0.0
2023-06-18 15:34:12,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5
2023-06-18 15:34:20,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242910.0, ans=0.125
2023-06-18 15:34:20,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242910.0, ans=0.125
2023-06-18 15:34:41,498 INFO [train.py:996] (3/4) Epoch 2, batch 10000, loss[loss=0.3347, simple_loss=0.3874, pruned_loss=0.141, over 21632.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3568, pruned_loss=0.1283, over 4251026.70 frames. ], batch size: 415, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:34:42,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=242970.0, ans=0.125
2023-06-18 15:36:14,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.362e+02 4.103e+02 5.165e+02 9.257e+02, threshold=8.205e+02, percent-clipped=2.0
2023-06-18 15:36:16,250 INFO [train.py:996] (3/4) Epoch 2, batch 10050, loss[loss=0.2977, simple_loss=0.3497, pruned_loss=0.1228, over 21617.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3596, pruned_loss=0.1295, over 4254610.59 frames. ], batch size: 391, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:36:33,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=243270.0, ans=0.125
2023-06-18 15:36:38,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=243270.0, ans=0.125
2023-06-18 15:37:23,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=243450.0, ans=0.0
2023-06-18 15:38:03,627 INFO [train.py:996] (3/4) Epoch 2, batch 10100, loss[loss=0.2725, simple_loss=0.3241, pruned_loss=0.1104, over 21472.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3529, pruned_loss=0.1248, over 4261358.07 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:38:21,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243630.0, ans=0.1
2023-06-18 15:38:56,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243750.0, ans=0.1
2023-06-18 15:39:02,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2023-06-18 15:39:08,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=243750.0, ans=0.125
2023-06-18 15:39:33,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=243810.0, ans=0.125
2023-06-18 15:39:34,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=15.0
2023-06-18 15:39:39,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.265e+02 3.952e+02 5.116e+02 8.346e+02, threshold=7.904e+02, percent-clipped=1.0
2023-06-18 15:39:41,246 INFO [train.py:996] (3/4) Epoch 2, batch 10150, loss[loss=0.3234, simple_loss=0.3578, pruned_loss=0.1445, over 21813.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3615, pruned_loss=0.1294, over 4266325.90 frames. ], batch size: 98, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:40:17,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=243990.0, ans=0.2
2023-06-18 15:40:43,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=244050.0, ans=0.07
2023-06-18 15:41:02,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244110.0, ans=0.1
2023-06-18 15:41:04,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=244110.0, ans=0.125
2023-06-18 15:41:07,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=244110.0, ans=0.125
2023-06-18 15:41:19,207 INFO [train.py:996] (3/4) Epoch 2, batch 10200, loss[loss=0.275, simple_loss=0.3481, pruned_loss=0.101, over 21560.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3603, pruned_loss=0.127, over 4268056.92 frames. ], batch size: 389, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:41:48,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244230.0, ans=0.1
2023-06-18 15:42:17,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=244350.0, ans=0.2
2023-06-18 15:42:25,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0
2023-06-18 15:42:43,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=244410.0, ans=0.125
2023-06-18 15:42:54,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.941e+02 3.489e+02 4.418e+02 6.706e+02, threshold=6.977e+02, percent-clipped=0.0
2023-06-18 15:42:56,139 INFO [train.py:996] (3/4) Epoch 2, batch 10250, loss[loss=0.2379, simple_loss=0.3164, pruned_loss=0.07967, over 21501.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3529, pruned_loss=0.1188, over 4263025.73 frames. ], batch size: 195, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:43:08,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=244470.0, ans=0.125
2023-06-18 15:44:01,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=244650.0, ans=0.2
2023-06-18 15:44:34,297 INFO [train.py:996] (3/4) Epoch 2, batch 10300, loss[loss=0.31, simple_loss=0.3836, pruned_loss=0.1182, over 21812.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3574, pruned_loss=0.1213, over 4270771.00 frames. ], batch size: 282, lr: 1.75e-02, grad_scale: 32.0
2023-06-18 15:45:00,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=244830.0, ans=0.0
2023-06-18 15:45:10,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=244890.0, ans=0.125
2023-06-18 15:45:18,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-18 15:45:31,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=244890.0, ans=0.125
2023-06-18 15:46:17,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.346e+02 4.331e+02 5.577e+02 1.197e+03, threshold=8.662e+02, percent-clipped=10.0
2023-06-18 15:46:18,810 INFO [train.py:996] (3/4) Epoch 2, batch 10350, loss[loss=0.3508, simple_loss=0.4023, pruned_loss=0.1496, over 21453.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3581, pruned_loss=0.1207, over 4274120.39 frames. ], batch size: 471, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:46:29,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=245070.0, ans=0.0
2023-06-18 15:46:40,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=245130.0, ans=0.0
2023-06-18 15:47:46,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=245310.0, ans=0.125
2023-06-18 15:48:00,714 INFO [train.py:996] (3/4) Epoch 2, batch 10400, loss[loss=0.3389, simple_loss=0.3911, pruned_loss=0.1434, over 20727.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3479, pruned_loss=0.1166, over 4271011.82 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:48:42,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0
2023-06-18 15:49:40,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.447e+02 4.106e+02 4.896e+02 8.870e+02, threshold=8.213e+02, percent-clipped=2.0
2023-06-18 15:49:42,065 INFO [train.py:996] (3/4) Epoch 2, batch 10450, loss[loss=0.3107, simple_loss=0.3725, pruned_loss=0.1245, over 21640.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3532, pruned_loss=0.1208, over 4274934.96 frames. ], batch size: 263, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:49:42,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0
2023-06-18 15:50:23,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=245730.0, ans=0.125
2023-06-18 15:50:23,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0
2023-06-18 15:50:28,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0
2023-06-18 15:51:19,826 INFO [train.py:996] (3/4) Epoch 2, batch 10500, loss[loss=0.2799, simple_loss=0.3307, pruned_loss=0.1146, over 21563.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3546, pruned_loss=0.1209, over 4268113.57 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:51:58,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=246030.0, ans=0.0
2023-06-18 15:52:30,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=246150.0, ans=0.125
2023-06-18 15:52:30,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=246150.0, ans=0.125
2023-06-18 15:52:31,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=246150.0, ans=0.125
2023-06-18 15:52:54,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.201e+02 3.705e+02 4.440e+02 6.098e+02, threshold=7.409e+02, percent-clipped=0.0
2023-06-18 15:52:55,942 INFO [train.py:996] (3/4) Epoch 2, batch 10550, loss[loss=0.2832, simple_loss=0.3248, pruned_loss=0.1207, over 21659.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3495, pruned_loss=0.1209, over 4271207.44 frames. ], batch size: 282, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:53:12,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5
2023-06-18 15:53:18,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0
2023-06-18 15:54:22,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=246510.0, ans=0.125
2023-06-18 15:54:32,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=246570.0, ans=0.125
2023-06-18 15:54:33,561 INFO [train.py:996] (3/4) Epoch 2, batch 10600, loss[loss=0.2636, simple_loss=0.353, pruned_loss=0.08707, over 21617.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3448, pruned_loss=0.1179, over 4277938.88 frames. ], batch size: 389, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:55:45,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=246750.0, ans=0.0
2023-06-18 15:55:52,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0
2023-06-18 15:56:09,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=246810.0, ans=0.125
2023-06-18 15:56:22,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 3.234e+02 3.580e+02 4.539e+02 8.323e+02, threshold=7.159e+02, percent-clipped=4.0
2023-06-18 15:56:23,894 INFO [train.py:996] (3/4) Epoch 2, batch 10650, loss[loss=0.3208, simple_loss=0.3935, pruned_loss=0.124, over 21647.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3468, pruned_loss=0.1152, over 4278035.73 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:56:54,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=246930.0, ans=0.125
2023-06-18 15:56:55,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5
2023-06-18 15:57:12,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=246990.0, ans=0.0
2023-06-18 15:58:01,627 INFO [train.py:996] (3/4) Epoch 2, batch 10700, loss[loss=0.2707, simple_loss=0.3309, pruned_loss=0.1053, over 21409.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3463, pruned_loss=0.1157, over 4255397.36 frames. ], batch size: 211, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 15:58:38,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0
2023-06-18 15:59:29,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247410.0, ans=0.1
2023-06-18 15:59:32,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0
2023-06-18 15:59:39,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=247410.0, ans=0.0
2023-06-18 15:59:43,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.410e+02 4.130e+02 4.973e+02 8.640e+02, threshold=8.260e+02, percent-clipped=4.0
2023-06-18 15:59:44,830 INFO [train.py:996] (3/4) Epoch 2, batch 10750, loss[loss=0.3292, simple_loss=0.4155, pruned_loss=0.1214, over 21710.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3606, pruned_loss=0.1232, over 4261840.04 frames. ], batch size: 414, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 16:00:08,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247530.0, ans=0.1
2023-06-18 16:00:14,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=247530.0, ans=0.125
2023-06-18 16:00:18,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=247530.0, ans=0.2
2023-06-18 16:00:57,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=247650.0, ans=0.125
2023-06-18 16:01:00,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=247650.0, ans=0.125
2023-06-18 16:01:02,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0
2023-06-18 16:01:30,540 INFO [train.py:996] (3/4) Epoch 2, batch 10800, loss[loss=0.3137, simple_loss=0.3744, pruned_loss=0.1265, over 21721.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3646, pruned_loss=0.1228, over 4259283.70 frames. ], batch size: 332, lr: 1.74e-02, grad_scale: 32.0
2023-06-18 16:01:41,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=247770.0, ans=0.2
2023-06-18 16:01:51,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=247830.0, ans=0.125
2023-06-18 16:01:54,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=247830.0, ans=0.125
2023-06-18 16:02:37,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=247950.0, ans=0.125
2023-06-18 16:03:08,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.160e+02 3.815e+02 4.913e+02 8.496e+02, threshold=7.629e+02, percent-clipped=1.0
2023-06-18 16:03:08,349 INFO [train.py:996] (3/4) Epoch 2, batch 10850, loss[loss=0.2869, simple_loss=0.337, pruned_loss=0.1183, over 21573.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3653, pruned_loss=0.1237, over 4251956.50 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 16.0
2023-06-18 16:03:38,180 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-18 16:03:41,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=248190.0, ans=0.125
2023-06-18 16:03:44,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.53 vs. limit=15.0
2023-06-18 16:04:15,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=248250.0, ans=0.0
2023-06-18 16:04:46,955 INFO [train.py:996] (3/4) Epoch 2, batch 10900, loss[loss=0.2862, simple_loss=0.3353, pruned_loss=0.1185, over 20753.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3594, pruned_loss=0.1211, over 4247110.68 frames. ], batch size: 607, lr: 1.73e-02, grad_scale: 16.0
2023-06-18 16:04:50,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248370.0, ans=0.1
2023-06-18 16:04:51,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=248370.0, ans=0.125
2023-06-18 16:04:55,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=248370.0, ans=0.2
2023-06-18 16:05:04,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248430.0, ans=0.1
2023-06-18 16:05:21,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248490.0, ans=0.125
2023-06-18 16:05:44,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=248550.0, ans=0.0
2023-06-18 16:06:10,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0
2023-06-18 16:06:11,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=248610.0, ans=0.2
2023-06-18 16:06:18,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.990e+02 3.670e+02 4.688e+02 1.000e+03, threshold=7.341e+02, percent-clipped=2.0
2023-06-18 16:06:18,645 INFO [train.py:996] (3/4) Epoch 2, batch 10950, loss[loss=0.3042, simple_loss=0.342, pruned_loss=0.1332, over 21993.00 frames. ], tot_loss[loss=0.296, simple_loss=0.354, pruned_loss=0.1191, over 4252681.88 frames. ], batch size: 103, lr: 1.73e-02, grad_scale: 16.0
2023-06-18 16:07:55,482 INFO [train.py:996] (3/4) Epoch 2, batch 11000, loss[loss=0.2589, simple_loss=0.3194, pruned_loss=0.09918, over 21835.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3544, pruned_loss=0.1216, over 4258415.62 frames. ], batch size: 98, lr: 1.73e-02, grad_scale: 16.0
2023-06-18 16:08:03,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=248970.0, ans=0.2
2023-06-18 16:08:33,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=249090.0, ans=0.0
2023-06-18 16:08:34,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=249090.0, ans=0.125
2023-06-18 16:08:51,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249090.0, ans=0.1
2023-06-18 16:08:55,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249150.0, ans=0.1
2023-06-18 16:09:27,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0
2023-06-18 16:09:32,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.446e+02 4.232e+02 5.447e+02 9.802e+02, threshold=8.463e+02, percent-clipped=9.0
2023-06-18 16:09:32,805 INFO [train.py:996] (3/4) Epoch 2, batch 11050, loss[loss=0.2719, simple_loss=0.3161, pruned_loss=0.1139, over 21488.00 frames.
], tot_loss[loss=0.2991, simple_loss=0.3528, pruned_loss=0.1227, over 4259828.73 frames. ], batch size: 212, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:09:38,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. limit=6.0 2023-06-18 16:09:51,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=249330.0, ans=0.2 2023-06-18 16:09:55,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=249330.0, ans=0.125 2023-06-18 16:10:02,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=249390.0, ans=0.125 2023-06-18 16:10:31,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-18 16:11:10,287 INFO [train.py:996] (3/4) Epoch 2, batch 11100, loss[loss=0.2867, simple_loss=0.3395, pruned_loss=0.1169, over 21726.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3511, pruned_loss=0.1236, over 4249636.71 frames. ], batch size: 351, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:12:29,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=249750.0, ans=0.125 2023-06-18 16:12:44,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249810.0, ans=0.125 2023-06-18 16:12:47,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.019e+02 3.669e+02 4.475e+02 9.197e+02, threshold=7.338e+02, percent-clipped=1.0 2023-06-18 16:12:47,671 INFO [train.py:996] (3/4) Epoch 2, batch 11150, loss[loss=0.2844, simple_loss=0.3515, pruned_loss=0.1087, over 21672.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3497, pruned_loss=0.1227, over 4248276.29 frames. ], batch size: 332, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:12:51,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=249870.0, ans=0.0 2023-06-18 16:13:26,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=249990.0, ans=0.2 2023-06-18 16:14:23,554 INFO [train.py:996] (3/4) Epoch 2, batch 11200, loss[loss=0.3073, simple_loss=0.3377, pruned_loss=0.1385, over 21550.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3471, pruned_loss=0.122, over 4257850.49 frames. ], batch size: 443, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:14:25,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250170.0, ans=0.125 2023-06-18 16:14:25,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=250170.0, ans=0.0 2023-06-18 16:14:38,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. 
limit=22.5 2023-06-18 16:14:51,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=250230.0, ans=0.2 2023-06-18 16:15:59,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.566e+02 4.194e+02 5.517e+02 1.156e+03, threshold=8.389e+02, percent-clipped=11.0 2023-06-18 16:15:59,467 INFO [train.py:996] (3/4) Epoch 2, batch 11250, loss[loss=0.2781, simple_loss=0.331, pruned_loss=0.1126, over 21176.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3474, pruned_loss=0.1215, over 4251005.04 frames. ], batch size: 176, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:16:23,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-18 16:16:42,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=250590.0, ans=0.0 2023-06-18 16:17:36,314 INFO [train.py:996] (3/4) Epoch 2, batch 11300, loss[loss=0.2763, simple_loss=0.3307, pruned_loss=0.1109, over 21870.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3482, pruned_loss=0.1211, over 4256472.29 frames. ], batch size: 118, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:17:53,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250830.0, ans=0.125 2023-06-18 16:18:56,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=251010.0, ans=0.125 2023-06-18 16:19:13,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.183e+02 3.739e+02 4.623e+02 9.049e+02, threshold=7.478e+02, percent-clipped=1.0 2023-06-18 16:19:13,173 INFO [train.py:996] (3/4) Epoch 2, batch 11350, loss[loss=0.2878, simple_loss=0.352, pruned_loss=0.1118, over 21768.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3475, pruned_loss=0.1195, over 4262590.15 frames. ], batch size: 124, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:20:23,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=251250.0, ans=0.0 2023-06-18 16:20:53,113 INFO [train.py:996] (3/4) Epoch 2, batch 11400, loss[loss=0.3792, simple_loss=0.4184, pruned_loss=0.17, over 21752.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3574, pruned_loss=0.125, over 4264998.81 frames. ], batch size: 441, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:21:04,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=251370.0, ans=0.0 2023-06-18 16:21:18,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-18 16:21:19,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=251430.0, ans=0.0 2023-06-18 16:22:00,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=251550.0, ans=0.0 2023-06-18 16:22:11,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251550.0, ans=0.125 2023-06-18 16:22:34,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.522e+02 4.244e+02 5.675e+02 1.170e+03, threshold=8.488e+02, percent-clipped=5.0 2023-06-18 16:22:34,155 INFO [train.py:996] (3/4) Epoch 2, batch 11450, loss[loss=0.2819, simple_loss=0.3436, pruned_loss=0.1101, over 21435.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3593, pruned_loss=0.1251, over 4267172.51 frames. ], batch size: 211, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:23:31,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=251790.0, ans=0.125 2023-06-18 16:24:02,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=251910.0, ans=0.125 2023-06-18 16:24:13,562 INFO [train.py:996] (3/4) Epoch 2, batch 11500, loss[loss=0.2562, simple_loss=0.2896, pruned_loss=0.1114, over 20721.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.361, pruned_loss=0.1253, over 4261732.33 frames. ], batch size: 608, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:25:37,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-18 16:25:52,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.283e+02 4.041e+02 4.776e+02 1.091e+03, threshold=8.082e+02, percent-clipped=3.0 2023-06-18 16:25:52,862 INFO [train.py:996] (3/4) Epoch 2, batch 11550, loss[loss=0.3206, simple_loss=0.3666, pruned_loss=0.1372, over 21438.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3671, pruned_loss=0.1258, over 4257399.19 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:26:25,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-18 16:26:26,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=252330.0, ans=0.2 2023-06-18 16:26:28,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. 
limit=15.0 2023-06-18 16:26:51,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=252390.0, ans=0.125 2023-06-18 16:26:57,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252450.0, ans=0.1 2023-06-18 16:27:03,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=252450.0, ans=0.125 2023-06-18 16:27:09,975 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:27:26,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=252510.0, ans=0.125 2023-06-18 16:27:48,278 INFO [train.py:996] (3/4) Epoch 2, batch 11600, loss[loss=0.2857, simple_loss=0.3592, pruned_loss=0.1061, over 21866.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3812, pruned_loss=0.1267, over 4263016.07 frames. ], batch size: 107, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:28:08,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252630.0, ans=0.1 2023-06-18 16:28:21,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-18 16:29:04,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-18 16:29:14,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252810.0, ans=0.1 2023-06-18 16:29:25,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.666e+02 4.957e+02 6.331e+02 1.126e+03, threshold=9.914e+02, percent-clipped=8.0 2023-06-18 16:29:25,190 INFO [train.py:996] (3/4) Epoch 2, batch 11650, loss[loss=0.291, simple_loss=0.3785, pruned_loss=0.1017, over 21426.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3867, pruned_loss=0.1256, over 4255854.61 frames. ], batch size: 194, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:29:29,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=252870.0, ans=0.0 2023-06-18 16:30:12,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253050.0, ans=0.1 2023-06-18 16:31:02,378 INFO [train.py:996] (3/4) Epoch 2, batch 11700, loss[loss=0.2625, simple_loss=0.3098, pruned_loss=0.1076, over 21630.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.378, pruned_loss=0.1255, over 4257719.36 frames. ], batch size: 282, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:31:36,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=253290.0, ans=0.125 2023-06-18 16:32:38,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.237e+02 3.960e+02 5.340e+02 1.578e+03, threshold=7.920e+02, percent-clipped=3.0 2023-06-18 16:32:38,135 INFO [train.py:996] (3/4) Epoch 2, batch 11750, loss[loss=0.2603, simple_loss=0.3079, pruned_loss=0.1064, over 21988.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3683, pruned_loss=0.125, over 4266664.14 frames. 
], batch size: 103, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:32:40,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253470.0, ans=0.0 2023-06-18 16:32:40,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253470.0, ans=0.125 2023-06-18 16:32:45,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.87 vs. limit=15.0 2023-06-18 16:33:13,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=253590.0, ans=0.2 2023-06-18 16:34:17,513 INFO [train.py:996] (3/4) Epoch 2, batch 11800, loss[loss=0.3344, simple_loss=0.4141, pruned_loss=0.1274, over 21878.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.373, pruned_loss=0.1293, over 4264865.59 frames. ], batch size: 372, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:34:30,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=253770.0, ans=0.0 2023-06-18 16:34:38,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253830.0, ans=0.1 2023-06-18 16:35:23,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=253950.0, ans=0.0 2023-06-18 16:35:23,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.17 vs. limit=22.5 2023-06-18 16:35:57,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.227e+02 4.040e+02 5.078e+02 7.033e+02, threshold=8.080e+02, percent-clipped=0.0 2023-06-18 16:35:57,267 INFO [train.py:996] (3/4) Epoch 2, batch 11850, loss[loss=0.2979, simple_loss=0.3673, pruned_loss=0.1142, over 21732.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3747, pruned_loss=0.1291, over 4267936.33 frames. ], batch size: 298, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:36:24,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-18 16:37:35,361 INFO [train.py:996] (3/4) Epoch 2, batch 11900, loss[loss=0.2582, simple_loss=0.3278, pruned_loss=0.09431, over 21741.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3723, pruned_loss=0.1249, over 4271073.46 frames. ], batch size: 282, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:37:45,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254370.0, ans=0.1 2023-06-18 16:39:10,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=254610.0, ans=0.125 2023-06-18 16:39:13,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.213e+02 3.818e+02 4.903e+02 8.116e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 16:39:13,298 INFO [train.py:996] (3/4) Epoch 2, batch 11950, loss[loss=0.2401, simple_loss=0.3229, pruned_loss=0.07862, over 21770.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3713, pruned_loss=0.1211, over 4265020.15 frames. 
], batch size: 316, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:39:16,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=254670.0, ans=0.125 2023-06-18 16:39:25,720 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:39:35,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=254730.0, ans=0.125 2023-06-18 16:40:02,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=254790.0, ans=0.2 2023-06-18 16:40:43,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254910.0, ans=0.1 2023-06-18 16:40:49,306 INFO [train.py:996] (3/4) Epoch 2, batch 12000, loss[loss=0.3274, simple_loss=0.3696, pruned_loss=0.1426, over 22007.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3646, pruned_loss=0.1178, over 4259905.87 frames. ], batch size: 103, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:40:49,307 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 16:41:05,152 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2926, simple_loss=0.3848, pruned_loss=0.1002, over 1796401.00 frames. 2023-06-18 16:41:05,152 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 16:41:09,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-18 16:41:19,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=255030.0, ans=0.0 2023-06-18 16:42:28,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=255210.0, ans=0.025 2023-06-18 16:42:42,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.604e+02 5.059e+02 6.079e+02 1.381e+03, threshold=1.012e+03, percent-clipped=10.0 2023-06-18 16:42:42,392 INFO [train.py:996] (3/4) Epoch 2, batch 12050, loss[loss=0.3893, simple_loss=0.4082, pruned_loss=0.1852, over 21635.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3616, pruned_loss=0.1208, over 4262662.40 frames. ], batch size: 471, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:42:48,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=255270.0, ans=0.2 2023-06-18 16:44:03,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=255510.0, ans=0.0 2023-06-18 16:44:15,706 INFO [train.py:996] (3/4) Epoch 2, batch 12100, loss[loss=0.3551, simple_loss=0.3963, pruned_loss=0.157, over 21367.00 frames. ], tot_loss[loss=0.31, simple_loss=0.3671, pruned_loss=0.1265, over 4275377.99 frames. 
], batch size: 176, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:44:47,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=255630.0, ans=0.125 2023-06-18 16:44:58,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=255630.0, ans=22.5 2023-06-18 16:45:36,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-18 16:45:42,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=255810.0, ans=0.125 2023-06-18 16:46:01,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.581e+02 3.947e+02 4.820e+02 5.748e+02 9.180e+02, threshold=9.640e+02, percent-clipped=0.0 2023-06-18 16:46:01,060 INFO [train.py:996] (3/4) Epoch 2, batch 12150, loss[loss=0.256, simple_loss=0.3212, pruned_loss=0.09546, over 21230.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.368, pruned_loss=0.1259, over 4274498.81 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:46:59,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=256050.0, ans=0.04949747468305833 2023-06-18 16:46:59,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256050.0, ans=0.1 2023-06-18 16:47:04,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256050.0, ans=0.1 2023-06-18 16:47:10,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=256050.0, ans=0.125 2023-06-18 16:47:30,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256110.0, ans=0.1 2023-06-18 16:47:47,399 INFO [train.py:996] (3/4) Epoch 2, batch 12200, loss[loss=0.3088, simple_loss=0.3448, pruned_loss=0.1364, over 21542.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3644, pruned_loss=0.1258, over 4273394.45 frames. ], batch size: 414, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:48:19,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=256290.0, ans=0.0 2023-06-18 16:48:24,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=256290.0, ans=0.2 2023-06-18 16:48:29,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=256290.0, ans=0.5 2023-06-18 16:48:35,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=256350.0, ans=0.125 2023-06-18 16:48:43,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=256350.0, ans=0.125 2023-06-18 16:49:01,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=256410.0, ans=0.0 2023-06-18 16:49:24,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. 
limit=5.0 2023-06-18 16:49:24,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.962e+02 3.747e+02 4.961e+02 1.098e+03, threshold=7.494e+02, percent-clipped=1.0 2023-06-18 16:49:24,910 INFO [train.py:996] (3/4) Epoch 2, batch 12250, loss[loss=0.2642, simple_loss=0.3298, pruned_loss=0.09927, over 21788.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3565, pruned_loss=0.1209, over 4262007.27 frames. ], batch size: 352, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:50:19,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-18 16:50:47,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.51 vs. limit=10.0 2023-06-18 16:51:02,001 INFO [train.py:996] (3/4) Epoch 2, batch 12300, loss[loss=0.1837, simple_loss=0.248, pruned_loss=0.05967, over 21252.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3478, pruned_loss=0.1126, over 4260840.89 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:52:13,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=257010.0, ans=0.2 2023-06-18 16:52:37,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.926e+02 3.592e+02 4.510e+02 1.066e+03, threshold=7.183e+02, percent-clipped=4.0 2023-06-18 16:52:38,011 INFO [train.py:996] (3/4) Epoch 2, batch 12350, loss[loss=0.3146, simple_loss=0.3606, pruned_loss=0.1344, over 21857.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3511, pruned_loss=0.1125, over 4269849.91 frames. ], batch size: 124, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:52:47,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=257070.0, ans=0.035 2023-06-18 16:52:49,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=257070.0, ans=0.2 2023-06-18 16:52:55,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257130.0, ans=0.125 2023-06-18 16:54:09,374 INFO [train.py:996] (3/4) Epoch 2, batch 12400, loss[loss=0.3103, simple_loss=0.3582, pruned_loss=0.1312, over 21258.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3544, pruned_loss=0.1171, over 4272174.35 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:55:29,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257610.0, ans=0.125 2023-06-18 16:55:43,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.570e+02 4.078e+02 4.958e+02 7.763e+02, threshold=8.156e+02, percent-clipped=1.0 2023-06-18 16:55:43,230 INFO [train.py:996] (3/4) Epoch 2, batch 12450, loss[loss=0.3194, simple_loss=0.3741, pruned_loss=0.1324, over 21685.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3588, pruned_loss=0.122, over 4270791.57 frames. 
], batch size: 389, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:56:21,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=257790.0, ans=0.2 2023-06-18 16:56:23,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=257790.0, ans=0.125 2023-06-18 16:56:37,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-18 16:57:11,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=257910.0, ans=0.125 2023-06-18 16:57:20,742 INFO [train.py:996] (3/4) Epoch 2, batch 12500, loss[loss=0.3796, simple_loss=0.439, pruned_loss=0.1601, over 21412.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3719, pruned_loss=0.1277, over 4276144.91 frames. ], batch size: 131, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:57:22,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=257970.0, ans=0.125 2023-06-18 16:57:33,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=257970.0, ans=0.125 2023-06-18 16:58:06,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258090.0, ans=0.1 2023-06-18 16:58:15,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=258090.0, ans=0.2 2023-06-18 16:58:52,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=258210.0, ans=0.05 2023-06-18 16:58:58,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.350e+02 3.939e+02 4.917e+02 9.519e+02, threshold=7.878e+02, percent-clipped=2.0 2023-06-18 16:58:58,500 INFO [train.py:996] (3/4) Epoch 2, batch 12550, loss[loss=0.3378, simple_loss=0.4382, pruned_loss=0.1187, over 20731.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3796, pruned_loss=0.1312, over 4277907.67 frames. ], batch size: 608, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:59:14,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.39 vs. limit=5.0 2023-06-18 16:59:16,556 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:59:56,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=258390.0, ans=0.2 2023-06-18 17:00:21,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258510.0, ans=0.1 2023-06-18 17:00:28,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-18 17:00:36,874 INFO [train.py:996] (3/4) Epoch 2, batch 12600, loss[loss=0.2352, simple_loss=0.3129, pruned_loss=0.0787, over 21472.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3758, pruned_loss=0.127, over 4274532.46 frames. 
], batch size: 212, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:02:12,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.876e+02 3.494e+02 4.502e+02 7.452e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-18 17:02:12,935 INFO [train.py:996] (3/4) Epoch 2, batch 12650, loss[loss=0.2816, simple_loss=0.3342, pruned_loss=0.1145, over 21754.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3648, pruned_loss=0.1216, over 4271448.67 frames. ], batch size: 247, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:02:25,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=258870.0, ans=0.0 2023-06-18 17:03:25,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-18 17:03:31,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=259110.0, ans=0.04949747468305833 2023-06-18 17:03:49,445 INFO [train.py:996] (3/4) Epoch 2, batch 12700, loss[loss=0.3918, simple_loss=0.4252, pruned_loss=0.1792, over 21442.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3658, pruned_loss=0.1247, over 4269824.27 frames. ], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:04:04,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=259170.0, ans=0.125 2023-06-18 17:04:49,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=259350.0, ans=0.125 2023-06-18 17:05:07,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=259410.0, ans=0.0 2023-06-18 17:05:25,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.415e+02 4.071e+02 4.986e+02 8.988e+02, threshold=8.142e+02, percent-clipped=6.0 2023-06-18 17:05:25,078 INFO [train.py:996] (3/4) Epoch 2, batch 12750, loss[loss=0.2967, simple_loss=0.3521, pruned_loss=0.1206, over 21896.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3689, pruned_loss=0.1253, over 4274136.54 frames. ], batch size: 118, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:05:41,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-18 17:06:19,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=259590.0, ans=0.2 2023-06-18 17:06:56,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=259710.0, ans=0.125 2023-06-18 17:07:08,566 INFO [train.py:996] (3/4) Epoch 2, batch 12800, loss[loss=0.3686, simple_loss=0.3997, pruned_loss=0.1687, over 21871.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3674, pruned_loss=0.1258, over 4280972.80 frames. ], batch size: 414, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:07:24,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=259770.0, ans=0.0 2023-06-18 17:07:35,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. 
limit=6.0 2023-06-18 17:08:46,727 INFO [train.py:996] (3/4) Epoch 2, batch 12850, loss[loss=0.3599, simple_loss=0.4111, pruned_loss=0.1544, over 21748.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3701, pruned_loss=0.1277, over 4281584.86 frames. ], batch size: 441, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:08:48,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 3.179e+02 3.774e+02 4.668e+02 7.829e+02, threshold=7.547e+02, percent-clipped=0.0 2023-06-18 17:09:13,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-18 17:10:37,510 INFO [train.py:996] (3/4) Epoch 2, batch 12900, loss[loss=0.2571, simple_loss=0.3316, pruned_loss=0.09128, over 21619.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3681, pruned_loss=0.1229, over 4278948.12 frames. ], batch size: 263, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:10:45,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=260370.0, ans=0.0 2023-06-18 17:10:59,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260430.0, ans=0.1 2023-06-18 17:11:21,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=260490.0, ans=0.0 2023-06-18 17:12:15,683 INFO [train.py:996] (3/4) Epoch 2, batch 12950, loss[loss=0.3138, simple_loss=0.3696, pruned_loss=0.129, over 21495.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3676, pruned_loss=0.1206, over 4272492.35 frames. ], batch size: 194, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:12:17,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 3.080e+02 3.556e+02 4.378e+02 7.837e+02, threshold=7.111e+02, percent-clipped=1.0 2023-06-18 17:12:26,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=260670.0, ans=0.125 2023-06-18 17:12:38,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260730.0, ans=0.1 2023-06-18 17:13:35,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=260910.0, ans=0.125 2023-06-18 17:13:51,817 INFO [train.py:996] (3/4) Epoch 2, batch 13000, loss[loss=0.3752, simple_loss=0.4227, pruned_loss=0.1639, over 21395.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3692, pruned_loss=0.123, over 4277483.27 frames. ], batch size: 549, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:14:05,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. limit=6.0 2023-06-18 17:14:10,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=261030.0, ans=0.2 2023-06-18 17:14:20,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.95 vs. 
limit=12.0 2023-06-18 17:15:20,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=261210.0, ans=0.125 2023-06-18 17:15:27,567 INFO [train.py:996] (3/4) Epoch 2, batch 13050, loss[loss=0.3382, simple_loss=0.3802, pruned_loss=0.148, over 21951.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3664, pruned_loss=0.1203, over 4273351.43 frames. ], batch size: 333, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:15:28,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-18 17:15:29,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.088e+02 4.287e+02 5.215e+02 1.044e+03, threshold=8.575e+02, percent-clipped=6.0 2023-06-18 17:15:40,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=261270.0, ans=0.125 2023-06-18 17:16:21,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261390.0, ans=0.1 2023-06-18 17:16:55,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261510.0, ans=0.1 2023-06-18 17:17:04,248 INFO [train.py:996] (3/4) Epoch 2, batch 13100, loss[loss=0.341, simple_loss=0.3942, pruned_loss=0.1439, over 21244.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3658, pruned_loss=0.1206, over 4274229.94 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:17:27,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.53 vs. limit=22.5 2023-06-18 17:18:44,151 INFO [train.py:996] (3/4) Epoch 2, batch 13150, loss[loss=0.3274, simple_loss=0.3756, pruned_loss=0.1396, over 20936.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3674, pruned_loss=0.1238, over 4274712.50 frames. ], batch size: 608, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:18:45,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.617e+02 4.529e+02 5.724e+02 9.376e+02, threshold=9.058e+02, percent-clipped=0.0 2023-06-18 17:18:50,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=261870.0, ans=0.0 2023-06-18 17:19:06,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=261930.0, ans=0.0 2023-06-18 17:19:15,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=261930.0, ans=0.125 2023-06-18 17:19:16,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=261930.0, ans=0.0 2023-06-18 17:19:41,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261990.0, ans=0.1 2023-06-18 17:19:59,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=262050.0, ans=0.125 2023-06-18 17:20:18,144 INFO [train.py:996] (3/4) Epoch 2, batch 13200, loss[loss=0.3384, simple_loss=0.3827, pruned_loss=0.1471, over 21936.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3638, pruned_loss=0.1225, over 4270031.90 frames. 
], batch size: 372, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:21:13,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=262290.0, ans=0.0 2023-06-18 17:21:17,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2023-06-18 17:21:59,582 INFO [train.py:996] (3/4) Epoch 2, batch 13250, loss[loss=0.3103, simple_loss=0.36, pruned_loss=0.1304, over 21789.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3643, pruned_loss=0.1253, over 4277025.98 frames. ], batch size: 124, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:22:06,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.187e+02 3.809e+02 4.597e+02 7.682e+02, threshold=7.618e+02, percent-clipped=1.0 2023-06-18 17:22:11,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=262470.0, ans=0.2 2023-06-18 17:22:43,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=262530.0, ans=0.0 2023-06-18 17:23:34,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-18 17:23:43,644 INFO [train.py:996] (3/4) Epoch 2, batch 13300, loss[loss=0.3338, simple_loss=0.3954, pruned_loss=0.1361, over 21755.00 frames. ], tot_loss[loss=0.3081, simple_loss=0.3674, pruned_loss=0.1245, over 4271423.74 frames. ], batch size: 332, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:24:21,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=15.0 2023-06-18 17:25:25,052 INFO [train.py:996] (3/4) Epoch 2, batch 13350, loss[loss=0.3162, simple_loss=0.381, pruned_loss=0.1257, over 21798.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3703, pruned_loss=0.1274, over 4275534.40 frames. ], batch size: 282, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:25:26,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.186e+02 3.887e+02 4.954e+02 1.112e+03, threshold=7.774e+02, percent-clipped=6.0 2023-06-18 17:25:31,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=263070.0, ans=0.0 2023-06-18 17:25:59,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-18 17:27:08,151 INFO [train.py:996] (3/4) Epoch 2, batch 13400, loss[loss=0.3113, simple_loss=0.3629, pruned_loss=0.1298, over 21667.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3724, pruned_loss=0.1303, over 4284335.16 frames. 
], batch size: 263, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:27:38,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=263490.0, ans=0.0 2023-06-18 17:27:51,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=263490.0, ans=0.125 2023-06-18 17:28:03,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=263550.0, ans=0.125 2023-06-18 17:28:45,347 INFO [train.py:996] (3/4) Epoch 2, batch 13450, loss[loss=0.2807, simple_loss=0.3375, pruned_loss=0.1119, over 21668.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3763, pruned_loss=0.134, over 4286285.68 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:28:46,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.497e+02 3.954e+02 4.827e+02 1.042e+03, threshold=7.908e+02, percent-clipped=7.0 2023-06-18 17:28:52,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5 2023-06-18 17:29:00,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263730.0, ans=0.0 2023-06-18 17:29:39,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-18 17:29:42,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=263850.0, ans=0.2 2023-06-18 17:30:10,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-18 17:30:23,554 INFO [train.py:996] (3/4) Epoch 2, batch 13500, loss[loss=0.2602, simple_loss=0.3072, pruned_loss=0.1066, over 21429.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.3638, pruned_loss=0.128, over 4277780.01 frames. ], batch size: 211, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:30:34,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=263970.0, ans=0.125 2023-06-18 17:30:34,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=263970.0, ans=0.0 2023-06-18 17:30:36,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-18 17:30:51,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-18 17:31:11,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-18 17:31:51,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=264210.0, ans=0.0 2023-06-18 17:32:03,898 INFO [train.py:996] (3/4) Epoch 2, batch 13550, loss[loss=0.301, simple_loss=0.3761, pruned_loss=0.1129, over 21453.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3696, pruned_loss=0.128, over 4274621.05 frames. 
], batch size: 194, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:32:05,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.329e+02 4.198e+02 5.480e+02 1.124e+03, threshold=8.396e+02, percent-clipped=8.0 2023-06-18 17:33:26,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=264510.0, ans=0.125 2023-06-18 17:33:31,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=264510.0, ans=0.07 2023-06-18 17:33:38,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=264510.0, ans=0.2 2023-06-18 17:33:42,105 INFO [train.py:996] (3/4) Epoch 2, batch 13600, loss[loss=0.2639, simple_loss=0.3275, pruned_loss=0.1001, over 21750.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3713, pruned_loss=0.1286, over 4273176.82 frames. ], batch size: 247, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:34:05,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=264630.0, ans=0.125 2023-06-18 17:34:10,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=264630.0, ans=0.2 2023-06-18 17:34:13,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=264630.0, ans=0.125 2023-06-18 17:34:38,278 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:35:19,300 INFO [train.py:996] (3/4) Epoch 2, batch 13650, loss[loss=0.2814, simple_loss=0.328, pruned_loss=0.1174, over 21876.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3654, pruned_loss=0.1247, over 4274070.02 frames. ], batch size: 118, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:35:19,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264870.0, ans=0.1 2023-06-18 17:35:20,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.001e+02 3.620e+02 4.450e+02 8.511e+02, threshold=7.240e+02, percent-clipped=1.0 2023-06-18 17:35:47,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=264930.0, ans=10.0 2023-06-18 17:36:34,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=265050.0, ans=0.125 2023-06-18 17:36:36,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-18 17:36:47,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=265110.0, ans=0.2 2023-06-18 17:36:55,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=265110.0, ans=15.0 2023-06-18 17:36:59,385 INFO [train.py:996] (3/4) Epoch 2, batch 13700, loss[loss=0.2976, simple_loss=0.3331, pruned_loss=0.131, over 20239.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3587, pruned_loss=0.1242, over 4277535.89 frames. 
], batch size: 703, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:37:06,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265170.0, ans=0.1 2023-06-18 17:37:15,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=265170.0, ans=0.2 2023-06-18 17:37:30,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=265230.0, ans=0.0 2023-06-18 17:37:57,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=265290.0, ans=0.125 2023-06-18 17:37:59,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265290.0, ans=0.1 2023-06-18 17:38:39,024 INFO [train.py:996] (3/4) Epoch 2, batch 13750, loss[loss=0.2352, simple_loss=0.2924, pruned_loss=0.08905, over 21170.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3546, pruned_loss=0.1224, over 4266405.06 frames. ], batch size: 143, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:38:45,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.651e+02 4.578e+02 5.768e+02 1.165e+03, threshold=9.156e+02, percent-clipped=11.0 2023-06-18 17:38:52,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-18 17:38:55,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=265470.0, ans=0.0 2023-06-18 17:39:50,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=265650.0, ans=0.0 2023-06-18 17:40:22,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265710.0, ans=0.125 2023-06-18 17:40:24,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2023-06-18 17:40:32,081 INFO [train.py:996] (3/4) Epoch 2, batch 13800, loss[loss=0.2664, simple_loss=0.3327, pruned_loss=0.1, over 21092.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3635, pruned_loss=0.1229, over 4256151.20 frames. 
], batch size: 143, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:40:35,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265770.0, ans=0.1 2023-06-18 17:40:52,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=265830.0, ans=0.125 2023-06-18 17:41:01,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=265830.0, ans=0.125 2023-06-18 17:41:07,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=265830.0, ans=0.125 2023-06-18 17:41:10,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=265830.0, ans=0.125 2023-06-18 17:41:18,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265890.0, ans=0.125 2023-06-18 17:42:15,731 INFO [train.py:996] (3/4) Epoch 2, batch 13850, loss[loss=0.3456, simple_loss=0.399, pruned_loss=0.1461, over 21741.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.369, pruned_loss=0.1239, over 4265588.54 frames. ], batch size: 332, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:42:16,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=266070.0, ans=0.05 2023-06-18 17:42:17,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.095e+02 3.814e+02 4.955e+02 1.017e+03, threshold=7.628e+02, percent-clipped=1.0 2023-06-18 17:43:16,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266250.0, ans=0.125 2023-06-18 17:43:53,101 INFO [train.py:996] (3/4) Epoch 2, batch 13900, loss[loss=0.333, simple_loss=0.3761, pruned_loss=0.1449, over 21816.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3742, pruned_loss=0.1291, over 4267689.93 frames. ], batch size: 414, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:44:05,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266370.0, ans=0.1 2023-06-18 17:44:12,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266430.0, ans=0.0 2023-06-18 17:44:52,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=266550.0, ans=0.125 2023-06-18 17:45:35,911 INFO [train.py:996] (3/4) Epoch 2, batch 13950, loss[loss=0.3488, simple_loss=0.3884, pruned_loss=0.1546, over 21866.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3769, pruned_loss=0.1327, over 4278561.07 frames. ], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:45:37,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.843e+02 4.662e+02 6.006e+02 1.294e+03, threshold=9.323e+02, percent-clipped=7.0 2023-06-18 17:45:56,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-18 17:46:09,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.32 vs. 
limit=15.0 2023-06-18 17:46:27,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=266850.0, ans=0.2 2023-06-18 17:47:14,066 INFO [train.py:996] (3/4) Epoch 2, batch 14000, loss[loss=0.2682, simple_loss=0.3496, pruned_loss=0.09341, over 21672.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3699, pruned_loss=0.1277, over 4278096.95 frames. ], batch size: 247, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:47:21,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-18 17:47:31,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=267030.0, ans=0.125 2023-06-18 17:48:20,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-18 17:48:41,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=267210.0, ans=0.125 2023-06-18 17:48:48,746 INFO [train.py:996] (3/4) Epoch 2, batch 14050, loss[loss=0.2686, simple_loss=0.3178, pruned_loss=0.1097, over 21651.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3629, pruned_loss=0.1228, over 4288234.37 frames. ], batch size: 298, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:48:50,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.887e+02 3.656e+02 4.389e+02 9.715e+02, threshold=7.312e+02, percent-clipped=1.0 2023-06-18 17:48:50,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=267270.0, ans=0.0 2023-06-18 17:48:56,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=267270.0, ans=0.95 2023-06-18 17:50:06,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=267510.0, ans=0.125 2023-06-18 17:50:20,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-18 17:50:22,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=267570.0, ans=0.0 2023-06-18 17:50:23,576 INFO [train.py:996] (3/4) Epoch 2, batch 14100, loss[loss=0.2512, simple_loss=0.3073, pruned_loss=0.09756, over 21338.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3564, pruned_loss=0.1226, over 4284138.78 frames. ], batch size: 131, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:50:47,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-18 17:51:48,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=267810.0, ans=0.125 2023-06-18 17:51:50,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-18 17:51:52,568 INFO [train.py:996] (3/4) Epoch 2, batch 14150, loss[loss=0.2583, simple_loss=0.3428, pruned_loss=0.08689, over 21753.00 frames. 
], tot_loss[loss=0.3053, simple_loss=0.3614, pruned_loss=0.1246, over 4277437.35 frames. ], batch size: 332, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:51:59,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.602e+02 4.448e+02 5.500e+02 9.616e+02, threshold=8.896e+02, percent-clipped=7.0 2023-06-18 17:51:59,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=267870.0, ans=0.0 2023-06-18 17:52:21,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-18 17:52:24,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=267930.0, ans=0.2 2023-06-18 17:52:40,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-18 17:53:21,315 INFO [train.py:996] (3/4) Epoch 2, batch 14200, loss[loss=0.306, simple_loss=0.3532, pruned_loss=0.1294, over 21862.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.358, pruned_loss=0.1212, over 4280917.88 frames. ], batch size: 118, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:53:32,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=268170.0, ans=0.125 2023-06-18 17:53:32,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-18 17:54:25,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=12.0 2023-06-18 17:54:45,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=268410.0, ans=0.125 2023-06-18 17:54:54,776 INFO [train.py:996] (3/4) Epoch 2, batch 14250, loss[loss=0.2698, simple_loss=0.3102, pruned_loss=0.1146, over 21474.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.352, pruned_loss=0.1205, over 4272791.27 frames. ], batch size: 230, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:54:56,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 3.115e+02 4.292e+02 5.783e+02 1.043e+03, threshold=8.584e+02, percent-clipped=1.0 2023-06-18 17:55:16,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=268470.0, ans=0.0 2023-06-18 17:55:25,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=268530.0, ans=0.125 2023-06-18 17:56:25,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=268710.0, ans=10.0 2023-06-18 17:56:28,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-18 17:56:33,158 INFO [train.py:996] (3/4) Epoch 2, batch 14300, loss[loss=0.3105, simple_loss=0.3869, pruned_loss=0.117, over 21400.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3582, pruned_loss=0.1211, over 4282090.68 frames. 
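The [scaling.py:182] entries that dominate this log print ScheduledFloat values: module hyperparameters (balancer probs, skip rates, dropout p, scale_min) that are functions of batch_count rather than constants, typically piecewise-linear and already flat by this stage of training, hence the repeated ans=0.125, ans=0.1, ans=0.0. A toy version of the idea with hypothetical breakpoints; the real schedules are declared per module in icefall's scaling.py:

```python
class ScheduledFloat:
    """A float whose value is a piecewise-linear function of the training
    batch count, sketching the '[scaling.py:182] ScheduledFloat: name=...,
    batch_count=..., ans=...' log lines."""

    def __init__(self, *points):
        self.points = sorted(points)   # (batch_count, value) breakpoints
        self.batch_count = 0.0         # updated by the training loop

    def __float__(self):
        b, pts = self.batch_count, self.points
        if b <= pts[0][0]:
            return float(pts[0][1])
        if b >= pts[-1][0]:
            return float(pts[-1][1])
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= b <= x1:  # linear interpolation between breakpoints
                return float(y0 + (b - x0) / (x1 - x0) * (y1 - y0))

# Hypothetical schedule: a skip rate annealed from 0.1 to 0.0 over 20k batches.
skip_rate = ScheduledFloat((0.0, 0.1), (20000.0, 0.0))
skip_rate.batch_count = 264630.0
print(float(skip_rate))  # -> 0.0, like the flat ans=0.0 skip rates above
```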
], batch size: 211, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:56:37,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=15.0 2023-06-18 17:56:42,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=268770.0, ans=0.125 2023-06-18 17:56:44,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=268770.0, ans=0.125 2023-06-18 17:57:15,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=268890.0, ans=0.125 2023-06-18 17:57:46,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=268950.0, ans=0.125 2023-06-18 17:57:59,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269010.0, ans=0.1 2023-06-18 17:58:01,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=269010.0, ans=0.035 2023-06-18 17:58:09,472 INFO [train.py:996] (3/4) Epoch 2, batch 14350, loss[loss=0.3542, simple_loss=0.3984, pruned_loss=0.155, over 21712.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3647, pruned_loss=0.1223, over 4258653.40 frames. ], batch size: 389, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:58:11,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.215e+02 4.287e+02 5.391e+02 1.265e+03, threshold=8.575e+02, percent-clipped=5.0 2023-06-18 17:58:28,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=269070.0, ans=0.125 2023-06-18 17:58:43,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-18 17:59:04,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269250.0, ans=0.125 2023-06-18 17:59:21,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-18 17:59:50,460 INFO [train.py:996] (3/4) Epoch 2, batch 14400, loss[loss=0.3376, simple_loss=0.37, pruned_loss=0.1526, over 21816.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3619, pruned_loss=0.124, over 4265537.32 frames. 
], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:59:57,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269370.0, ans=0.1 2023-06-18 18:00:00,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=269370.0, ans=0.2 2023-06-18 18:00:15,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=269430.0, ans=0.0 2023-06-18 18:00:20,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=269430.0, ans=0.125 2023-06-18 18:00:37,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269490.0, ans=0.1 2023-06-18 18:00:47,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=269550.0, ans=0.0 2023-06-18 18:00:56,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=269550.0, ans=0.125 2023-06-18 18:01:25,574 INFO [train.py:996] (3/4) Epoch 2, batch 14450, loss[loss=0.2522, simple_loss=0.2999, pruned_loss=0.1022, over 21576.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3555, pruned_loss=0.1232, over 4269493.05 frames. ], batch size: 231, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:01:26,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 3.300e+02 3.933e+02 4.836e+02 8.413e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-18 18:01:56,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=269730.0, ans=0.125 2023-06-18 18:02:01,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-18 18:02:14,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-18 18:02:26,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269850.0, ans=0.1 2023-06-18 18:02:51,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=269910.0, ans=0.125 2023-06-18 18:03:01,243 INFO [train.py:996] (3/4) Epoch 2, batch 14500, loss[loss=0.3255, simple_loss=0.3877, pruned_loss=0.1317, over 21828.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3524, pruned_loss=0.1225, over 4270646.37 frames. ], batch size: 371, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:03:50,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270090.0, ans=0.1 2023-06-18 18:04:43,607 INFO [train.py:996] (3/4) Epoch 2, batch 14550, loss[loss=0.3697, simple_loss=0.4161, pruned_loss=0.1616, over 21364.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3573, pruned_loss=0.1246, over 4267572.28 frames. 
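Among those scheduled values, the *_skip_rate entries (attention_skip_rate, conv_skip_rate, ff2/ff3_skip_rate, bypass.skip_rate) control stochastic skipping: on a random fraction of training steps a whole sub-module's contribution is dropped, a layer-dropout style regularizer whose rate is annealed toward zero (most entries here already read ans=0.0, with bypass rates around 0.035 to 0.07). A hedged sketch of the mechanism; the real Zipformer combines this with learned per-channel bypass scales, which the sketch omits:

```python
import torch
import torch.nn as nn

class StochasticSkip(nn.Module):
    """Residual wrapper that skips its sub-module with probability skip_rate
    during training, illustrating the '*_skip_rate' ScheduledFloats above."""

    def __init__(self, module: nn.Module, skip_rate: float = 0.0):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate  # a ScheduledFloat in the real recipe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()).item() < self.skip_rate:
            return x                 # sub-module bypassed on this step
        return x + self.module(x)   # normal residual path

ff = StochasticSkip(nn.Linear(256, 256), skip_rate=0.035)
```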
], batch size: 176, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:04:45,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.923e+02 3.262e+02 3.772e+02 5.838e+02, threshold=6.523e+02, percent-clipped=0.0 2023-06-18 18:06:08,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=270510.0, ans=0.0 2023-06-18 18:06:21,021 INFO [train.py:996] (3/4) Epoch 2, batch 14600, loss[loss=0.3411, simple_loss=0.4076, pruned_loss=0.1372, over 21792.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.367, pruned_loss=0.1312, over 4262971.07 frames. ], batch size: 282, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:06:52,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=270690.0, ans=0.0 2023-06-18 18:07:15,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=270750.0, ans=0.2 2023-06-18 18:07:32,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=270750.0, ans=0.0 2023-06-18 18:07:58,640 INFO [train.py:996] (3/4) Epoch 2, batch 14650, loss[loss=0.2405, simple_loss=0.3237, pruned_loss=0.0787, over 21799.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.366, pruned_loss=0.1276, over 4258601.53 frames. ], batch size: 371, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:08:00,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.284e+02 3.921e+02 4.845e+02 9.187e+02, threshold=7.842e+02, percent-clipped=12.0 2023-06-18 18:09:21,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271110.0, ans=0.0 2023-06-18 18:09:34,782 INFO [train.py:996] (3/4) Epoch 2, batch 14700, loss[loss=0.2516, simple_loss=0.3203, pruned_loss=0.09143, over 21798.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3617, pruned_loss=0.1215, over 4261676.52 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:09:48,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=271230.0, ans=0.125 2023-06-18 18:09:53,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271230.0, ans=0.125 2023-06-18 18:10:01,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-18 18:10:07,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-18 18:10:15,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271290.0, ans=0.125 2023-06-18 18:10:35,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271350.0, ans=0.1 2023-06-18 18:11:13,565 INFO [train.py:996] (3/4) Epoch 2, batch 14750, loss[loss=0.5443, simple_loss=0.5642, pruned_loss=0.2622, over 21430.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3694, pruned_loss=0.1254, over 4265318.11 frames. 
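The [scaling.py:962] Whitening entries compare a per-module statistic against a scheduled limit (metric=9.11 vs. limit=15.0 just above); a corrective gradient is applied only when the metric exceeds the limit, nudging feature covariances toward a multiple of the identity. One plausible form of the metric, assumed here to be mean(diag(C @ C)) / mean(diag(C))**2 over the grouped feature covariance C, which equals 1.0 for perfectly white features and grows as the spectrum becomes unbalanced:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Scale-invariant whiteness statistic for features x of shape (N, C):
    ~1.0 when each group's covariance is proportional to the identity,
    larger when a few directions dominate. A sketch matching the
    'Whitening: ... metric=... vs. limit=...' log lines."""
    n, c = x.shape
    assert c % num_groups == 0
    xg = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    cov = xg.transpose(1, 2) @ xg / n                     # (groups, d, d)
    mean_diag = cov.diagonal(dim1=1, dim2=2).mean(dim=1)  # mean eigenvalue
    mean_diag_sq = (cov @ cov).diagonal(dim1=1, dim2=2).mean(dim=1)
    return (mean_diag_sq / (mean_diag ** 2 + 1e-20)).mean()

print(whitening_metric(torch.randn(2000, 256)))  # ~1.0 for white noise
```

In this stretch of the log the metric usually sits under its limit, so the penalty fires only occasionally (e.g. the metric=16.18 vs. limit=15.0 entry further down).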
], batch size: 507, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:11:15,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.023e+02 4.317e+02 6.398e+02 9.994e+02, threshold=8.633e+02, percent-clipped=9.0 2023-06-18 18:11:17,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=271470.0, ans=0.125 2023-06-18 18:11:57,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-18 18:12:05,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-18 18:12:31,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271650.0, ans=0.125 2023-06-18 18:12:51,307 INFO [train.py:996] (3/4) Epoch 2, batch 14800, loss[loss=0.2899, simple_loss=0.3399, pruned_loss=0.12, over 21734.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3794, pruned_loss=0.1318, over 4263319.77 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:12:53,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=271770.0, ans=0.125 2023-06-18 18:12:59,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=271770.0, ans=0.04949747468305833 2023-06-18 18:14:31,693 INFO [train.py:996] (3/4) Epoch 2, batch 14850, loss[loss=0.2747, simple_loss=0.3455, pruned_loss=0.102, over 20798.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3728, pruned_loss=0.1314, over 4259911.31 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:14:33,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.362e+02 3.948e+02 5.141e+02 8.278e+02, threshold=7.896e+02, percent-clipped=0.0 2023-06-18 18:14:33,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=272070.0, ans=0.0 2023-06-18 18:15:28,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=272190.0, ans=0.0 2023-06-18 18:15:33,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=272190.0, ans=0.125 2023-06-18 18:16:09,676 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:16:13,870 INFO [train.py:996] (3/4) Epoch 2, batch 14900, loss[loss=0.3166, simple_loss=0.3788, pruned_loss=0.1272, over 21775.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3739, pruned_loss=0.1325, over 4257573.40 frames. ], batch size: 124, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:16:42,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-18 18:16:46,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. 
limit=5.0 2023-06-18 18:16:48,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=272430.0, ans=0.0 2023-06-18 18:16:59,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272430.0, ans=0.1 2023-06-18 18:17:18,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=272550.0, ans=0.125 2023-06-18 18:17:30,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=272550.0, ans=0.125 2023-06-18 18:17:41,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272610.0, ans=0.1 2023-06-18 18:18:07,143 INFO [train.py:996] (3/4) Epoch 2, batch 14950, loss[loss=0.3034, simple_loss=0.3672, pruned_loss=0.1198, over 21613.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.373, pruned_loss=0.1311, over 4257516.92 frames. ], batch size: 263, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:18:08,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.099e+02 3.898e+02 5.275e+02 1.469e+03, threshold=7.796e+02, percent-clipped=9.0 2023-06-18 18:18:20,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-18 18:18:23,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=272730.0, ans=0.125 2023-06-18 18:18:32,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272730.0, ans=0.1 2023-06-18 18:19:10,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=272850.0, ans=0.0 2023-06-18 18:19:44,471 INFO [train.py:996] (3/4) Epoch 2, batch 15000, loss[loss=0.3379, simple_loss=0.3872, pruned_loss=0.1443, over 21775.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.376, pruned_loss=0.1327, over 4264101.93 frames. ], batch size: 441, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:19:44,472 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 18:19:53,952 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4554, 3.0710, 2.2899, 3.4576], device='cuda:3') 2023-06-18 18:19:59,946 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2784, simple_loss=0.3732, pruned_loss=0.09186, over 1796401.00 frames. 2023-06-18 18:19:59,947 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 18:20:08,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=272970.0, ans=0.125 2023-06-18 18:20:21,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=273030.0, ans=0.025 2023-06-18 18:20:33,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=15.0 2023-06-18 18:21:34,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=273210.0, ans=0.0 2023-06-18 18:21:39,220 INFO [train.py:996] (3/4) Epoch 2, batch 15050, loss[loss=0.3204, simple_loss=0.3965, pruned_loss=0.1221, over 21857.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3775, pruned_loss=0.1337, over 4261009.53 frames. ], batch size: 316, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:21:42,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 3.625e+02 4.289e+02 5.445e+02 1.034e+03, threshold=8.577e+02, percent-clipped=3.0 2023-06-18 18:22:05,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-18 18:23:16,033 INFO [train.py:996] (3/4) Epoch 2, batch 15100, loss[loss=0.3726, simple_loss=0.4188, pruned_loss=0.1631, over 21778.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3802, pruned_loss=0.1331, over 4266370.13 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:23:45,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=273630.0, ans=0.125 2023-06-18 18:23:57,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=273690.0, ans=0.125 2023-06-18 18:24:21,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=273750.0, ans=0.125 2023-06-18 18:24:29,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=273750.0, ans=0.0 2023-06-18 18:24:34,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-18 18:24:45,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273810.0, ans=0.0 2023-06-18 18:24:50,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=273810.0, ans=0.125 2023-06-18 18:24:52,900 INFO [train.py:996] (3/4) Epoch 2, batch 15150, loss[loss=0.2553, simple_loss=0.3058, pruned_loss=0.1024, over 21659.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3756, pruned_loss=0.1332, over 4266542.33 frames. ], batch size: 231, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:24:56,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.235e+02 3.924e+02 4.419e+02 1.242e+03, threshold=7.848e+02, percent-clipped=3.0 2023-06-18 18:25:59,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=274050.0, ans=0.0 2023-06-18 18:26:15,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274110.0, ans=0.1 2023-06-18 18:26:16,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=274110.0, ans=0.125 2023-06-18 18:26:29,405 INFO [train.py:996] (3/4) Epoch 2, batch 15200, loss[loss=0.2617, simple_loss=0.3258, pruned_loss=0.09879, over 21826.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3655, pruned_loss=0.1284, over 4265887.40 frames. 
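The batch-15000 entries a little above show the periodic validation pass (train.py:1019/1028/1029): every valid_interval batches the loop averages a dev-set loss and reports the peak CUDA memory seen so far. A minimal sketch of that pattern, assuming a hypothetical compute_loss(model, batch) helper that returns a scalar loss and the number of frames it covers:

```python
import torch

def run_validation(model, valid_loader, compute_loss, device="cuda:3"):
    """Mirror of the 'Computing validation loss' / 'validation: loss=...' /
    'Maximum memory allocated so far is ...MB' log lines."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()

    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
```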
], batch size: 317, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:26:36,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-18 18:27:25,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=274290.0, ans=0.0 2023-06-18 18:27:26,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=274290.0, ans=0.125 2023-06-18 18:27:34,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=274350.0, ans=0.125 2023-06-18 18:28:05,728 INFO [train.py:996] (3/4) Epoch 2, batch 15250, loss[loss=0.325, simple_loss=0.365, pruned_loss=0.1425, over 21863.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3592, pruned_loss=0.1266, over 4273892.01 frames. ], batch size: 98, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:28:08,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.957e+02 3.425e+02 4.072e+02 6.895e+02, threshold=6.850e+02, percent-clipped=0.0 2023-06-18 18:29:43,126 INFO [train.py:996] (3/4) Epoch 2, batch 15300, loss[loss=0.3642, simple_loss=0.4028, pruned_loss=0.1628, over 21837.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3651, pruned_loss=0.1309, over 4260078.92 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:30:05,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274770.0, ans=0.0 2023-06-18 18:31:09,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=275010.0, ans=0.0 2023-06-18 18:31:19,242 INFO [train.py:996] (3/4) Epoch 2, batch 15350, loss[loss=0.2933, simple_loss=0.3779, pruned_loss=0.1043, over 21803.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3713, pruned_loss=0.1332, over 4266687.86 frames. ], batch size: 332, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:31:22,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.404e+02 4.041e+02 5.108e+02 1.058e+03, threshold=8.082e+02, percent-clipped=7.0 2023-06-18 18:31:45,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=275130.0, ans=0.125 2023-06-18 18:32:08,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=275190.0, ans=0.07 2023-06-18 18:32:12,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=275190.0, ans=0.125 2023-06-18 18:32:49,574 INFO [train.py:996] (3/4) Epoch 2, batch 15400, loss[loss=0.3334, simple_loss=0.3761, pruned_loss=0.1453, over 21869.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3716, pruned_loss=0.13, over 4255187.20 frames. 
], batch size: 118, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:33:15,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=275430.0, ans=0.0 2023-06-18 18:33:49,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=275490.0, ans=0.0 2023-06-18 18:34:23,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0 2023-06-18 18:34:26,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275670.0, ans=0.1 2023-06-18 18:34:27,579 INFO [train.py:996] (3/4) Epoch 2, batch 15450, loss[loss=0.2569, simple_loss=0.3152, pruned_loss=0.0993, over 21260.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3673, pruned_loss=0.1279, over 4266368.63 frames. ], batch size: 159, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:34:30,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.287e+02 3.877e+02 4.804e+02 8.434e+02, threshold=7.754e+02, percent-clipped=1.0 2023-06-18 18:34:52,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=275730.0, ans=0.09899494936611666 2023-06-18 18:35:22,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-18 18:35:39,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.90 vs. limit=10.0 2023-06-18 18:35:45,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-18 18:36:02,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=275910.0, ans=0.0 2023-06-18 18:36:05,300 INFO [train.py:996] (3/4) Epoch 2, batch 15500, loss[loss=0.3116, simple_loss=0.3708, pruned_loss=0.1262, over 21651.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3696, pruned_loss=0.1284, over 4259300.27 frames. ], batch size: 263, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:37:09,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=276090.0, ans=0.125 2023-06-18 18:37:14,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276150.0, ans=0.125 2023-06-18 18:37:22,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276150.0, ans=0.1 2023-06-18 18:37:47,891 INFO [train.py:996] (3/4) Epoch 2, batch 15550, loss[loss=0.3338, simple_loss=0.3789, pruned_loss=0.1443, over 21545.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3661, pruned_loss=0.1253, over 4262746.72 frames. 
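The grad_scale field in the per-batch summaries is the mixed-precision loss-scale: it holds at 32.0 through most of this stretch, doubles to 64.0 around batch 14850, and is back at 32.0 by batch 15050. That is the standard dynamic-scaling policy of doubling after a long overflow-free window and halving on overflow. A sketch of one step using torch's stock GradScaler, which implements that policy (the recipe's train.py logs the scaler's current scale each batch); compute_loss is the same hypothetical helper as in the validation sketch:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0,     # cf. 'grad_scale: 32.0' above
                    growth_factor=2.0,   # 32.0 -> 64.0 after a clean window
                    backoff_factor=0.5,  # halved again on the next overflow
                    growth_interval=2000)

def fp16_step(model, optimizer, batch, compute_loss):
    """One mixed-precision training step."""
    optimizer.zero_grad()
    with autocast():
        loss, _ = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # silently skipped if the gradients overflowed
    scaler.update()         # adjusts the scale that the log prints
    return scaler.get_scale()
```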
], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:37:51,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.141e+02 3.857e+02 4.872e+02 7.208e+02, threshold=7.715e+02, percent-clipped=0.0 2023-06-18 18:37:54,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276270.0, ans=0.125 2023-06-18 18:38:21,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=276330.0, ans=0.125 2023-06-18 18:38:43,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=276390.0, ans=0.0 2023-06-18 18:39:17,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=276510.0, ans=0.125 2023-06-18 18:39:24,629 INFO [train.py:996] (3/4) Epoch 2, batch 15600, loss[loss=0.2627, simple_loss=0.3103, pruned_loss=0.1075, over 21392.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3589, pruned_loss=0.1231, over 4256225.45 frames. ], batch size: 194, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:39:56,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=276630.0, ans=0.0 2023-06-18 18:40:03,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=276630.0, ans=0.125 2023-06-18 18:40:04,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=276630.0, ans=0.5 2023-06-18 18:41:13,296 INFO [train.py:996] (3/4) Epoch 2, batch 15650, loss[loss=0.2968, simple_loss=0.3332, pruned_loss=0.1302, over 21782.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3586, pruned_loss=0.1238, over 4257597.69 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:41:16,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.146e+02 3.937e+02 5.420e+02 1.080e+03, threshold=7.874e+02, percent-clipped=10.0 2023-06-18 18:42:03,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=277050.0, ans=0.07 2023-06-18 18:42:17,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=277050.0, ans=0.125 2023-06-18 18:42:18,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277110.0, ans=0.1 2023-06-18 18:42:32,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=277110.0, ans=0.125 2023-06-18 18:42:49,642 INFO [train.py:996] (3/4) Epoch 2, batch 15700, loss[loss=0.2593, simple_loss=0.3057, pruned_loss=0.1065, over 21607.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3555, pruned_loss=0.1224, over 4255960.03 frames. ], batch size: 247, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:43:16,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-18 18:43:21,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. 
limit=15.0 2023-06-18 18:43:30,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=277290.0, ans=0.125 2023-06-18 18:43:35,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277290.0, ans=0.1 2023-06-18 18:43:50,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=277350.0, ans=0.0 2023-06-18 18:44:19,892 INFO [train.py:996] (3/4) Epoch 2, batch 15750, loss[loss=0.2817, simple_loss=0.3284, pruned_loss=0.1175, over 21324.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3496, pruned_loss=0.1203, over 4252392.30 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:44:28,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.151e+02 3.926e+02 5.162e+02 7.477e+02, threshold=7.853e+02, percent-clipped=0.0 2023-06-18 18:44:29,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-18 18:44:31,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=277470.0, ans=0.05 2023-06-18 18:44:31,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=277470.0, ans=0.0 2023-06-18 18:44:47,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=277530.0, ans=0.0 2023-06-18 18:45:29,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=277650.0, ans=0.125 2023-06-18 18:46:01,542 INFO [train.py:996] (3/4) Epoch 2, batch 15800, loss[loss=0.3043, simple_loss=0.344, pruned_loss=0.1324, over 21874.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3451, pruned_loss=0.1197, over 4259498.00 frames. ], batch size: 373, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:46:03,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=277770.0, ans=0.125 2023-06-18 18:46:10,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=277770.0, ans=15.0 2023-06-18 18:46:37,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=277890.0, ans=0.2 2023-06-18 18:46:47,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=277890.0, ans=0.125 2023-06-18 18:46:55,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-18 18:47:06,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=12.0 2023-06-18 18:47:30,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=278010.0, ans=0.04949747468305833 2023-06-18 18:47:34,335 INFO [train.py:996] (3/4) Epoch 2, batch 15850, loss[loss=0.3112, simple_loss=0.361, pruned_loss=0.1307, over 21702.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3504, pruned_loss=0.1239, over 4260646.65 frames. 
], batch size: 351, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:47:37,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.232e+02 3.999e+02 5.094e+02 1.481e+03, threshold=7.998e+02, percent-clipped=6.0 2023-06-18 18:48:07,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-18 18:48:13,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-18 18:48:19,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-18 18:48:22,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=278190.0, ans=0.1 2023-06-18 18:48:30,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-18 18:49:10,699 INFO [train.py:996] (3/4) Epoch 2, batch 15900, loss[loss=0.2782, simple_loss=0.3253, pruned_loss=0.1155, over 22021.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3512, pruned_loss=0.124, over 4261035.95 frames. ], batch size: 103, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:49:44,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-18 18:49:59,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=278490.0, ans=0.125 2023-06-18 18:50:41,655 INFO [train.py:996] (3/4) Epoch 2, batch 15950, loss[loss=0.227, simple_loss=0.3273, pruned_loss=0.06333, over 21650.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3488, pruned_loss=0.1196, over 4266686.53 frames. ], batch size: 389, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:50:49,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.268e+02 3.930e+02 5.229e+02 1.203e+03, threshold=7.860e+02, percent-clipped=8.0 2023-06-18 18:51:20,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-18 18:52:18,718 INFO [train.py:996] (3/4) Epoch 2, batch 16000, loss[loss=0.2399, simple_loss=0.3063, pruned_loss=0.08674, over 21896.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3487, pruned_loss=0.1166, over 4234420.12 frames. ], batch size: 98, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:52:30,242 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:53:49,659 INFO [train.py:996] (3/4) Epoch 2, batch 16050, loss[loss=0.2309, simple_loss=0.3101, pruned_loss=0.07586, over 21420.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3496, pruned_loss=0.1138, over 4238797.35 frames. 
], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:53:57,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 3.161e+02 3.861e+02 4.688e+02 7.896e+02, threshold=7.722e+02, percent-clipped=1.0 2023-06-18 18:54:07,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=279270.0, ans=0.125 2023-06-18 18:55:25,215 INFO [train.py:996] (3/4) Epoch 2, batch 16100, loss[loss=0.2917, simple_loss=0.3485, pruned_loss=0.1175, over 21328.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3545, pruned_loss=0.1162, over 4257252.44 frames. ], batch size: 176, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:56:45,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=279810.0, ans=0.125 2023-06-18 18:56:49,489 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:56:55,132 INFO [train.py:996] (3/4) Epoch 2, batch 16150, loss[loss=0.3204, simple_loss=0.3661, pruned_loss=0.1373, over 21954.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3553, pruned_loss=0.1197, over 4271849.35 frames. ], batch size: 316, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:57:02,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.360e+02 4.498e+02 6.275e+02 1.287e+03, threshold=8.996e+02, percent-clipped=10.0 2023-06-18 18:58:14,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280110.0, ans=0.1 2023-06-18 18:58:35,856 INFO [train.py:996] (3/4) Epoch 2, batch 16200, loss[loss=0.3746, simple_loss=0.4252, pruned_loss=0.162, over 21647.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3603, pruned_loss=0.122, over 4276917.14 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 18:58:47,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=280170.0, ans=0.125 2023-06-18 18:59:04,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-18 18:59:19,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=280290.0, ans=0.2 2023-06-18 18:59:44,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-18 18:59:45,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=280350.0, ans=0.0 2023-06-18 18:59:51,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=280410.0, ans=0.2 2023-06-18 19:00:17,021 INFO [train.py:996] (3/4) Epoch 2, batch 16250, loss[loss=0.2531, simple_loss=0.3236, pruned_loss=0.09131, over 21669.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3612, pruned_loss=0.1222, over 4279978.51 frames. 
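The balancer values being scheduled throughout (min_positive, max_positive, min_abs, max_abs, and a prob that is mostly 0.125 here) belong to activation balancers: on a random prob fraction of batches they add a small extra gradient to channels whose statistics have strayed outside the configured range, for instance keeping at least a min_positive fraction of a channel's values positive or its mean absolute value below max_abs. A rough sketch of the idea as a custom autograd function; the real Balancer in icefall's scaling.py is considerably more careful about magnitudes and directions:

```python
import torch

class BalancerFunction(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, adds a small
    penalty gradient to channels violating min_positive / max_abs style
    constraints. A rough sketch of the balancers named in the log."""

    @staticmethod
    def forward(ctx, x, min_positive=0.05, max_abs=10.0, grad_scale=0.01):
        ctx.save_for_backward(x)
        ctx.cfg = (min_positive, max_abs, grad_scale)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        min_positive, max_abs, grad_scale = ctx.cfg
        # Fraction of positive values and mean |x| per (last-dim) channel.
        dims = tuple(range(x.dim() - 1))
        pos_frac = (x > 0).float().mean(dim=dims)
        mean_abs = x.abs().mean(dim=dims)
        penalty = torch.zeros_like(x)
        # Push values upward where too few of them are positive ...
        penalty -= (pos_frac < min_positive).float() * grad_scale
        # ... and shrink channels whose magnitude is too large.
        penalty += (mean_abs > max_abs).float() * grad_scale * x.sign()
        scale = grad_output.abs().mean()  # keep the nudge proportionate
        return grad_output + penalty * scale, None, None, None

y = BalancerFunction.apply(torch.randn(8, 256, requires_grad=True))
```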
], batch size: 391, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:00:19,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.939e+02 3.512e+02 5.146e+02 1.306e+03, threshold=7.023e+02, percent-clipped=4.0 2023-06-18 19:00:20,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=280470.0, ans=0.125 2023-06-18 19:00:40,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=280530.0, ans=0.125 2023-06-18 19:00:44,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=280530.0, ans=0.0 2023-06-18 19:00:50,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-18 19:01:09,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-18 19:01:11,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=280650.0, ans=10.0 2023-06-18 19:01:16,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-18 19:01:48,540 INFO [train.py:996] (3/4) Epoch 2, batch 16300, loss[loss=0.2532, simple_loss=0.3422, pruned_loss=0.08204, over 21622.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3546, pruned_loss=0.1167, over 4277410.21 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:02:06,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=280770.0, ans=0.0 2023-06-18 19:02:13,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=280830.0, ans=0.0 2023-06-18 19:02:17,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=280830.0, ans=0.2 2023-06-18 19:02:23,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280890.0, ans=0.1 2023-06-18 19:02:40,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=280950.0, ans=0.0 2023-06-18 19:03:19,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=281010.0, ans=0.125 2023-06-18 19:03:31,605 INFO [train.py:996] (3/4) Epoch 2, batch 16350, loss[loss=0.2997, simple_loss=0.4131, pruned_loss=0.09318, over 20833.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3558, pruned_loss=0.1178, over 4263924.59 frames. ], batch size: 608, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:03:34,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.295e+02 4.517e+02 5.318e+02 1.033e+03, threshold=9.034e+02, percent-clipped=9.0 2023-06-18 19:04:11,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=281190.0, ans=0.0 2023-06-18 19:05:08,342 INFO [train.py:996] (3/4) Epoch 2, batch 16400, loss[loss=0.3158, simple_loss=0.365, pruned_loss=0.1333, over 19980.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3604, pruned_loss=0.1197, over 4260138.35 frames. 
], batch size: 702, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:05:15,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-18 19:05:27,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=281430.0, ans=0.02 2023-06-18 19:05:37,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=281490.0, ans=0.2 2023-06-18 19:05:47,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=281490.0, ans=0.0 2023-06-18 19:05:50,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=281490.0, ans=0.05 2023-06-18 19:06:13,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-18 19:06:15,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=281610.0, ans=0.0 2023-06-18 19:06:43,705 INFO [train.py:996] (3/4) Epoch 2, batch 16450, loss[loss=0.2897, simple_loss=0.3368, pruned_loss=0.1213, over 21459.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3585, pruned_loss=0.1205, over 4271487.66 frames. ], batch size: 194, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:06:46,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.250e+02 3.733e+02 4.517e+02 7.172e+02, threshold=7.466e+02, percent-clipped=0.0 2023-06-18 19:06:50,160 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:07:16,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=281790.0, ans=0.0 2023-06-18 19:07:31,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281850.0, ans=0.1 2023-06-18 19:07:35,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281850.0, ans=0.1 2023-06-18 19:08:18,608 INFO [train.py:996] (3/4) Epoch 2, batch 16500, loss[loss=0.2304, simple_loss=0.2892, pruned_loss=0.08584, over 21632.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3571, pruned_loss=0.1209, over 4268302.00 frames. ], batch size: 230, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:08:46,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.18 vs. limit=15.0 2023-06-18 19:09:00,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=282090.0, ans=0.125 2023-06-18 19:09:57,179 INFO [train.py:996] (3/4) Epoch 2, batch 16550, loss[loss=0.3118, simple_loss=0.3791, pruned_loss=0.1222, over 21312.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3579, pruned_loss=0.1193, over 4263684.86 frames. 
], batch size: 548, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:10:00,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.451e+02 4.279e+02 5.005e+02 9.425e+02, threshold=8.558e+02, percent-clipped=2.0 2023-06-18 19:10:13,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=282330.0, ans=0.125 2023-06-18 19:10:39,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282390.0, ans=0.1 2023-06-18 19:11:21,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-18 19:11:28,845 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:11:36,186 INFO [train.py:996] (3/4) Epoch 2, batch 16600, loss[loss=0.3548, simple_loss=0.4371, pruned_loss=0.1362, over 21782.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3692, pruned_loss=0.1251, over 4267526.14 frames. ], batch size: 282, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:11:48,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-18 19:11:56,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-06-18 19:12:30,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=282690.0, ans=0.2 2023-06-18 19:12:37,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=282690.0, ans=0.125 2023-06-18 19:13:15,406 INFO [train.py:996] (3/4) Epoch 2, batch 16650, loss[loss=0.3122, simple_loss=0.3807, pruned_loss=0.1219, over 21381.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3762, pruned_loss=0.1275, over 4267036.18 frames. ], batch size: 176, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:13:18,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.486e+02 3.999e+02 5.162e+02 7.260e+02, threshold=7.998e+02, percent-clipped=0.0 2023-06-18 19:13:38,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=282870.0, ans=0.125 2023-06-18 19:14:04,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=282930.0, ans=0.125 2023-06-18 19:14:55,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-18 19:15:07,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283110.0, ans=0.1 2023-06-18 19:15:10,015 INFO [train.py:996] (3/4) Epoch 2, batch 16700, loss[loss=0.2714, simple_loss=0.3136, pruned_loss=0.1146, over 21821.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3738, pruned_loss=0.1267, over 4268021.54 frames. 
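Two attention diagnostics also recur: [zipformer.py:1728] prints per-head entropies of the self-attention weights during validation (the attn_weights_entropy = tensor([...]) entry at batch 15000 earlier), and the [scaling.py:1052] WithLoss entries report an auxiliary loss attached to self_attn_weights, all 0.000e+00 in this stretch. A sketch of the entropy computation, assuming weights of shape (num_heads, tgt_len, src_len) that already sum to one over the last axis:

```python
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Mean entropy in nats per head for attention weights shaped
    (num_heads, tgt_len, src_len): near 0 means peaky attention, near
    log(src_len) means nearly uniform. Mirrors the 'attn_weights_entropy =
    tensor([...])' validation diagnostic."""
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (num_heads, tgt_len)
    return ent.mean(dim=-1)                           # one value per head

attn = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # four per-head entropies, cf. the log
```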
], batch size: 124, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:15:10,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=283170.0, ans=0.125 2023-06-18 19:15:20,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=283170.0, ans=0.0 2023-06-18 19:15:28,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=15.0 2023-06-18 19:15:43,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-18 19:15:56,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-18 19:16:51,998 INFO [train.py:996] (3/4) Epoch 2, batch 16750, loss[loss=0.3356, simple_loss=0.3965, pruned_loss=0.1374, over 21747.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3788, pruned_loss=0.1311, over 4265324.91 frames. ], batch size: 332, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:16:55,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.344e+02 3.957e+02 4.930e+02 8.837e+02, threshold=7.914e+02, percent-clipped=3.0 2023-06-18 19:18:30,452 INFO [train.py:996] (3/4) Epoch 2, batch 16800, loss[loss=0.2865, simple_loss=0.3461, pruned_loss=0.1134, over 21815.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3826, pruned_loss=0.1315, over 4269491.38 frames. ], batch size: 298, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:18:50,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283830.0, ans=0.125 2023-06-18 19:18:52,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=283830.0, ans=10.0 2023-06-18 19:19:02,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-18 19:19:21,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-18 19:20:05,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284070.0, ans=0.1 2023-06-18 19:20:06,222 INFO [train.py:996] (3/4) Epoch 2, batch 16850, loss[loss=0.2872, simple_loss=0.3323, pruned_loss=0.121, over 21587.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3785, pruned_loss=0.1317, over 4273414.84 frames. 
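
The scaling.py:182 ScheduledFloat lines show module hyperparameters (dropout_p, conv_skip_rate, balancer probabilities, bypass scale_min, ...) reported as functions of batch_count rather than as constants, so regularization is annealed as training progresses. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints below are illustrative, not the run's actual ones:

    # Piecewise-linear schedule over batch_count, in the spirit of the
    # ScheduledFloat values logged above. Breakpoints are illustrative only.
    class ScheduledValue:
        def __init__(self, *points):  # (batch_count, value) pairs, ascending
            self.points = list(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
            return pts[-1][1]

    dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(283110.0))  # far past the last breakpoint -> 0.1
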
], batch size: 212, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:20:09,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.797e+02 4.349e+02 5.361e+02 8.347e+02, threshold=8.698e+02, percent-clipped=3.0 2023-06-18 19:20:15,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=284070.0, ans=0.125 2023-06-18 19:20:28,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=284130.0, ans=0.2 2023-06-18 19:20:31,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=284130.0, ans=0.125 2023-06-18 19:20:46,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-18 19:21:37,622 INFO [train.py:996] (3/4) Epoch 2, batch 16900, loss[loss=0.3204, simple_loss=0.3554, pruned_loss=0.1427, over 20125.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.374, pruned_loss=0.1303, over 4278010.54 frames. ], batch size: 707, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:21:49,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-18 19:21:58,645 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:23:09,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2023-06-18 19:23:13,326 INFO [train.py:996] (3/4) Epoch 2, batch 16950, loss[loss=0.2913, simple_loss=0.3452, pruned_loss=0.1187, over 21894.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3665, pruned_loss=0.1279, over 4277502.93 frames. ], batch size: 351, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:23:16,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.116e+02 4.142e+02 5.213e+02 8.477e+02, threshold=8.284e+02, percent-clipped=0.0 2023-06-18 19:23:51,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=284790.0, ans=0.0 2023-06-18 19:24:36,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=284910.0, ans=0.025 2023-06-18 19:24:45,708 INFO [train.py:996] (3/4) Epoch 2, batch 17000, loss[loss=0.2805, simple_loss=0.3375, pruned_loss=0.1118, over 21517.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3628, pruned_loss=0.1272, over 4283474.77 frames. ], batch size: 131, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:24:50,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=284970.0, ans=0.07 2023-06-18 19:25:56,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=285150.0, ans=0.125 2023-06-18 19:26:13,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=15.0 2023-06-18 19:26:17,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285210.0, ans=0.1 2023-06-18 19:26:22,062 INFO [train.py:996] (3/4) Epoch 2, batch 17050, loss[loss=0.2873, simple_loss=0.3412, pruned_loss=0.1167, over 21198.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3701, pruned_loss=0.1304, over 4287715.72 frames. ], batch size: 608, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:26:25,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.160e+02 3.853e+02 4.663e+02 1.166e+03, threshold=7.706e+02, percent-clipped=2.0 2023-06-18 19:26:30,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=285270.0, ans=0.125 2023-06-18 19:27:13,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=285390.0, ans=0.125 2023-06-18 19:27:56,608 INFO [train.py:996] (3/4) Epoch 2, batch 17100, loss[loss=0.2824, simple_loss=0.3383, pruned_loss=0.1133, over 21723.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3692, pruned_loss=0.1308, over 4287959.32 frames. ], batch size: 230, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:28:06,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.07 vs. limit=10.0 2023-06-18 19:28:10,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=285630.0, ans=0.125 2023-06-18 19:28:41,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.64 vs. limit=5.0 2023-06-18 19:29:16,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.86 vs. limit=22.5 2023-06-18 19:29:20,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=285810.0, ans=0.125 2023-06-18 19:29:26,163 INFO [train.py:996] (3/4) Epoch 2, batch 17150, loss[loss=0.2705, simple_loss=0.3164, pruned_loss=0.1123, over 21247.00 frames. ], tot_loss[loss=0.311, simple_loss=0.364, pruned_loss=0.129, over 4285069.64 frames. ], batch size: 608, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:29:29,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.459e+02 4.285e+02 5.134e+02 9.644e+02, threshold=8.570e+02, percent-clipped=5.0 2023-06-18 19:29:30,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=285870.0, ans=0.0 2023-06-18 19:30:11,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=285990.0, ans=0.125 2023-06-18 19:30:47,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286110.0, ans=0.1 2023-06-18 19:30:57,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=286110.0, ans=0.125 2023-06-18 19:31:03,076 INFO [train.py:996] (3/4) Epoch 2, batch 17200, loss[loss=0.3867, simple_loss=0.4156, pruned_loss=0.1789, over 21864.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3645, pruned_loss=0.1298, over 4285045.29 frames. 
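
The scaling.py:962 Whitening lines compare a per-module statistic against a limit (e.g. encoder_embed.convnext.out_whiten above: metric 3.64 vs. limit 5.0); the metric appears to measure how far a module's output covariance is from isotropic. One plausible formulation, assumed here: with C channels and covariance eigenvalues l_i, metric = C * sum(l_i^2) / (sum(l_i))^2, which equals 1.0 for a perfectly "white" covariance and grows as variance concentrates in a few directions:

    import torch

    # Assumed whitening metric: C * sum(l_i^2) / (sum(l_i))^2 over the
    # eigenvalues of the feature covariance. Equals 1.0 iff the covariance
    # is a multiple of the identity; larger means less "white".
    def whitening_metric(x: torch.Tensor) -> float:
        x = x.reshape(-1, x.shape[-1])      # (frames, channels)
        x = x - x.mean(dim=0)
        cov = (x.T @ x) / x.shape[0]        # channel covariance, (C, C)
        num_channels = cov.shape[0]
        # trace(cov @ cov) = sum of squared eigenvalues; trace(cov) = their sum
        return (num_channels * torch.trace(cov @ cov)
                / torch.trace(cov) ** 2).item()

    print(whitening_metric(torch.randn(1000, 256)))  # near-white -> close to 1.0
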
], batch size: 371, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:31:06,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=286170.0, ans=0.125 2023-06-18 19:31:08,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=286170.0, ans=0.125 2023-06-18 19:31:56,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=286290.0, ans=0.0 2023-06-18 19:32:15,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=286350.0, ans=0.5 2023-06-18 19:32:50,084 INFO [train.py:996] (3/4) Epoch 2, batch 17250, loss[loss=0.4406, simple_loss=0.471, pruned_loss=0.2051, over 21723.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.367, pruned_loss=0.1308, over 4283014.36 frames. ], batch size: 441, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:32:53,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.381e+02 4.035e+02 5.054e+02 8.566e+02, threshold=8.070e+02, percent-clipped=0.0 2023-06-18 19:33:18,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-18 19:33:40,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=286590.0, ans=0.0 2023-06-18 19:34:33,913 INFO [train.py:996] (3/4) Epoch 2, batch 17300, loss[loss=0.3239, simple_loss=0.3879, pruned_loss=0.1299, over 21766.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.377, pruned_loss=0.1359, over 4274630.85 frames. ], batch size: 332, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:34:51,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=286770.0, ans=0.125 2023-06-18 19:34:53,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286770.0, ans=0.1 2023-06-18 19:35:07,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=286830.0, ans=0.125 2023-06-18 19:36:18,562 INFO [train.py:996] (3/4) Epoch 2, batch 17350, loss[loss=0.2833, simple_loss=0.3558, pruned_loss=0.1054, over 21710.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3759, pruned_loss=0.135, over 4267342.34 frames. ], batch size: 298, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:36:23,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.256e+02 3.949e+02 5.005e+02 1.104e+03, threshold=7.898e+02, percent-clipped=4.0 2023-06-18 19:36:50,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.06 vs. limit=10.0 2023-06-18 19:37:42,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=287310.0, ans=0.125 2023-06-18 19:37:50,472 INFO [train.py:996] (3/4) Epoch 2, batch 17400, loss[loss=0.2216, simple_loss=0.2784, pruned_loss=0.08243, over 21198.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3703, pruned_loss=0.13, over 4265433.76 frames. 
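
The grad_scale field in the loss lines tracks the dynamic fp16 loss-scaling factor: it doubled from 32 to 64 between batches 17000 and 17050 and fell back to 32 by batch 17300 (further below it halves again, to 16 around batch 18650), the usual grow-until-overflow pattern. A sketch assuming torch.cuda.amp.GradScaler semantics; the constructor arguments below are illustrative:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    # Dynamic loss scaling as in the grad_scale values above: the scale grows
    # 2x after a run of overflow-free steps and is halved whenever scaled
    # gradients contain inf/nan (hence 32 -> 64 -> 32 in the log).
    scaler = GradScaler(init_scale=32.0, growth_factor=2.0,
                        backoff_factor=0.5, growth_interval=2000)

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skips the update if gradients overflowed
        scaler.update()         # backs off or grows the scale accordingly
        return scaler.get_scale()
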
], batch size: 159, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:37:55,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=287370.0, ans=0.2 2023-06-18 19:38:14,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-18 19:38:25,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=287490.0, ans=0.0 2023-06-18 19:38:50,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-18 19:39:10,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=287610.0, ans=0.0 2023-06-18 19:39:24,225 INFO [train.py:996] (3/4) Epoch 2, batch 17450, loss[loss=0.2382, simple_loss=0.3119, pruned_loss=0.08225, over 21250.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.365, pruned_loss=0.1252, over 4267559.42 frames. ], batch size: 176, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:39:26,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=287670.0, ans=0.0 2023-06-18 19:39:28,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.947e+02 3.682e+02 4.725e+02 7.588e+02, threshold=7.364e+02, percent-clipped=0.0 2023-06-18 19:39:50,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=287730.0, ans=0.125 2023-06-18 19:40:16,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-18 19:40:36,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=287850.0, ans=0.0 2023-06-18 19:40:54,705 INFO [train.py:996] (3/4) Epoch 2, batch 17500, loss[loss=0.2731, simple_loss=0.3278, pruned_loss=0.1093, over 21819.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.36, pruned_loss=0.1217, over 4277486.00 frames. ], batch size: 282, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:41:05,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287970.0, ans=0.1 2023-06-18 19:42:25,497 INFO [train.py:996] (3/4) Epoch 2, batch 17550, loss[loss=0.2908, simple_loss=0.3612, pruned_loss=0.1103, over 21197.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3607, pruned_loss=0.1199, over 4273434.44 frames. ], batch size: 159, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:42:30,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 3.260e+02 4.297e+02 5.739e+02 1.320e+03, threshold=8.594e+02, percent-clipped=8.0 2023-06-18 19:42:33,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=288270.0, ans=0.125 2023-06-18 19:43:17,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=288390.0, ans=0.125 2023-06-18 19:43:46,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=12.0 2023-06-18 19:43:49,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288510.0, ans=0.1 2023-06-18 19:43:55,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=288510.0, ans=0.0 2023-06-18 19:44:01,360 INFO [train.py:996] (3/4) Epoch 2, batch 17600, loss[loss=0.304, simple_loss=0.3609, pruned_loss=0.1236, over 21770.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3605, pruned_loss=0.1195, over 4267947.76 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:44:40,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288690.0, ans=0.125 2023-06-18 19:45:19,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-18 19:45:31,151 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:45:34,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.11 vs. limit=10.0 2023-06-18 19:45:39,885 INFO [train.py:996] (3/4) Epoch 2, batch 17650, loss[loss=0.1967, simple_loss=0.2458, pruned_loss=0.07376, over 21225.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3586, pruned_loss=0.1197, over 4269212.51 frames. ], batch size: 176, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:45:40,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=288870.0, ans=0.125 2023-06-18 19:45:41,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=288870.0, ans=0.2 2023-06-18 19:45:44,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.915e+02 4.145e+02 5.677e+02 8.803e+02, threshold=8.289e+02, percent-clipped=2.0 2023-06-18 19:46:40,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.17 vs. limit=22.5 2023-06-18 19:47:13,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=289110.0, ans=0.09899494936611666 2023-06-18 19:47:17,282 INFO [train.py:996] (3/4) Epoch 2, batch 17700, loss[loss=0.2911, simple_loss=0.3724, pruned_loss=0.1049, over 21812.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3502, pruned_loss=0.115, over 4266005.78 frames. ], batch size: 282, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:47:34,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=289170.0, ans=0.0 2023-06-18 19:47:44,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=289230.0, ans=0.0 2023-06-18 19:48:12,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=289290.0, ans=0.0 2023-06-18 19:48:55,990 INFO [train.py:996] (3/4) Epoch 2, batch 17750, loss[loss=0.3653, simple_loss=0.4144, pruned_loss=0.1582, over 21690.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.363, pruned_loss=0.1222, over 4268986.84 frames. 
], batch size: 351, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:49:10,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.081e+02 4.126e+02 5.116e+02 1.162e+03, threshold=8.253e+02, percent-clipped=4.0 2023-06-18 19:49:20,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289530.0, ans=0.0 2023-06-18 19:50:07,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=289650.0, ans=0.0 2023-06-18 19:50:24,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-18 19:50:40,986 INFO [train.py:996] (3/4) Epoch 2, batch 17800, loss[loss=0.297, simple_loss=0.362, pruned_loss=0.116, over 21740.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3625, pruned_loss=0.1215, over 4273175.65 frames. ], batch size: 332, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:50:51,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-18 19:52:04,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=290010.0, ans=0.0 2023-06-18 19:52:15,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=290010.0, ans=0.0 2023-06-18 19:52:30,844 INFO [train.py:996] (3/4) Epoch 2, batch 17850, loss[loss=0.3292, simple_loss=0.3605, pruned_loss=0.149, over 20174.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3629, pruned_loss=0.1217, over 4273701.74 frames. ], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:52:35,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.113e+02 3.565e+02 4.214e+02 1.146e+03, threshold=7.130e+02, percent-clipped=1.0 2023-06-18 19:53:21,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=290250.0, ans=0.2 2023-06-18 19:53:41,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-18 19:54:08,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=290370.0, ans=0.0 2023-06-18 19:54:09,667 INFO [train.py:996] (3/4) Epoch 2, batch 17900, loss[loss=0.3444, simple_loss=0.4017, pruned_loss=0.1435, over 21488.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3688, pruned_loss=0.1243, over 4273384.59 frames. ], batch size: 131, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:54:29,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=290430.0, ans=0.0 2023-06-18 19:55:24,159 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:55:30,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=290610.0, ans=0.0 2023-06-18 19:55:47,648 INFO [train.py:996] (3/4) Epoch 2, batch 17950, loss[loss=0.3242, simple_loss=0.3984, pruned_loss=0.125, over 21498.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3691, pruned_loss=0.1204, over 4274304.63 frames. 
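
Across this stretch the learning rate creeps down from 1.63e-02 (batch 16450) to 1.61e-02, with batch numbers still rising inside epoch 2. That decay is consistent with icefall's Eden schedule under the run's configured base_lr=0.045, lr_batches=7500 and lr_epochs=1.5; the formula below is an assumed reconstruction of that scheduler, and the step count is approximate:

    # Eden-style learning-rate factor (assumed form). With epoch=2 and a
    # cumulative step count around 3.4e4, this lands near the lr values
    # logged in this region.
    def eden_lr(step: float, epoch: float, base_lr: float = 0.045,
                lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
        step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * step_factor * epoch_factor

    print(eden_lr(step=34000, epoch=2))  # ~1.6e-02
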
], batch size: 471, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:55:52,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.329e+02 4.288e+02 5.696e+02 7.966e+02, threshold=8.576e+02, percent-clipped=5.0 2023-06-18 19:56:34,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=290790.0, ans=0.1 2023-06-18 19:56:49,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=290850.0, ans=0.0 2023-06-18 19:57:22,921 INFO [train.py:996] (3/4) Epoch 2, batch 18000, loss[loss=0.3056, simple_loss=0.3449, pruned_loss=0.1331, over 21739.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3606, pruned_loss=0.119, over 4274370.57 frames. ], batch size: 371, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:57:22,922 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 19:57:38,933 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2951, simple_loss=0.3927, pruned_loss=0.09871, over 1796401.00 frames. 2023-06-18 19:57:38,934 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 19:57:39,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=290970.0, ans=0.125 2023-06-18 19:58:35,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=291090.0, ans=0.2 2023-06-18 19:59:15,532 INFO [train.py:996] (3/4) Epoch 2, batch 18050, loss[loss=0.283, simple_loss=0.3345, pruned_loss=0.1157, over 21428.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3556, pruned_loss=0.1179, over 4271534.30 frames. ], batch size: 211, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:59:24,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.442e+02 4.328e+02 5.219e+02 8.565e+02, threshold=8.656e+02, percent-clipped=0.0 2023-06-18 19:59:34,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=291330.0, ans=0.04949747468305833 2023-06-18 19:59:46,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=291330.0, ans=0.125 2023-06-18 20:00:02,934 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:00:41,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291510.0, ans=0.1 2023-06-18 20:00:57,636 INFO [train.py:996] (3/4) Epoch 2, batch 18100, loss[loss=0.3287, simple_loss=0.3989, pruned_loss=0.1292, over 21636.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3604, pruned_loss=0.1205, over 4277566.97 frames. ], batch size: 414, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:00:58,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-18 20:01:03,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=12.0 2023-06-18 20:01:07,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=291570.0, ans=0.2 2023-06-18 20:01:18,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=291630.0, ans=0.0 2023-06-18 20:01:34,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=291630.0, ans=0.0 2023-06-18 20:01:44,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=291690.0, ans=0.2 2023-06-18 20:01:46,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-18 20:02:34,134 INFO [train.py:996] (3/4) Epoch 2, batch 18150, loss[loss=0.3344, simple_loss=0.3797, pruned_loss=0.1446, over 21633.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3615, pruned_loss=0.1205, over 4279373.05 frames. ], batch size: 415, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:02:38,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.253e+02 3.964e+02 4.730e+02 7.645e+02, threshold=7.929e+02, percent-clipped=0.0 2023-06-18 20:02:54,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-18 20:03:59,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=292110.0, ans=0.125 2023-06-18 20:04:09,554 INFO [train.py:996] (3/4) Epoch 2, batch 18200, loss[loss=0.2351, simple_loss=0.3014, pruned_loss=0.08439, over 21827.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3564, pruned_loss=0.1198, over 4270933.50 frames. ], batch size: 102, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:04:38,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=292230.0, ans=0.125 2023-06-18 20:04:39,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-18 20:05:09,289 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:05:09,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=292350.0, ans=0.2 2023-06-18 20:05:27,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=292410.0, ans=0.0 2023-06-18 20:05:39,481 INFO [train.py:996] (3/4) Epoch 2, batch 18250, loss[loss=0.2881, simple_loss=0.3146, pruned_loss=0.1308, over 20821.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3491, pruned_loss=0.1165, over 4277329.28 frames. 
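
At batch 18000 above, the trainer pauses for a validation pass (validation loss 0.2951 over a fixed 1,796,401-frame dev set) and reports peak GPU memory alongside it. A sketch of that step: the loop structure and the forward_loss helper are assumptions, while the peak-memory call is the standard torch API:

    import torch

    # Periodic validation as in the batch-18000 entries: frame-weighted
    # average loss over the dev set, plus peak allocated memory.
    def compute_validation_loss(model, dev_loader, device) -> float:
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = model.forward_loss(batch)  # hypothetical helper
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4f}; "
              f"maximum memory allocated so far is {max_mem_mb}MB")
        return tot_loss / tot_frames
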
], batch size: 609, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:05:44,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.938e+02 3.797e+02 4.811e+02 7.621e+02, threshold=7.594e+02, percent-clipped=0.0 2023-06-18 20:06:01,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=292530.0, ans=15.0 2023-06-18 20:06:47,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-18 20:06:58,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=292710.0, ans=0.125 2023-06-18 20:07:15,031 INFO [train.py:996] (3/4) Epoch 2, batch 18300, loss[loss=0.3235, simple_loss=0.4173, pruned_loss=0.1149, over 21791.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3482, pruned_loss=0.1154, over 4274072.13 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:07:30,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=292770.0, ans=0.0 2023-06-18 20:07:30,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=292770.0, ans=0.125 2023-06-18 20:07:32,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292770.0, ans=0.1 2023-06-18 20:08:12,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=292890.0, ans=0.0 2023-06-18 20:08:40,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=293010.0, ans=0.0 2023-06-18 20:08:50,534 INFO [train.py:996] (3/4) Epoch 2, batch 18350, loss[loss=0.2628, simple_loss=0.3151, pruned_loss=0.1053, over 21335.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.354, pruned_loss=0.1169, over 4271251.65 frames. ], batch size: 144, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:55,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.344e+02 4.081e+02 5.632e+02 1.157e+03, threshold=8.162e+02, percent-clipped=13.0 2023-06-18 20:09:14,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-18 20:09:44,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.29 vs. limit=6.0 2023-06-18 20:10:12,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=293310.0, ans=10.0 2023-06-18 20:10:24,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=293310.0, ans=0.0 2023-06-18 20:10:27,220 INFO [train.py:996] (3/4) Epoch 2, batch 18400, loss[loss=0.2323, simple_loss=0.2893, pruned_loss=0.08767, over 21340.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3493, pruned_loss=0.1145, over 4266087.50 frames. 
], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:10:30,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=293370.0, ans=0.125 2023-06-18 20:11:17,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=293490.0, ans=0.125 2023-06-18 20:12:08,210 INFO [train.py:996] (3/4) Epoch 2, batch 18450, loss[loss=0.2382, simple_loss=0.2904, pruned_loss=0.09303, over 21831.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3451, pruned_loss=0.11, over 4265893.03 frames. ], batch size: 118, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:12:12,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.811e+02 3.741e+02 4.962e+02 8.715e+02, threshold=7.483e+02, percent-clipped=1.0 2023-06-18 20:12:52,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=293790.0, ans=0.125 2023-06-18 20:13:33,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=293910.0, ans=0.0 2023-06-18 20:13:45,330 INFO [train.py:996] (3/4) Epoch 2, batch 18500, loss[loss=0.2604, simple_loss=0.3388, pruned_loss=0.09101, over 21568.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.34, pruned_loss=0.1094, over 4268120.70 frames. ], batch size: 389, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:13:57,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=293970.0, ans=0.2 2023-06-18 20:14:44,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.12 vs. limit=22.5 2023-06-18 20:15:12,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=294210.0, ans=0.2 2023-06-18 20:15:17,128 INFO [train.py:996] (3/4) Epoch 2, batch 18550, loss[loss=0.2456, simple_loss=0.2868, pruned_loss=0.1022, over 21353.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3367, pruned_loss=0.1083, over 4258023.80 frames. ], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:15:26,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.027e+02 3.740e+02 5.224e+02 1.354e+03, threshold=7.479e+02, percent-clipped=4.0 2023-06-18 20:15:44,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=294330.0, ans=0.0 2023-06-18 20:16:58,769 INFO [train.py:996] (3/4) Epoch 2, batch 18600, loss[loss=0.2743, simple_loss=0.3377, pruned_loss=0.1054, over 21648.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3354, pruned_loss=0.1097, over 4258975.01 frames. ], batch size: 247, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:18:03,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-18 20:18:09,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=294750.0, ans=0.2 2023-06-18 20:18:35,034 INFO [train.py:996] (3/4) Epoch 2, batch 18650, loss[loss=0.2698, simple_loss=0.3194, pruned_loss=0.1101, over 21590.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.335, pruned_loss=0.1107, over 4257585.22 frames. 
], batch size: 415, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:18:40,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.101e+02 3.636e+02 4.483e+02 8.727e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-18 20:19:13,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=294990.0, ans=0.2 2023-06-18 20:20:06,182 INFO [train.py:996] (3/4) Epoch 2, batch 18700, loss[loss=0.3295, simple_loss=0.3636, pruned_loss=0.1477, over 21564.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3349, pruned_loss=0.1133, over 4259793.80 frames. ], batch size: 471, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:20:15,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-18 20:21:08,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=295290.0, ans=0.2 2023-06-18 20:21:25,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=295410.0, ans=0.95 2023-06-18 20:21:31,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=295410.0, ans=0.2 2023-06-18 20:21:41,679 INFO [train.py:996] (3/4) Epoch 2, batch 18750, loss[loss=0.3469, simple_loss=0.4032, pruned_loss=0.1453, over 21651.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3374, pruned_loss=0.1158, over 4267194.34 frames. ], batch size: 389, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:52,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.155e+02 3.822e+02 4.527e+02 9.030e+02, threshold=7.645e+02, percent-clipped=1.0 2023-06-18 20:22:17,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=295530.0, ans=0.125 2023-06-18 20:22:29,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=295590.0, ans=0.125 2023-06-18 20:22:48,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=295650.0, ans=0.125 2023-06-18 20:22:49,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=295650.0, ans=0.125 2023-06-18 20:22:54,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=295650.0, ans=0.125 2023-06-18 20:23:17,270 INFO [train.py:996] (3/4) Epoch 2, batch 18800, loss[loss=0.2738, simple_loss=0.356, pruned_loss=0.09581, over 21693.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.341, pruned_loss=0.1144, over 4267872.15 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:23:17,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=295770.0, ans=0.0 2023-06-18 20:23:17,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=295770.0, ans=0.125 2023-06-18 20:24:00,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=15.0 2023-06-18 20:24:21,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-18 20:24:57,825 INFO [train.py:996] (3/4) Epoch 2, batch 18850, loss[loss=0.2087, simple_loss=0.2803, pruned_loss=0.06858, over 21619.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3355, pruned_loss=0.1076, over 4266475.44 frames. ], batch size: 247, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:25:03,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.896e+02 3.612e+02 4.924e+02 1.009e+03, threshold=7.223e+02, percent-clipped=2.0 2023-06-18 20:25:27,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296190.0, ans=0.1 2023-06-18 20:26:30,498 INFO [train.py:996] (3/4) Epoch 2, batch 18900, loss[loss=0.2486, simple_loss=0.3083, pruned_loss=0.09442, over 20801.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3362, pruned_loss=0.1107, over 4258297.27 frames. ], batch size: 609, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:26:30,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=296370.0, ans=0.0 2023-06-18 20:27:21,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=296490.0, ans=0.125 2023-06-18 20:27:26,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=296490.0, ans=0.0 2023-06-18 20:28:07,307 INFO [train.py:996] (3/4) Epoch 2, batch 18950, loss[loss=0.3062, simple_loss=0.3868, pruned_loss=0.1128, over 21432.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3409, pruned_loss=0.1155, over 4261959.27 frames. ], batch size: 212, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:28:11,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=296670.0, ans=0.0 2023-06-18 20:28:18,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.002e+02 3.745e+02 4.553e+02 7.623e+02, threshold=7.489e+02, percent-clipped=1.0 2023-06-18 20:29:14,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=296850.0, ans=0.125 2023-06-18 20:29:34,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296910.0, ans=0.1 2023-06-18 20:29:44,719 INFO [train.py:996] (3/4) Epoch 2, batch 19000, loss[loss=0.2874, simple_loss=0.365, pruned_loss=0.1048, over 21410.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3488, pruned_loss=0.1167, over 4254384.63 frames. 
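
Each train.py:996 line carries two figures: the current batch's loss over its own frames, and tot_loss over several million frames (~4.25M at batch 19000 above), a frame-weighted average accumulated across recent batches, which also explains the fractional frame counts. A sketch of that accounting, with the reset policy left as an assumption:

    # Frame-weighted running average in the spirit of the tot_loss fields:
    # each batch contributes loss * frames; tot_loss is the running ratio.
    class LossTracker:
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, num_frames: float) -> None:
            self.loss_sum += batch_loss * num_frames
            self.frames += num_frames

        def tot_loss(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tracker = LossTracker()
    tracker.update(0.2874, 21410.0)  # the batch-19000 loss entry above
    print(f"tot_loss over {tracker.frames:.2f} frames: {tracker.tot_loss():.4f}")
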
], batch size: 131, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:29:54,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=296970.0, ans=0.0 2023-06-18 20:30:03,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=297030.0, ans=0.125 2023-06-18 20:30:05,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=297030.0, ans=0.07 2023-06-18 20:30:26,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=297030.0, ans=0.125 2023-06-18 20:30:29,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=297090.0, ans=0.2 2023-06-18 20:30:47,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-18 20:30:51,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=297150.0, ans=0.125 2023-06-18 20:30:52,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=297150.0, ans=0.04949747468305833 2023-06-18 20:30:59,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-18 20:31:28,469 INFO [train.py:996] (3/4) Epoch 2, batch 19050, loss[loss=0.333, simple_loss=0.3759, pruned_loss=0.145, over 21485.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.356, pruned_loss=0.1228, over 4262195.78 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:31:34,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.128e+02 4.146e+02 5.526e+02 1.033e+03, threshold=8.291e+02, percent-clipped=8.0 2023-06-18 20:32:18,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-18 20:32:45,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=297510.0, ans=0.0 2023-06-18 20:33:05,244 INFO [train.py:996] (3/4) Epoch 2, batch 19100, loss[loss=0.2767, simple_loss=0.3291, pruned_loss=0.1122, over 21613.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3542, pruned_loss=0.1238, over 4265992.83 frames. 
], batch size: 332, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:33:23,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=297630.0, ans=0.125 2023-06-18 20:33:40,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=297630.0, ans=0.02 2023-06-18 20:33:42,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=297630.0, ans=0.125 2023-06-18 20:33:55,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297690.0, ans=0.1 2023-06-18 20:34:03,341 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:34:21,217 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:34:31,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=297810.0, ans=0.125 2023-06-18 20:34:44,309 INFO [train.py:996] (3/4) Epoch 2, batch 19150, loss[loss=0.3536, simple_loss=0.45, pruned_loss=0.1286, over 21233.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3583, pruned_loss=0.1246, over 4258603.74 frames. ], batch size: 549, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:34:44,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297870.0, ans=0.1 2023-06-18 20:34:51,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.456e+02 4.497e+02 5.703e+02 1.044e+03, threshold=8.993e+02, percent-clipped=4.0 2023-06-18 20:35:56,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298050.0, ans=0.1 2023-06-18 20:36:09,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=298110.0, ans=0.125 2023-06-18 20:36:27,162 INFO [train.py:996] (3/4) Epoch 2, batch 19200, loss[loss=0.2957, simple_loss=0.3796, pruned_loss=0.106, over 21614.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3717, pruned_loss=0.127, over 4261791.43 frames. ], batch size: 230, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:36:56,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=298230.0, ans=0.125 2023-06-18 20:36:57,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=298230.0, ans=0.125 2023-06-18 20:37:03,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-18 20:37:20,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=298290.0, ans=0.04949747468305833 2023-06-18 20:37:58,665 INFO [train.py:996] (3/4) Epoch 2, batch 19250, loss[loss=0.3187, simple_loss=0.395, pruned_loss=0.1212, over 19943.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3663, pruned_loss=0.1182, over 4259180.48 frames. 
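
The scaling.py:1052 WithLoss lines report an auxiliary loss attached directly to the attention-weight tensors, with its running sum (0.000e+00 here, i.e. nothing is currently being penalized). One way to implement such a tensor-level penalty without changing the module's forward value is a custom autograd function; the penalty form below is an assumption for illustration only:

    import torch

    # Assumed mechanism behind the "WithLoss ... loss-sum=" entries: identity
    # in forward, but add the gradient of a penalty on out-of-range values in
    # backward, logging the penalty's scalar sum.
    class WithAuxLoss(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, limit):
            ctx.save_for_backward(x)
            ctx.limit = limit
            return x.clone()

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            excess = (x.abs() - ctx.limit).clamp(min=0.0)  # 0 unless |x| > limit
            print(f"loss-sum={0.5 * (excess ** 2).sum().item():.3e}")
            return grad_out + excess * x.sign(), None

    attn = torch.randn(4, 16, requires_grad=True)
    # |x| almost never exceeds 5 for randn, so this prints loss-sum=0.000e+00
    WithAuxLoss.apply(attn, 5.0).sum().backward()
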
], batch size: 702, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:38:09,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.933e+02 3.534e+02 4.386e+02 8.060e+02, threshold=7.069e+02, percent-clipped=0.0 2023-06-18 20:39:02,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=298650.0, ans=0.0 2023-06-18 20:39:30,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=298710.0, ans=0.2 2023-06-18 20:39:34,512 INFO [train.py:996] (3/4) Epoch 2, batch 19300, loss[loss=0.2387, simple_loss=0.2902, pruned_loss=0.09363, over 16337.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3619, pruned_loss=0.1171, over 4262237.61 frames. ], batch size: 60, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:39:34,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=298770.0, ans=0.2 2023-06-18 20:39:43,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-06-18 20:39:57,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=298770.0, ans=0.5 2023-06-18 20:40:00,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=298770.0, ans=0.09899494936611666 2023-06-18 20:40:15,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=298830.0, ans=0.2 2023-06-18 20:40:32,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-18 20:40:45,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=298950.0, ans=0.025 2023-06-18 20:40:57,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=299010.0, ans=0.05 2023-06-18 20:41:14,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=299010.0, ans=0.1 2023-06-18 20:41:17,372 INFO [train.py:996] (3/4) Epoch 2, batch 19350, loss[loss=0.2312, simple_loss=0.3037, pruned_loss=0.0794, over 21583.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3536, pruned_loss=0.1115, over 4271777.51 frames. ], batch size: 230, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:41:28,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.011e+02 3.707e+02 4.151e+02 9.500e+02, threshold=7.414e+02, percent-clipped=4.0 2023-06-18 20:41:35,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299070.0, ans=0.1 2023-06-18 20:41:39,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-18 20:41:40,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299130.0, ans=0.1 2023-06-18 20:41:50,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=12.0 2023-06-18 20:42:33,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299310.0, ans=0.1 2023-06-18 20:42:46,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=299310.0, ans=0.0 2023-06-18 20:42:54,482 INFO [train.py:996] (3/4) Epoch 2, batch 19400, loss[loss=0.2805, simple_loss=0.3323, pruned_loss=0.1144, over 21264.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3519, pruned_loss=0.1105, over 4261730.80 frames. ], batch size: 143, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:43:08,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=299430.0, ans=0.125 2023-06-18 20:43:22,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=299430.0, ans=0.2 2023-06-18 20:43:50,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=299550.0, ans=0.0 2023-06-18 20:44:25,373 INFO [train.py:996] (3/4) Epoch 2, batch 19450, loss[loss=0.2968, simple_loss=0.3419, pruned_loss=0.1259, over 21974.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3496, pruned_loss=0.1129, over 4275898.54 frames. ], batch size: 113, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:44:36,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.184e+02 3.692e+02 4.694e+02 9.525e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 20:44:40,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=299670.0, ans=0.0 2023-06-18 20:44:53,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=299730.0, ans=0.0 2023-06-18 20:45:25,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-18 20:46:07,386 INFO [train.py:996] (3/4) Epoch 2, batch 19500, loss[loss=0.3623, simple_loss=0.4102, pruned_loss=0.1572, over 21507.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3469, pruned_loss=0.1159, over 4267404.82 frames. ], batch size: 509, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:46:19,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-18 20:46:25,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.74 vs. limit=15.0 2023-06-18 20:47:07,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-18 20:47:45,719 INFO [train.py:996] (3/4) Epoch 2, batch 19550, loss[loss=0.242, simple_loss=0.3335, pruned_loss=0.07528, over 21661.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3428, pruned_loss=0.1137, over 4264995.44 frames. 
], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:47:51,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.032e+02 3.762e+02 4.813e+02 9.306e+02, threshold=7.523e+02, percent-clipped=3.0 2023-06-18 20:47:59,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=300330.0, ans=0.125 2023-06-18 20:48:44,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300450.0, ans=0.1 2023-06-18 20:48:56,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=300510.0, ans=0.0 2023-06-18 20:49:17,046 INFO [train.py:996] (3/4) Epoch 2, batch 19600, loss[loss=0.3106, simple_loss=0.3482, pruned_loss=0.1365, over 21570.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3443, pruned_loss=0.1152, over 4269202.35 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:49:33,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.34 vs. limit=22.5 2023-06-18 20:49:37,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=300630.0, ans=0.0 2023-06-18 20:49:37,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=300630.0, ans=0.2 2023-06-18 20:49:58,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-06-18 20:50:47,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=300810.0, ans=0.125 2023-06-18 20:50:49,881 INFO [train.py:996] (3/4) Epoch 2, batch 19650, loss[loss=0.2939, simple_loss=0.3435, pruned_loss=0.1222, over 21613.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3515, pruned_loss=0.1211, over 4275460.13 frames. ], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:50:55,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.349e+02 4.086e+02 5.431e+02 7.953e+02, threshold=8.171e+02, percent-clipped=1.0 2023-06-18 20:52:09,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=301110.0, ans=0.125 2023-06-18 20:52:24,561 INFO [train.py:996] (3/4) Epoch 2, batch 19700, loss[loss=0.265, simple_loss=0.3644, pruned_loss=0.08283, over 20761.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3546, pruned_loss=0.1216, over 4265431.18 frames. ], batch size: 608, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:52:26,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=301170.0, ans=0.125 2023-06-18 20:52:27,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=301170.0, ans=0.125 2023-06-18 20:52:28,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=22.5 2023-06-18 20:52:32,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.11 vs. 
limit=22.5 2023-06-18 20:52:47,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=301230.0, ans=0.125 2023-06-18 20:53:50,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-18 20:54:03,048 INFO [train.py:996] (3/4) Epoch 2, batch 19750, loss[loss=0.3434, simple_loss=0.4193, pruned_loss=0.1338, over 21773.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.368, pruned_loss=0.1247, over 4268044.84 frames. ], batch size: 298, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:54:09,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.167e+02 3.934e+02 5.557e+02 1.096e+03, threshold=7.868e+02, percent-clipped=5.0 2023-06-18 20:54:55,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=301590.0, ans=0.125 2023-06-18 20:55:06,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=301590.0, ans=0.04949747468305833 2023-06-18 20:55:24,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=301710.0, ans=0.0 2023-06-18 20:55:32,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301710.0, ans=0.1 2023-06-18 20:55:40,441 INFO [train.py:996] (3/4) Epoch 2, batch 19800, loss[loss=0.2092, simple_loss=0.2559, pruned_loss=0.08121, over 21793.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3681, pruned_loss=0.1257, over 4270470.17 frames. ], batch size: 102, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:55:58,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=301770.0, ans=0.0 2023-06-18 20:56:45,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301950.0, ans=0.125 2023-06-18 20:56:50,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=301950.0, ans=0.07 2023-06-18 20:56:58,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2023-06-18 20:57:22,412 INFO [train.py:996] (3/4) Epoch 2, batch 19850, loss[loss=0.2613, simple_loss=0.361, pruned_loss=0.08082, over 20859.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3555, pruned_loss=0.1172, over 4265947.93 frames. 
], batch size: 607, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:57:28,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.803e+02 3.716e+02 4.795e+02 8.783e+02, threshold=7.432e+02, percent-clipped=5.0 2023-06-18 20:58:08,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302190.0, ans=0.1 2023-06-18 20:58:25,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302250.0, ans=0.1 2023-06-18 20:58:52,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=302310.0, ans=0.125 2023-06-18 20:59:00,045 INFO [train.py:996] (3/4) Epoch 2, batch 19900, loss[loss=0.2416, simple_loss=0.3145, pruned_loss=0.08436, over 20774.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3565, pruned_loss=0.1147, over 4258675.10 frames. ], batch size: 607, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 20:59:51,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=302490.0, ans=0.07 2023-06-18 20:59:54,078 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:00:25,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=302610.0, ans=0.125 2023-06-18 21:00:32,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302610.0, ans=0.0 2023-06-18 21:00:35,671 INFO [train.py:996] (3/4) Epoch 2, batch 19950, loss[loss=0.2646, simple_loss=0.3369, pruned_loss=0.09614, over 21759.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3494, pruned_loss=0.1143, over 4257112.39 frames. ], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:00:46,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-18 21:00:46,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.768e+02 3.439e+02 5.262e+02 1.066e+03, threshold=6.877e+02, percent-clipped=5.0 2023-06-18 21:00:52,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302670.0, ans=0.1 2023-06-18 21:01:13,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302730.0, ans=0.1 2023-06-18 21:01:33,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=302790.0, ans=0.0 2023-06-18 21:01:41,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=302850.0, ans=0.125 2023-06-18 21:02:01,269 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:02:16,318 INFO [train.py:996] (3/4) Epoch 2, batch 20000, loss[loss=0.3909, simple_loss=0.424, pruned_loss=0.1789, over 21615.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3505, pruned_loss=0.115, over 4248432.47 frames. 
], batch size: 471, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:02:19,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=302970.0, ans=0.125 2023-06-18 21:02:24,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-18 21:02:34,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=302970.0, ans=0.0 2023-06-18 21:02:47,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=303030.0, ans=0.2 2023-06-18 21:02:57,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=303090.0, ans=0.0 2023-06-18 21:02:59,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=303090.0, ans=0.04949747468305833 2023-06-18 21:03:32,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=303210.0, ans=0.0 2023-06-18 21:03:46,335 INFO [train.py:996] (3/4) Epoch 2, batch 20050, loss[loss=0.3686, simple_loss=0.3941, pruned_loss=0.1716, over 21758.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3544, pruned_loss=0.1198, over 4248596.77 frames. ], batch size: 508, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:56,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.190e+02 3.752e+02 4.909e+02 8.771e+02, threshold=7.503e+02, percent-clipped=6.0 2023-06-18 21:04:27,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=303330.0, ans=0.125 2023-06-18 21:05:05,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=303450.0, ans=0.125 2023-06-18 21:05:11,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=303510.0, ans=0.0 2023-06-18 21:05:33,689 INFO [train.py:996] (3/4) Epoch 2, batch 20100, loss[loss=0.3497, simple_loss=0.418, pruned_loss=0.1407, over 20990.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3585, pruned_loss=0.1232, over 4248921.40 frames. ], batch size: 607, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:05:56,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=303630.0, ans=0.125 2023-06-18 21:06:13,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=303690.0, ans=0.125 2023-06-18 21:06:13,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303690.0, ans=0.1 2023-06-18 21:06:58,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303810.0, ans=0.1 2023-06-18 21:07:17,559 INFO [train.py:996] (3/4) Epoch 2, batch 20150, loss[loss=0.2629, simple_loss=0.3019, pruned_loss=0.1119, over 20228.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3675, pruned_loss=0.1265, over 4261605.88 frames. 
], batch size: 703, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:07:24,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.342e+02 4.169e+02 5.156e+02 8.825e+02, threshold=8.338e+02, percent-clipped=3.0 2023-06-18 21:07:39,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=303930.0, ans=0.125 2023-06-18 21:07:42,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303930.0, ans=0.1 2023-06-18 21:07:52,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=303930.0, ans=0.04949747468305833 2023-06-18 21:08:14,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=304050.0, ans=0.125 2023-06-18 21:08:29,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=304050.0, ans=0.0 2023-06-18 21:08:45,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=304110.0, ans=0.0 2023-06-18 21:08:58,613 INFO [train.py:996] (3/4) Epoch 2, batch 20200, loss[loss=0.2935, simple_loss=0.3435, pruned_loss=0.1217, over 21682.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3739, pruned_loss=0.1299, over 4268183.86 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:09:32,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=304230.0, ans=0.05 2023-06-18 21:10:35,637 INFO [train.py:996] (3/4) Epoch 2, batch 20250, loss[loss=0.3507, simple_loss=0.3927, pruned_loss=0.1543, over 21593.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3728, pruned_loss=0.1265, over 4267138.95 frames. ], batch size: 471, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:10:41,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 3.323e+02 4.182e+02 5.137e+02 1.003e+03, threshold=8.365e+02, percent-clipped=2.0 2023-06-18 21:11:32,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=304590.0, ans=0.0 2023-06-18 21:11:32,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=304590.0, ans=0.95 2023-06-18 21:11:52,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=304650.0, ans=0.125 2023-06-18 21:11:56,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-18 21:12:13,277 INFO [train.py:996] (3/4) Epoch 2, batch 20300, loss[loss=0.3103, simple_loss=0.3707, pruned_loss=0.125, over 21929.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3678, pruned_loss=0.1219, over 4258890.56 frames. ], batch size: 107, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:13:17,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. 
limit=15.0 2023-06-18 21:13:24,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=304950.0, ans=0.125 2023-06-18 21:13:32,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305010.0, ans=0.1 2023-06-18 21:13:48,915 INFO [train.py:996] (3/4) Epoch 2, batch 20350, loss[loss=0.3438, simple_loss=0.4072, pruned_loss=0.1402, over 20743.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3673, pruned_loss=0.1228, over 4256465.97 frames. ], batch size: 607, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:13:50,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-18 21:13:55,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.188e+02 3.897e+02 4.973e+02 9.485e+02, threshold=7.794e+02, percent-clipped=2.0 2023-06-18 21:13:57,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=305070.0, ans=0.125 2023-06-18 21:14:09,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=305130.0, ans=0.125 2023-06-18 21:14:15,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=305130.0, ans=0.125 2023-06-18 21:14:36,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2023-06-18 21:15:26,544 INFO [train.py:996] (3/4) Epoch 2, batch 20400, loss[loss=0.2206, simple_loss=0.294, pruned_loss=0.07356, over 16796.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3709, pruned_loss=0.1265, over 4257117.01 frames. ], batch size: 63, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:15:41,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=305370.0, ans=0.125 2023-06-18 21:15:54,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305430.0, ans=0.1 2023-06-18 21:16:25,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=305550.0, ans=10.0 2023-06-18 21:16:39,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=305550.0, ans=0.0 2023-06-18 21:16:39,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-18 21:17:02,142 INFO [train.py:996] (3/4) Epoch 2, batch 20450, loss[loss=0.2884, simple_loss=0.3455, pruned_loss=0.1156, over 21887.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3721, pruned_loss=0.13, over 4247953.64 frames. 
], batch size: 118, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:17:07,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.641e+02 3.573e+02 4.608e+02 6.565e+02 1.538e+03, threshold=9.216e+02, percent-clipped=19.0 2023-06-18 21:17:31,084 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.533e-03 2023-06-18 21:18:02,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=305850.0, ans=0.0 2023-06-18 21:18:37,620 INFO [train.py:996] (3/4) Epoch 2, batch 20500, loss[loss=0.2922, simple_loss=0.3281, pruned_loss=0.1281, over 21171.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3688, pruned_loss=0.1307, over 4255045.83 frames. ], batch size: 159, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:18:46,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305970.0, ans=0.1 2023-06-18 21:18:48,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305970.0, ans=0.1 2023-06-18 21:19:03,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=306030.0, ans=0.125 2023-06-18 21:19:16,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306090.0, ans=0.1 2023-06-18 21:19:33,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=306090.0, ans=0.0 2023-06-18 21:19:48,344 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:20:19,075 INFO [train.py:996] (3/4) Epoch 2, batch 20550, loss[loss=0.2745, simple_loss=0.3246, pruned_loss=0.1121, over 21150.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3624, pruned_loss=0.1281, over 4251551.57 frames. ], batch size: 143, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:20:25,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.461e+02 4.145e+02 5.402e+02 8.194e+02, threshold=8.291e+02, percent-clipped=0.0 2023-06-18 21:20:27,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=306270.0, ans=0.125 2023-06-18 21:20:38,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306330.0, ans=0.1 2023-06-18 21:20:39,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=306330.0, ans=0.125 2023-06-18 21:21:41,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306510.0, ans=0.125 2023-06-18 21:21:56,396 INFO [train.py:996] (3/4) Epoch 2, batch 20600, loss[loss=0.3369, simple_loss=0.3764, pruned_loss=0.1487, over 21872.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3624, pruned_loss=0.1258, over 4247814.02 frames. ], batch size: 371, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:22:01,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.20 vs. 
limit=22.5 2023-06-18 21:22:25,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.99 vs. limit=10.0 2023-06-18 21:23:32,691 INFO [train.py:996] (3/4) Epoch 2, batch 20650, loss[loss=0.2731, simple_loss=0.3277, pruned_loss=0.1092, over 17348.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3569, pruned_loss=0.1249, over 4247897.42 frames. ], batch size: 64, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:23:38,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.072e+02 3.756e+02 5.105e+02 7.352e+02, threshold=7.512e+02, percent-clipped=0.0 2023-06-18 21:24:18,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=306990.0, ans=0.015 2023-06-18 21:24:24,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306990.0, ans=0.125 2023-06-18 21:24:30,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=306990.0, ans=0.025 2023-06-18 21:24:57,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=307110.0, ans=0.125 2023-06-18 21:25:03,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-18 21:25:12,201 INFO [train.py:996] (3/4) Epoch 2, batch 20700, loss[loss=0.2251, simple_loss=0.2821, pruned_loss=0.08406, over 21374.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3482, pruned_loss=0.1196, over 4250411.09 frames. ], batch size: 131, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:25:12,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=307170.0, ans=0.0 2023-06-18 21:25:15,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307170.0, ans=0.1 2023-06-18 21:25:21,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-18 21:25:37,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=307230.0, ans=0.125 2023-06-18 21:25:38,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-18 21:25:54,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=307290.0, ans=0.125 2023-06-18 21:26:13,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-18 21:26:30,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=307350.0, ans=0.2 2023-06-18 21:26:49,440 INFO [train.py:996] (3/4) Epoch 2, batch 20750, loss[loss=0.3677, simple_loss=0.4471, pruned_loss=0.1442, over 21697.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3506, pruned_loss=0.1179, over 4258014.47 frames. 
], batch size: 414, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:26:57,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.977e+02 3.559e+02 4.590e+02 7.850e+02, threshold=7.118e+02, percent-clipped=2.0 2023-06-18 21:28:23,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307710.0, ans=0.1 2023-06-18 21:28:26,173 INFO [train.py:996] (3/4) Epoch 2, batch 20800, loss[loss=0.3855, simple_loss=0.4827, pruned_loss=0.1441, over 20765.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3549, pruned_loss=0.1188, over 4253115.76 frames. ], batch size: 607, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:28:57,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=307830.0, ans=0.125 2023-06-18 21:29:21,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=307890.0, ans=0.125 2023-06-18 21:29:34,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=307950.0, ans=0.125 2023-06-18 21:30:02,848 INFO [train.py:996] (3/4) Epoch 2, batch 20850, loss[loss=0.3095, simple_loss=0.3586, pruned_loss=0.1302, over 22003.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3496, pruned_loss=0.1175, over 4249296.06 frames. ], batch size: 113, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:30:16,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.417e+02 4.186e+02 5.469e+02 9.109e+02, threshold=8.373e+02, percent-clipped=11.0 2023-06-18 21:31:17,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=308250.0, ans=0.125 2023-06-18 21:31:36,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-18 21:31:37,913 INFO [train.py:996] (3/4) Epoch 2, batch 20900, loss[loss=0.2942, simple_loss=0.3493, pruned_loss=0.1196, over 21855.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3534, pruned_loss=0.12, over 4250391.15 frames. ], batch size: 351, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:31:38,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=308370.0, ans=0.2 2023-06-18 21:32:08,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=308430.0, ans=0.125 2023-06-18 21:32:24,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=308490.0, ans=0.5 2023-06-18 21:32:43,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=308550.0, ans=0.0 2023-06-18 21:32:44,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308550.0, ans=0.1 2023-06-18 21:33:12,545 INFO [train.py:996] (3/4) Epoch 2, batch 20950, loss[loss=0.3094, simple_loss=0.3531, pruned_loss=0.1329, over 21677.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.347, pruned_loss=0.1146, over 4245670.70 frames. 
], batch size: 414, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:33:21,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.051e+02 3.719e+02 4.723e+02 9.435e+02, threshold=7.438e+02, percent-clipped=1.0 2023-06-18 21:33:43,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=308730.0, ans=0.2 2023-06-18 21:33:43,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308730.0, ans=0.1 2023-06-18 21:33:51,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=308730.0, ans=0.0 2023-06-18 21:34:27,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=308910.0, ans=0.0 2023-06-18 21:34:48,314 INFO [train.py:996] (3/4) Epoch 2, batch 21000, loss[loss=0.275, simple_loss=0.3333, pruned_loss=0.1083, over 21874.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3464, pruned_loss=0.1157, over 4261401.90 frames. ], batch size: 124, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:34:48,314 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 21:35:04,492 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2933, simple_loss=0.3899, pruned_loss=0.09838, over 1796401.00 frames. 2023-06-18 21:35:04,493 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 21:35:05,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=308970.0, ans=0.0 2023-06-18 21:35:28,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309030.0, ans=0.1 2023-06-18 21:36:22,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=309150.0, ans=0.125 2023-06-18 21:36:34,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=309210.0, ans=0.125 2023-06-18 21:36:37,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-06-18 21:36:40,612 INFO [train.py:996] (3/4) Epoch 2, batch 21050, loss[loss=0.272, simple_loss=0.3202, pruned_loss=0.1119, over 21724.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3444, pruned_loss=0.1158, over 4263158.76 frames. ], batch size: 316, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:36:45,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=309270.0, ans=0.125 2023-06-18 21:36:53,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=309270.0, ans=0.0 2023-06-18 21:36:55,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.843e+02 3.389e+02 4.157e+02 8.301e+02, threshold=6.779e+02, percent-clipped=3.0 2023-06-18 21:37:06,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-06-18 21:37:06,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. 
limit=8.0 2023-06-18 21:37:12,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309330.0, ans=0.1 2023-06-18 21:37:44,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-18 21:38:16,151 INFO [train.py:996] (3/4) Epoch 2, batch 21100, loss[loss=0.2648, simple_loss=0.3229, pruned_loss=0.1033, over 21594.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3414, pruned_loss=0.1154, over 4260753.50 frames. ], batch size: 263, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:39:32,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309750.0, ans=0.1 2023-06-18 21:39:38,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=309810.0, ans=0.0 2023-06-18 21:39:51,665 INFO [train.py:996] (3/4) Epoch 2, batch 21150, loss[loss=0.2909, simple_loss=0.3325, pruned_loss=0.1247, over 21851.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3375, pruned_loss=0.1162, over 4264204.15 frames. ], batch size: 107, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:40:05,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.856e+02 3.188e+02 4.098e+02 8.101e+02, threshold=6.375e+02, percent-clipped=2.0 2023-06-18 21:40:55,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310050.0, ans=0.125 2023-06-18 21:41:27,065 INFO [train.py:996] (3/4) Epoch 2, batch 21200, loss[loss=0.2506, simple_loss=0.2967, pruned_loss=0.1023, over 21574.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3318, pruned_loss=0.1148, over 4264274.26 frames. ], batch size: 247, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:41:50,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-18 21:42:24,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=310290.0, ans=0.0 2023-06-18 21:43:02,665 INFO [train.py:996] (3/4) Epoch 2, batch 21250, loss[loss=0.2754, simple_loss=0.3273, pruned_loss=0.1118, over 21167.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3306, pruned_loss=0.1153, over 4272990.20 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:43:11,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 2.991e+02 3.536e+02 4.577e+02 9.525e+02, threshold=7.072e+02, percent-clipped=11.0 2023-06-18 21:43:12,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=310470.0, ans=0.125 2023-06-18 21:43:14,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310470.0, ans=0.125 2023-06-18 21:43:59,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. 
limit=6.0 2023-06-18 21:44:04,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310650.0, ans=0.125 2023-06-18 21:44:38,886 INFO [train.py:996] (3/4) Epoch 2, batch 21300, loss[loss=0.356, simple_loss=0.3924, pruned_loss=0.1598, over 21889.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3388, pruned_loss=0.1185, over 4274948.15 frames. ], batch size: 107, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:45:38,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310890.0, ans=0.125 2023-06-18 21:45:53,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=310950.0, ans=0.125 2023-06-18 21:46:00,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311010.0, ans=0.1 2023-06-18 21:46:16,017 INFO [train.py:996] (3/4) Epoch 2, batch 21350, loss[loss=0.3116, simple_loss=0.3646, pruned_loss=0.1294, over 21654.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3452, pruned_loss=0.1208, over 4289189.77 frames. ], batch size: 263, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:46:16,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=311070.0, ans=0.0 2023-06-18 21:46:25,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-18 21:46:30,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.675e+02 5.052e+02 5.900e+02 9.607e+02, threshold=1.010e+03, percent-clipped=12.0 2023-06-18 21:46:38,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=311070.0, ans=0.125 2023-06-18 21:47:12,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=311190.0, ans=10.0 2023-06-18 21:47:12,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=311190.0, ans=0.125 2023-06-18 21:47:36,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311310.0, ans=0.125 2023-06-18 21:47:53,566 INFO [train.py:996] (3/4) Epoch 2, batch 21400, loss[loss=0.3092, simple_loss=0.3703, pruned_loss=0.1241, over 21746.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3455, pruned_loss=0.1182, over 4287058.01 frames. ], batch size: 332, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:48:25,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=311430.0, ans=0.125 2023-06-18 21:48:51,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-18 21:48:54,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=311550.0, ans=0.2 2023-06-18 21:49:20,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. 
limit=10.0 2023-06-18 21:49:34,472 INFO [train.py:996] (3/4) Epoch 2, batch 21450, loss[loss=0.3254, simple_loss=0.3657, pruned_loss=0.1426, over 21871.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3482, pruned_loss=0.119, over 4280239.70 frames. ], batch size: 118, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:49:45,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=311670.0, ans=0.0 2023-06-18 21:49:47,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=311670.0, ans=0.125 2023-06-18 21:49:48,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.034e+02 3.817e+02 5.220e+02 1.129e+03, threshold=7.634e+02, percent-clipped=2.0 2023-06-18 21:50:10,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=311730.0, ans=0.125 2023-06-18 21:50:33,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=311850.0, ans=0.0 2023-06-18 21:50:59,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=311910.0, ans=0.0 2023-06-18 21:51:15,121 INFO [train.py:996] (3/4) Epoch 2, batch 21500, loss[loss=0.3337, simple_loss=0.3514, pruned_loss=0.158, over 21516.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3484, pruned_loss=0.122, over 4284945.39 frames. ], batch size: 511, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:51:20,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=311970.0, ans=0.125 2023-06-18 21:51:46,149 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:52:14,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-18 21:52:51,487 INFO [train.py:996] (3/4) Epoch 2, batch 21550, loss[loss=0.2765, simple_loss=0.315, pruned_loss=0.1189, over 21647.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3418, pruned_loss=0.1179, over 4277027.72 frames. ], batch size: 264, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:53:05,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.167e+02 4.080e+02 5.090e+02 8.174e+02, threshold=8.161e+02, percent-clipped=3.0 2023-06-18 21:53:07,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=312270.0, ans=0.125 2023-06-18 21:54:28,565 INFO [train.py:996] (3/4) Epoch 2, batch 21600, loss[loss=0.2568, simple_loss=0.3206, pruned_loss=0.09647, over 21822.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3349, pruned_loss=0.1155, over 4270091.20 frames. ], batch size: 317, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:54:46,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=312570.0, ans=0.2 2023-06-18 21:55:06,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=312690.0, ans=0.0 2023-06-18 21:55:17,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=15.0 2023-06-18 21:55:26,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=312750.0, ans=0.125 2023-06-18 21:55:36,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=312750.0, ans=0.0 2023-06-18 21:55:46,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=312810.0, ans=10.0 2023-06-18 21:56:04,246 INFO [train.py:996] (3/4) Epoch 2, batch 21650, loss[loss=0.3031, simple_loss=0.3984, pruned_loss=0.1039, over 21207.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3401, pruned_loss=0.1132, over 4266416.73 frames. ], batch size: 548, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:56:06,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=312870.0, ans=0.0 2023-06-18 21:56:18,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.957e+02 3.372e+02 4.177e+02 8.367e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-18 21:56:39,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-18 21:57:08,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-18 21:57:34,152 INFO [train.py:996] (3/4) Epoch 2, batch 21700, loss[loss=0.1939, simple_loss=0.2608, pruned_loss=0.06354, over 17223.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3389, pruned_loss=0.1101, over 4259691.89 frames. ], batch size: 68, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:57:53,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=313170.0, ans=0.0 2023-06-18 21:57:59,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=313230.0, ans=0.2 2023-06-18 21:59:03,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-18 21:59:06,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=313410.0, ans=0.125 2023-06-18 21:59:09,010 INFO [train.py:996] (3/4) Epoch 2, batch 21750, loss[loss=0.2901, simple_loss=0.3283, pruned_loss=0.1259, over 21503.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.334, pruned_loss=0.1108, over 4253431.31 frames. ], batch size: 442, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:59:20,922 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:59:28,452 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.954e+02 3.507e+02 4.548e+02 1.201e+03, threshold=7.014e+02, percent-clipped=5.0 2023-06-18 21:59:28,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313470.0, ans=0.1 2023-06-18 21:59:39,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.36 vs. 
limit=22.5 2023-06-18 21:59:40,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=313530.0, ans=0.2 2023-06-18 21:59:40,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=313530.0, ans=0.125 2023-06-18 21:59:42,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=313530.0, ans=0.125 2023-06-18 22:00:10,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-18 22:00:22,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=313650.0, ans=0.125 2023-06-18 22:00:33,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=313710.0, ans=0.125 2023-06-18 22:00:50,248 INFO [train.py:996] (3/4) Epoch 2, batch 21800, loss[loss=0.243, simple_loss=0.2967, pruned_loss=0.09462, over 21818.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3338, pruned_loss=0.1128, over 4263161.90 frames. ], batch size: 107, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:00:55,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=313770.0, ans=0.125 2023-06-18 22:01:23,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313830.0, ans=0.1 2023-06-18 22:01:26,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=313890.0, ans=0.125 2023-06-18 22:02:26,276 INFO [train.py:996] (3/4) Epoch 2, batch 21850, loss[loss=0.3077, simple_loss=0.3555, pruned_loss=0.13, over 21797.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3409, pruned_loss=0.1135, over 4256205.57 frames. ], batch size: 247, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:02:40,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.114e+02 3.894e+02 4.746e+02 8.265e+02, threshold=7.787e+02, percent-clipped=3.0 2023-06-18 22:02:59,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=314130.0, ans=0.125 2023-06-18 22:03:01,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-18 22:03:45,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=314310.0, ans=0.125 2023-06-18 22:04:06,607 INFO [train.py:996] (3/4) Epoch 2, batch 21900, loss[loss=0.309, simple_loss=0.3568, pruned_loss=0.1306, over 21619.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.341, pruned_loss=0.114, over 4267708.82 frames. 
], batch size: 471, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:04:44,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=314490.0, ans=0.125 2023-06-18 22:04:48,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=314490.0, ans=10.0 2023-06-18 22:04:51,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=314490.0, ans=0.025 2023-06-18 22:05:00,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=314550.0, ans=0.2 2023-06-18 22:05:03,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314550.0, ans=0.1 2023-06-18 22:05:36,725 INFO [train.py:996] (3/4) Epoch 2, batch 21950, loss[loss=0.2034, simple_loss=0.2877, pruned_loss=0.0596, over 21761.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.336, pruned_loss=0.1126, over 4266467.20 frames. ], batch size: 352, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:05:50,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.916e+02 3.382e+02 4.376e+02 8.385e+02, threshold=6.764e+02, percent-clipped=2.0 2023-06-18 22:06:16,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-18 22:06:55,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=314910.0, ans=0.0 2023-06-18 22:07:14,882 INFO [train.py:996] (3/4) Epoch 2, batch 22000, loss[loss=0.2642, simple_loss=0.3053, pruned_loss=0.1115, over 21253.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3286, pruned_loss=0.109, over 4263960.00 frames. ], batch size: 144, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:07:59,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=315090.0, ans=0.125 2023-06-18 22:08:25,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=315150.0, ans=0.125 2023-06-18 22:09:01,060 INFO [train.py:996] (3/4) Epoch 2, batch 22050, loss[loss=0.3873, simple_loss=0.4389, pruned_loss=0.1678, over 21890.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3355, pruned_loss=0.1117, over 4257872.74 frames. ], batch size: 372, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:09:03,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=315270.0, ans=0.125 2023-06-18 22:09:10,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 3.309e+02 5.007e+02 6.581e+02 1.076e+03, threshold=1.001e+03, percent-clipped=24.0 2023-06-18 22:09:59,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=315450.0, ans=0.0 2023-06-18 22:10:39,896 INFO [train.py:996] (3/4) Epoch 2, batch 22100, loss[loss=0.3063, simple_loss=0.3597, pruned_loss=0.1265, over 21833.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3486, pruned_loss=0.1188, over 4251911.81 frames. 
], batch size: 332, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:10:48,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315570.0, ans=0.1 2023-06-18 22:12:17,293 INFO [train.py:996] (3/4) Epoch 2, batch 22150, loss[loss=0.366, simple_loss=0.4171, pruned_loss=0.1574, over 19930.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.353, pruned_loss=0.1213, over 4260850.61 frames. ], batch size: 702, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:12:26,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.397e+02 4.124e+02 4.864e+02 1.101e+03, threshold=8.247e+02, percent-clipped=1.0 2023-06-18 22:12:26,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=315870.0, ans=0.125 2023-06-18 22:12:37,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=315930.0, ans=0.2 2023-06-18 22:12:53,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-18 22:13:37,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-18 22:13:43,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=316110.0, ans=0.0 2023-06-18 22:13:52,411 INFO [train.py:996] (3/4) Epoch 2, batch 22200, loss[loss=0.3149, simple_loss=0.391, pruned_loss=0.1194, over 21896.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3526, pruned_loss=0.1207, over 4272617.75 frames. ], batch size: 316, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:14:16,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=316230.0, ans=0.125 2023-06-18 22:14:16,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-18 22:14:45,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-18 22:14:51,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 22:15:33,135 INFO [train.py:996] (3/4) Epoch 2, batch 22250, loss[loss=0.3583, simple_loss=0.4034, pruned_loss=0.1566, over 21505.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3618, pruned_loss=0.1237, over 4278347.09 frames. ], batch size: 194, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:15:40,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=316470.0, ans=0.125 2023-06-18 22:15:43,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.999e+02 3.846e+02 4.976e+02 1.173e+03, threshold=7.692e+02, percent-clipped=5.0 2023-06-18 22:15:46,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. 
limit=15.0 2023-06-18 22:15:55,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-18 22:16:16,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=316590.0, ans=0.125 2023-06-18 22:16:36,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=316650.0, ans=0.125 2023-06-18 22:16:55,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-18 22:17:08,373 INFO [train.py:996] (3/4) Epoch 2, batch 22300, loss[loss=0.2746, simple_loss=0.3245, pruned_loss=0.1124, over 21617.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3645, pruned_loss=0.1265, over 4279617.23 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:17:30,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=316830.0, ans=0.04949747468305833 2023-06-18 22:17:43,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316890.0, ans=0.1 2023-06-18 22:17:44,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=316890.0, ans=0.2 2023-06-18 22:18:02,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=316950.0, ans=0.125 2023-06-18 22:18:04,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=316950.0, ans=0.0 2023-06-18 22:18:25,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=317010.0, ans=0.2 2023-06-18 22:18:42,789 INFO [train.py:996] (3/4) Epoch 2, batch 22350, loss[loss=0.2935, simple_loss=0.3425, pruned_loss=0.1222, over 21630.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3627, pruned_loss=0.1274, over 4282868.60 frames. ], batch size: 230, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:18:45,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-18 22:18:53,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.080e+02 3.447e+02 4.334e+02 7.080e+02, threshold=6.895e+02, percent-clipped=0.0 2023-06-18 22:18:59,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=317130.0, ans=0.04949747468305833 2023-06-18 22:19:00,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=317130.0, ans=0.125 2023-06-18 22:19:19,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. 
limit=6.0 2023-06-18 22:19:23,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=317190.0, ans=0.125 2023-06-18 22:19:34,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=317250.0, ans=15.0 2023-06-18 22:20:10,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=317310.0, ans=0.0 2023-06-18 22:20:19,399 INFO [train.py:996] (3/4) Epoch 2, batch 22400, loss[loss=0.2717, simple_loss=0.3281, pruned_loss=0.1076, over 21190.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3581, pruned_loss=0.1232, over 4277554.82 frames. ], batch size: 548, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:20:20,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.49 vs. limit=22.5 2023-06-18 22:20:23,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=317370.0, ans=12.0 2023-06-18 22:20:48,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=317430.0, ans=0.0 2023-06-18 22:21:24,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=12.0 2023-06-18 22:21:36,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=317610.0, ans=0.0 2023-06-18 22:21:48,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=317670.0, ans=0.125 2023-06-18 22:21:49,906 INFO [train.py:996] (3/4) Epoch 2, batch 22450, loss[loss=0.2979, simple_loss=0.3361, pruned_loss=0.1299, over 21602.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3512, pruned_loss=0.1216, over 4282293.89 frames. ], batch size: 332, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:21:51,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=317670.0, ans=0.125 2023-06-18 22:22:00,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.085e+02 3.597e+02 4.516e+02 1.181e+03, threshold=7.194e+02, percent-clipped=2.0 2023-06-18 22:22:24,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=317790.0, ans=0.0 2023-06-18 22:22:32,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-18 22:23:27,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=317970.0, ans=0.125 2023-06-18 22:23:28,619 INFO [train.py:996] (3/4) Epoch 2, batch 22500, loss[loss=0.2778, simple_loss=0.3128, pruned_loss=0.1214, over 21465.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3463, pruned_loss=0.1205, over 4283240.28 frames. 
2023-06-18 22:23:28,619 INFO [train.py:996] (3/4) Epoch 2, batch 22500, loss[loss=0.2778, simple_loss=0.3128, pruned_loss=0.1214, over 21465.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3463, pruned_loss=0.1205, over 4283240.28 frames. ], batch size: 195, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:23:30,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=317970.0, ans=0.2 2023-06-18 22:23:30,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=317970.0, ans=0.04949747468305833 2023-06-18 22:24:01,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=318030.0, ans=0.0 2023-06-18 22:25:02,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-18 22:25:10,116 INFO [train.py:996] (3/4) Epoch 2, batch 22550, loss[loss=0.3283, simple_loss=0.3755, pruned_loss=0.1406, over 21904.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.35, pruned_loss=0.1204, over 4290842.08 frames. ], batch size: 414, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:25:26,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.552e+02 4.294e+02 6.006e+02 1.237e+03, threshold=8.588e+02, percent-clipped=14.0 2023-06-18 22:26:51,924 INFO [train.py:996] (3/4) Epoch 2, batch 22600, loss[loss=0.2777, simple_loss=0.3259, pruned_loss=0.1147, over 21599.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3518, pruned_loss=0.1203, over 4291522.93 frames. ], batch size: 230, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:27:22,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=318630.0, ans=0.125 2023-06-18 22:27:43,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=318690.0, ans=0.125 2023-06-18 22:27:46,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=318690.0, ans=0.0 2023-06-18 22:28:17,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=318810.0, ans=0.2 2023-06-18 22:28:29,428 INFO [train.py:996] (3/4) Epoch 2, batch 22650, loss[loss=0.267, simple_loss=0.3187, pruned_loss=0.1076, over 21971.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3476, pruned_loss=0.1194, over 4282768.37 frames. ], batch size: 103, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:28:40,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=15.0 2023-06-18 22:28:41,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 3.125e+02 3.820e+02 4.472e+02 8.562e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-18 22:28:54,899 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:29:14,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=318990.0, ans=0.0 2023-06-18 22:29:20,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318990.0, ans=0.1
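Note: the Whitening lines from scaling.py:962 come from modules that nudge activations toward a "white" (isotropic) covariance. As far as can be reconstructed from scaling.py, the reported metric is the mean squared eigenvalue of the per-group feature covariance divided by the squared mean eigenvalue, which equals 1.0 for perfectly white features and grows as variance concentrates in a few directions; a corrective gradient kicks in once it exceeds the (scheduled) limit. A sketch under that reading, with the exact normalization treated as an assumption:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        # x: (num_frames, num_channels); channels are split into groups
        (num_frames, num_channels) = x.shape
        c = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, c).transpose(0, 1)
        covar = torch.matmul(x.transpose(1, 2), x)        # (groups, c, c)
        # sum of eigenvalues = trace(C); sum of their squares = trace(C^2)
        trace_c = covar.diagonal(dim1=1, dim2=2).sum(-1)
        trace_c2 = (covar * covar.transpose(1, 2)).sum(dim=(1, 2))
        # E[lambda^2] / (E[lambda])^2, averaged over groups; 1.0 when white
        return (trace_c2 * c / (trace_c ** 2 + 1e-20)).mean()

    x = torch.randn(400, 256)            # roughly white features
    print(whitening_metric(x).item())    # a little above 1.0; anisotropy raises it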
2023-06-18 22:29:22,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-18 22:29:32,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=319050.0, ans=0.0 2023-06-18 22:29:42,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=319050.0, ans=0.125 2023-06-18 22:29:44,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=319050.0, ans=0.0 2023-06-18 22:30:07,834 INFO [train.py:996] (3/4) Epoch 2, batch 22700, loss[loss=0.2686, simple_loss=0.316, pruned_loss=0.1106, over 20029.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3407, pruned_loss=0.1189, over 4281693.07 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:30:25,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=319230.0, ans=0.05 2023-06-18 22:31:05,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=319290.0, ans=0.125 2023-06-18 22:31:18,022 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:31:42,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=319410.0, ans=0.0 2023-06-18 22:31:46,685 INFO [train.py:996] (3/4) Epoch 2, batch 22750, loss[loss=0.3696, simple_loss=0.4047, pruned_loss=0.1672, over 21728.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3445, pruned_loss=0.1222, over 4279356.88 frames. ], batch size: 298, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:31:54,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=319470.0, ans=0.125 2023-06-18 22:31:59,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.020e+02 3.645e+02 4.350e+02 9.693e+02, threshold=7.290e+02, percent-clipped=3.0 2023-06-18 22:32:55,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=319650.0, ans=0.09899494936611666 2023-06-18 22:33:24,561 INFO [train.py:996] (3/4) Epoch 2, batch 22800, loss[loss=0.3131, simple_loss=0.3532, pruned_loss=0.1365, over 21580.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3498, pruned_loss=0.1254, over 4286260.38 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:33:40,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319770.0, ans=0.1 2023-06-18 22:34:12,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=319890.0, ans=0.125 2023-06-18 22:35:01,227 INFO [train.py:996] (3/4) Epoch 2, batch 22850, loss[loss=0.4126, simple_loss=0.4195, pruned_loss=0.2029, over 21384.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3468, pruned_loss=0.1246, over 4289570.74 frames. ], batch size: 508, lr: 1.53e-02, grad_scale: 32.0
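Note: in the train.py:996 lines, loss[...] is the current batch and tot_loss[...] is a decayed running aggregate. The nearly constant "over ~4.28e6 frames." is what falls out if the tracked sums decay by a factor of (1 - 1/reset_interval) per batch: the frame count then settles near batch_frames * reset_interval, and 4.28e6 divided by the ~21500 frames per batch seen here points at a reset_interval around 200. A sketch under that inferred rule:

    class TotLossTracker:
        """Decayed running sums behind the tot_loss[...] log fields."""

        def __init__(self, reset_interval=200):    # inferred from the log
            self.decay = 1.0 - 1.0 / reset_interval
            self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
            self.frames = 0.0

        def update(self, batch_sums, batch_frames):
            for k in self.sums:
                self.sums[k] = self.sums[k] * self.decay + batch_sums[k]
            self.frames = self.frames * self.decay + batch_frames

        def render(self):
            avg = {k: v / self.frames for (k, v) in self.sums.items()}
            return ("tot_loss[loss=%.4g, simple_loss=%.4g, pruned_loss=%.4g, "
                    "over %.2f frames. ]"
                    % (avg["loss"], avg["simple_loss"], avg["pruned_loss"],
                       self.frames))

At steady state self.frames converges to batch_frames / (1 - decay), which is why the printed frame count hovers near 4.28e6 while each individual batch covers only about 21500 frames.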
2023-06-18 22:35:05,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-18 22:35:18,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.149e+02 3.783e+02 4.416e+02 9.029e+02, threshold=7.565e+02, percent-clipped=2.0 2023-06-18 22:35:48,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=320190.0, ans=0.2 2023-06-18 22:35:50,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=320190.0, ans=0.2 2023-06-18 22:35:53,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=320190.0, ans=0.0 2023-06-18 22:36:20,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=320250.0, ans=0.0 2023-06-18 22:36:23,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-18 22:36:26,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320310.0, ans=0.125 2023-06-18 22:36:27,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=320310.0, ans=0.2 2023-06-18 22:36:38,192 INFO [train.py:996] (3/4) Epoch 2, batch 22900, loss[loss=0.3457, simple_loss=0.4333, pruned_loss=0.129, over 21613.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3488, pruned_loss=0.1235, over 4272556.05 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:36:46,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=320370.0, ans=0.125 2023-06-18 22:36:48,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320370.0, ans=0.0 2023-06-18 22:37:03,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=22.5 2023-06-18 22:37:45,638 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:38:19,519 INFO [train.py:996] (3/4) Epoch 2, batch 22950, loss[loss=0.3442, simple_loss=0.4516, pruned_loss=0.1184, over 21592.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3624, pruned_loss=0.1209, over 4276705.84 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:38:32,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 3.018e+02 3.691e+02 4.786e+02 9.826e+02, threshold=7.383e+02, percent-clipped=2.0
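Note: grad_scale in these lines is the fp16 loss-scaling factor of mixed-precision training. Its movement in this stretch (16.0 around batch 22300, 32.0 by 22400, back to 16.0 at 22600, up again by 22800) is the usual automatic-mixed-precision behavior: the scale doubles after a long overflow-free run and halves when gradients overflow. A sketch using PyTorch's stock GradScaler; the growth and backoff hyperparameters are assumptions:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0,       # assumed starting point
        growth_factor=2.0,     # double the scale...
        growth_interval=2000,  # ...after this many overflow-free steps
        backoff_factor=0.5,    # halve it when grads hit inf/nan
    )

    def fp16_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()   # backward through the scaled loss
        scaler.step(optimizer)          # unscales grads; skips step on overflow
        scaler.update()                 # grow or back off the scale
        return scaler.get_scale()       # the number logged as grad_scale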
2023-06-18 22:39:54,968 INFO [train.py:996] (3/4) Epoch 2, batch 23000, loss[loss=0.3175, simple_loss=0.3583, pruned_loss=0.1383, over 21288.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3619, pruned_loss=0.118, over 4276870.20 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:40:01,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=320970.0, ans=0.125 2023-06-18 22:40:31,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=321030.0, ans=0.95 2023-06-18 22:40:42,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=321090.0, ans=0.0 2023-06-18 22:41:16,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-18 22:41:26,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=321210.0, ans=0.125 2023-06-18 22:41:26,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=321210.0, ans=0.125 2023-06-18 22:41:32,456 INFO [train.py:996] (3/4) Epoch 2, batch 23050, loss[loss=0.3237, simple_loss=0.3705, pruned_loss=0.1384, over 21504.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3648, pruned_loss=0.1225, over 4283378.32 frames. ], batch size: 194, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:41:54,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.225e+02 4.039e+02 5.218e+02 8.181e+02, threshold=8.078e+02, percent-clipped=3.0 2023-06-18 22:42:22,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=321390.0, ans=0.125 2023-06-18 22:42:47,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=321450.0, ans=0.0 2023-06-18 22:42:57,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=321510.0, ans=0.2 2023-06-18 22:43:05,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321510.0, ans=0.1 2023-06-18 22:43:08,185 INFO [train.py:996] (3/4) Epoch 2, batch 23100, loss[loss=0.274, simple_loss=0.3255, pruned_loss=0.1112, over 21595.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3584, pruned_loss=0.1218, over 4286405.64 frames. ], batch size: 332, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:43:24,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=321570.0, ans=0.125 2023-06-18 22:43:27,323 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:44:13,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.20 vs. limit=22.5 2023-06-18 22:44:42,549 INFO [train.py:996] (3/4) Epoch 2, batch 23150, loss[loss=0.3253, simple_loss=0.3694, pruned_loss=0.1405, over 21413.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3514, pruned_loss=0.1206, over 4287923.31 frames.
], batch size: 143, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:44:58,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.109e+02 3.658e+02 4.384e+02 7.114e+02, threshold=7.315e+02, percent-clipped=0.0 2023-06-18 22:45:04,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=321930.0, ans=0.0 2023-06-18 22:46:12,260 INFO [train.py:996] (3/4) Epoch 2, batch 23200, loss[loss=0.3698, simple_loss=0.4207, pruned_loss=0.1595, over 20089.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3521, pruned_loss=0.1218, over 4288353.56 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:46:15,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-18 22:46:23,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=322170.0, ans=0.2 2023-06-18 22:46:55,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=322290.0, ans=0.0 2023-06-18 22:47:48,009 INFO [train.py:996] (3/4) Epoch 2, batch 23250, loss[loss=0.3281, simple_loss=0.3672, pruned_loss=0.1445, over 21640.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3526, pruned_loss=0.1236, over 4295724.45 frames. ], batch size: 471, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:47:49,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=322470.0, ans=0.2 2023-06-18 22:48:04,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.062e+02 3.496e+02 4.224e+02 8.959e+02, threshold=6.992e+02, percent-clipped=2.0 2023-06-18 22:48:24,027 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:48:41,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=322590.0, ans=0.125 2023-06-18 22:48:52,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322650.0, ans=0.125 2023-06-18 22:49:05,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-18 22:49:09,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322710.0, ans=0.1 2023-06-18 22:49:19,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=322710.0, ans=0.125 2023-06-18 22:49:25,576 INFO [train.py:996] (3/4) Epoch 2, batch 23300, loss[loss=0.3173, simple_loss=0.4063, pruned_loss=0.1142, over 21579.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3603, pruned_loss=0.1261, over 4296270.50 frames. ], batch size: 230, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:49:26,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=15.0 2023-06-18 22:49:44,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=322830.0, ans=0.125 2023-06-18 22:49:45,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=322830.0, ans=0.125 2023-06-18 22:49:59,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=322890.0, ans=0.125 2023-06-18 22:50:49,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-18 22:51:02,215 INFO [train.py:996] (3/4) Epoch 2, batch 23350, loss[loss=0.2399, simple_loss=0.3058, pruned_loss=0.08696, over 21796.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3641, pruned_loss=0.1246, over 4290528.61 frames. ], batch size: 282, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:51:18,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.248e+02 3.923e+02 4.916e+02 7.049e+02, threshold=7.847e+02, percent-clipped=1.0 2023-06-18 22:51:35,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=323130.0, ans=0.125 2023-06-18 22:51:36,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323190.0, ans=0.1 2023-06-18 22:51:52,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.10 vs. limit=10.0 2023-06-18 22:52:37,473 INFO [train.py:996] (3/4) Epoch 2, batch 23400, loss[loss=0.2419, simple_loss=0.3171, pruned_loss=0.08334, over 21616.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3556, pruned_loss=0.1191, over 4281622.77 frames. ], batch size: 389, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:52:48,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-18 22:52:51,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0 2023-06-18 22:52:54,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=323370.0, ans=0.0 2023-06-18 22:52:55,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-18 22:53:30,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=323490.0, ans=0.125 2023-06-18 22:54:03,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=323610.0, ans=0.125 2023-06-18 22:54:20,632 INFO [train.py:996] (3/4) Epoch 2, batch 23450, loss[loss=0.3101, simple_loss=0.3649, pruned_loss=0.1276, over 21939.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.355, pruned_loss=0.1206, over 4280255.10 frames. 
], batch size: 372, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:54:33,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.119e+02 3.774e+02 4.736e+02 8.725e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-18 22:54:51,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=323730.0, ans=0.0 2023-06-18 22:54:54,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=323730.0, ans=0.125 2023-06-18 22:55:03,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-18 22:55:59,317 INFO [train.py:996] (3/4) Epoch 2, batch 23500, loss[loss=0.2797, simple_loss=0.3358, pruned_loss=0.1118, over 21835.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3569, pruned_loss=0.1237, over 4288835.85 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:57:17,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=324210.0, ans=0.125 2023-06-18 22:57:36,238 INFO [train.py:996] (3/4) Epoch 2, batch 23550, loss[loss=0.2938, simple_loss=0.3267, pruned_loss=0.1305, over 21218.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3531, pruned_loss=0.1234, over 4280130.32 frames. ], batch size: 159, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:57:36,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=324270.0, ans=0.125 2023-06-18 22:57:48,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.808e+02 4.439e+02 7.936e+02, threshold=7.617e+02, percent-clipped=1.0 2023-06-18 22:58:03,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=324330.0, ans=0.0 2023-06-18 22:58:07,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=324330.0, ans=0.0 2023-06-18 22:58:46,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324450.0, ans=0.1 2023-06-18 22:59:14,005 INFO [train.py:996] (3/4) Epoch 2, batch 23600, loss[loss=0.34, simple_loss=0.3869, pruned_loss=0.1465, over 21686.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3523, pruned_loss=0.1234, over 4276336.79 frames. 
], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:59:32,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=324570.0, ans=0.025 2023-06-18 22:59:32,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=324570.0, ans=0.0 2023-06-18 22:59:39,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=324630.0, ans=0.04949747468305833 2023-06-18 22:59:46,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=324630.0, ans=0.125 2023-06-18 23:00:13,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=324690.0, ans=0.0 2023-06-18 23:00:18,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=324750.0, ans=0.0 2023-06-18 23:00:48,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324810.0, ans=0.125 2023-06-18 23:00:57,785 INFO [train.py:996] (3/4) Epoch 2, batch 23650, loss[loss=0.2729, simple_loss=0.3396, pruned_loss=0.1031, over 21592.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3524, pruned_loss=0.121, over 4282047.00 frames. ], batch size: 263, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:01:09,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324870.0, ans=0.1 2023-06-18 23:01:10,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.434e+02 4.143e+02 5.445e+02 9.457e+02, threshold=8.285e+02, percent-clipped=4.0 2023-06-18 23:01:23,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=324930.0, ans=0.2 2023-06-18 23:01:48,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=324990.0, ans=0.125 2023-06-18 23:02:02,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=325050.0, ans=0.0 2023-06-18 23:02:15,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=325050.0, ans=0.0 2023-06-18 23:02:18,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=325110.0, ans=0.125 2023-06-18 23:02:29,490 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:02:39,076 INFO [train.py:996] (3/4) Epoch 2, batch 23700, loss[loss=0.2182, simple_loss=0.2682, pruned_loss=0.08408, over 17141.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3565, pruned_loss=0.1211, over 4268925.99 frames. 
], batch size: 63, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:02:42,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325170.0, ans=0.1 2023-06-18 23:03:05,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=325230.0, ans=0.125 2023-06-18 23:03:12,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=325230.0, ans=0.125 2023-06-18 23:03:17,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=325230.0, ans=0.0 2023-06-18 23:03:25,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=325290.0, ans=0.125 2023-06-18 23:03:50,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=325350.0, ans=0.125 2023-06-18 23:03:59,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:04:20,798 INFO [train.py:996] (3/4) Epoch 2, batch 23750, loss[loss=0.2242, simple_loss=0.3142, pruned_loss=0.06714, over 21417.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3583, pruned_loss=0.1221, over 4273265.84 frames. ], batch size: 194, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:04:28,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325470.0, ans=0.125 2023-06-18 23:04:38,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.156e+02 3.937e+02 4.892e+02 1.167e+03, threshold=7.875e+02, percent-clipped=3.0 2023-06-18 23:05:19,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-18 23:05:22,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325650.0, ans=0.125 2023-06-18 23:06:06,052 INFO [train.py:996] (3/4) Epoch 2, batch 23800, loss[loss=0.3, simple_loss=0.3802, pruned_loss=0.1099, over 21674.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3558, pruned_loss=0.1184, over 4272309.39 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:06:07,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-18 23:06:07,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-18 23:06:26,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-18 23:06:32,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325830.0, ans=0.125 2023-06-18 23:06:45,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=325830.0, ans=0.0 2023-06-18 23:07:51,547 INFO [train.py:996] (3/4) Epoch 2, batch 23850, loss[loss=0.3422, simple_loss=0.4125, pruned_loss=0.136, over 21598.00 frames. 
], tot_loss[loss=0.3072, simple_loss=0.3681, pruned_loss=0.1232, over 4277760.01 frames. ], batch size: 414, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:08:09,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.353e+02 4.244e+02 5.255e+02 8.980e+02, threshold=8.488e+02, percent-clipped=3.0 2023-06-18 23:08:11,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=326130.0, ans=0.0 2023-06-18 23:08:44,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=326190.0, ans=0.125 2023-06-18 23:08:46,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=326190.0, ans=0.125 2023-06-18 23:09:18,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=326310.0, ans=0.2 2023-06-18 23:09:30,866 INFO [train.py:996] (3/4) Epoch 2, batch 23900, loss[loss=0.3369, simple_loss=0.3825, pruned_loss=0.1456, over 21995.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3745, pruned_loss=0.1263, over 4270549.53 frames. ], batch size: 103, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:10:38,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=326550.0, ans=0.125 2023-06-18 23:10:55,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=326610.0, ans=0.125 2023-06-18 23:11:08,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-18 23:11:09,861 INFO [train.py:996] (3/4) Epoch 2, batch 23950, loss[loss=0.3276, simple_loss=0.374, pruned_loss=0.1406, over 21756.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.3676, pruned_loss=0.1254, over 4266831.44 frames. ], batch size: 282, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:11:28,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.116e+02 3.830e+02 4.465e+02 7.558e+02, threshold=7.660e+02, percent-clipped=0.0 2023-06-18 23:12:50,756 INFO [train.py:996] (3/4) Epoch 2, batch 24000, loss[loss=0.3519, simple_loss=0.4004, pruned_loss=0.1517, over 21688.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3689, pruned_loss=0.1287, over 4271879.45 frames. ], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:12:50,756 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-18 23:13:09,350 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2897, simple_loss=0.3899, pruned_loss=0.09475, over 1796401.00 frames. 2023-06-18 23:13:09,350 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-18 23:13:36,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=327030.0, ans=22.5 2023-06-18 23:14:48,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=15.0
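Note: the batch 24000 block above (train.py:1019, 1028 and 1029) is the periodic validation pass: training pauses, the whole dev set is scored in one sweep (hence the fixed "over 1796401.00 frames."), and peak GPU memory is reported. A rough sketch of that bookkeeping; compute_loss is a placeholder interface, while torch.cuda.max_memory_allocated is the real call behind the MB line.

    import logging
    import torch

    def compute_validation_loss(model, valid_loader, compute_loss, device):
        logging.info("Computing validation loss")
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss_sum, num_frames = compute_loss(model, batch)  # placeholder
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        model.train()
        logging.info("validation: loss=%.4g, over %.2f frames."
                     % (tot_loss / tot_frames, tot_frames))
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        logging.info("Maximum memory allocated so far is %dMB" % mb)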
2023-06-18 23:14:48,754 INFO [train.py:996] (3/4) Epoch 2, batch 24050, loss[loss=0.2847, simple_loss=0.3598, pruned_loss=0.1048, over 21627.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3694, pruned_loss=0.1283, over 4274235.26 frames. ], batch size: 414, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:15:06,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.470e+02 4.161e+02 4.943e+02 1.064e+03, threshold=8.323e+02, percent-clipped=4.0 2023-06-18 23:15:29,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=327390.0, ans=0.125 2023-06-18 23:15:47,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=327390.0, ans=0.125 2023-06-18 23:15:48,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-18 23:15:51,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=327450.0, ans=0.0 2023-06-18 23:16:20,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=327510.0, ans=0.0 2023-06-18 23:16:24,474 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:16:34,954 INFO [train.py:996] (3/4) Epoch 2, batch 24100, loss[loss=0.3582, simple_loss=0.4082, pruned_loss=0.1541, over 21178.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3686, pruned_loss=0.1253, over 4258082.54 frames. ], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:16:36,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327570.0, ans=0.125 2023-06-18 23:16:46,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=327570.0, ans=0.2 2023-06-18 23:17:00,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=327630.0, ans=10.0 2023-06-18 23:17:24,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=327690.0, ans=0.125 2023-06-18 23:17:47,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=327810.0, ans=0.125 2023-06-18 23:18:03,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327810.0, ans=0.1 2023-06-18 23:18:03,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=327810.0, ans=0.04949747468305833 2023-06-18 23:18:14,206 INFO [train.py:996] (3/4) Epoch 2, batch 24150, loss[loss=0.3084, simple_loss=0.3558, pruned_loss=0.1305, over 21589.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3684, pruned_loss=0.1275, over 4273694.26 frames.
], batch size: 212, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:18:26,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.990e+02 3.404e+02 4.259e+02 8.342e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-18 23:18:47,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=327930.0, ans=0.125 2023-06-18 23:18:55,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=327990.0, ans=0.2 2023-06-18 23:19:11,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=328050.0, ans=0.025 2023-06-18 23:19:18,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=328050.0, ans=0.0 2023-06-18 23:19:18,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0 2023-06-18 23:19:53,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=328170.0, ans=0.2 2023-06-18 23:19:55,217 INFO [train.py:996] (3/4) Epoch 2, batch 24200, loss[loss=0.2944, simple_loss=0.3507, pruned_loss=0.1191, over 21184.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3703, pruned_loss=0.1285, over 4279345.60 frames. ], batch size: 176, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:19:58,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=328170.0, ans=0.125 2023-06-18 23:20:05,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=328170.0, ans=0.125 2023-06-18 23:20:10,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=328170.0, ans=0.125 2023-06-18 23:20:36,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=328290.0, ans=0.0 2023-06-18 23:21:41,613 INFO [train.py:996] (3/4) Epoch 2, batch 24250, loss[loss=0.2349, simple_loss=0.3285, pruned_loss=0.07066, over 21747.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3641, pruned_loss=0.1186, over 4277889.28 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:21:59,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 3.026e+02 3.601e+02 5.036e+02 9.709e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-18 23:22:13,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-18 23:22:21,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. 
limit=22.5 2023-06-18 23:22:32,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=328590.0, ans=0.0 2023-06-18 23:22:36,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=328590.0, ans=0.125 2023-06-18 23:22:38,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=328650.0, ans=0.125 2023-06-18 23:22:57,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=328650.0, ans=0.0 2023-06-18 23:23:21,442 INFO [train.py:996] (3/4) Epoch 2, batch 24300, loss[loss=0.2339, simple_loss=0.2953, pruned_loss=0.08629, over 21266.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3555, pruned_loss=0.1109, over 4279934.74 frames. ], batch size: 159, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:23:44,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=328830.0, ans=10.0 2023-06-18 23:23:51,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=328830.0, ans=0.2 2023-06-18 23:24:02,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=328890.0, ans=0.125 2023-06-18 23:24:12,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=328890.0, ans=0.2 2023-06-18 23:24:48,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-18 23:25:04,604 INFO [train.py:996] (3/4) Epoch 2, batch 24350, loss[loss=0.3465, simple_loss=0.3915, pruned_loss=0.1508, over 21409.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3516, pruned_loss=0.1122, over 4278254.59 frames. ], batch size: 548, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:25:17,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.896e+02 3.511e+02 4.657e+02 9.016e+02, threshold=7.022e+02, percent-clipped=4.0 2023-06-18 23:25:23,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=329130.0, ans=0.2 2023-06-18 23:25:57,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329190.0, ans=0.1 2023-06-18 23:26:38,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=329310.0, ans=0.2 2023-06-18 23:26:45,786 INFO [train.py:996] (3/4) Epoch 2, batch 24400, loss[loss=0.2854, simple_loss=0.3569, pruned_loss=0.1069, over 21183.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3582, pruned_loss=0.1176, over 4278480.88 frames. 
], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:27:04,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=329430.0, ans=0.0 2023-06-18 23:27:33,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329490.0, ans=0.1 2023-06-18 23:27:38,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=329490.0, ans=0.125 2023-06-18 23:27:38,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=329490.0, ans=0.0 2023-06-18 23:28:25,838 INFO [train.py:996] (3/4) Epoch 2, batch 24450, loss[loss=0.3065, simple_loss=0.379, pruned_loss=0.117, over 21751.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3604, pruned_loss=0.1197, over 4283909.77 frames. ], batch size: 332, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:28:38,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.395e+02 4.151e+02 4.993e+02 8.571e+02, threshold=8.301e+02, percent-clipped=4.0 2023-06-18 23:28:45,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=329730.0, ans=0.2 2023-06-18 23:28:55,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=329730.0, ans=0.95 2023-06-18 23:29:12,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329790.0, ans=0.125 2023-06-18 23:29:28,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=329850.0, ans=0.125 2023-06-18 23:29:34,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=329850.0, ans=0.125 2023-06-18 23:29:55,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329910.0, ans=0.1 2023-06-18 23:30:04,423 INFO [train.py:996] (3/4) Epoch 2, batch 24500, loss[loss=0.3156, simple_loss=0.3645, pruned_loss=0.1333, over 21866.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3612, pruned_loss=0.1196, over 4285847.37 frames. ], batch size: 351, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:30:13,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=329970.0, ans=0.07 2023-06-18 23:31:42,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=330210.0, ans=0.125 2023-06-18 23:31:44,652 INFO [train.py:996] (3/4) Epoch 2, batch 24550, loss[loss=0.3223, simple_loss=0.3809, pruned_loss=0.1319, over 21835.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3641, pruned_loss=0.1225, over 4288986.32 frames. 
], batch size: 282, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:31:58,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=330270.0, ans=0.125 2023-06-18 23:32:01,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.061e+02 3.714e+02 4.494e+02 1.254e+03, threshold=7.429e+02, percent-clipped=1.0 2023-06-18 23:32:24,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=330390.0, ans=0.0 2023-06-18 23:32:28,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=330390.0, ans=0.125 2023-06-18 23:32:54,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=330450.0, ans=0.2 2023-06-18 23:33:12,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-18 23:33:19,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=330510.0, ans=0.125 2023-06-18 23:33:22,616 INFO [train.py:996] (3/4) Epoch 2, batch 24600, loss[loss=0.2912, simple_loss=0.3194, pruned_loss=0.1315, over 20698.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3595, pruned_loss=0.1237, over 4275242.78 frames. ], batch size: 607, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:34:18,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-18 23:34:46,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=330810.0, ans=0.0 2023-06-18 23:35:01,705 INFO [train.py:996] (3/4) Epoch 2, batch 24650, loss[loss=0.2402, simple_loss=0.284, pruned_loss=0.09821, over 21464.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3492, pruned_loss=0.1208, over 4267612.01 frames. ], batch size: 195, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:35:06,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=330870.0, ans=15.0 2023-06-18 23:35:19,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.192e+02 3.864e+02 5.203e+02 1.017e+03, threshold=7.727e+02, percent-clipped=5.0 2023-06-18 23:35:23,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330930.0, ans=0.1 2023-06-18 23:35:48,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=330990.0, ans=0.125 2023-06-18 23:36:12,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-18 23:36:36,253 INFO [train.py:996] (3/4) Epoch 2, batch 24700, loss[loss=0.2888, simple_loss=0.347, pruned_loss=0.1153, over 21823.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3469, pruned_loss=0.1177, over 4275884.04 frames. 
], batch size: 372, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:36:38,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-18 23:37:16,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=331230.0, ans=0.0 2023-06-18 23:37:42,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=331350.0, ans=15.0 2023-06-18 23:37:50,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331350.0, ans=0.1 2023-06-18 23:38:13,925 INFO [train.py:996] (3/4) Epoch 2, batch 24750, loss[loss=0.2392, simple_loss=0.2905, pruned_loss=0.09389, over 21508.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3411, pruned_loss=0.1153, over 4271760.27 frames. ], batch size: 230, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:38:33,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 3.050e+02 3.889e+02 4.934e+02 8.372e+02, threshold=7.777e+02, percent-clipped=3.0 2023-06-18 23:38:37,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=331530.0, ans=0.0 2023-06-18 23:39:47,033 INFO [train.py:996] (3/4) Epoch 2, batch 24800, loss[loss=0.2933, simple_loss=0.334, pruned_loss=0.1263, over 21436.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3369, pruned_loss=0.1144, over 4260802.90 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:41:15,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=332010.0, ans=0.2 2023-06-18 23:41:26,056 INFO [train.py:996] (3/4) Epoch 2, batch 24850, loss[loss=0.2956, simple_loss=0.362, pruned_loss=0.1146, over 21702.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3381, pruned_loss=0.1162, over 4269536.22 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:41:32,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=332070.0, ans=0.0 2023-06-18 23:41:50,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.323e+02 4.351e+02 5.576e+02 8.938e+02, threshold=8.701e+02, percent-clipped=5.0 2023-06-18 23:41:50,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332130.0, ans=0.1 2023-06-18 23:42:19,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=332190.0, ans=0.125 2023-06-18 23:42:19,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=332190.0, ans=0.2 2023-06-18 23:42:49,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332310.0, ans=0.1 2023-06-18 23:42:55,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=332310.0, ans=0.125 2023-06-18 23:43:09,979 INFO [train.py:996] (3/4) Epoch 2, batch 24900, loss[loss=0.3116, simple_loss=0.3679, pruned_loss=0.1276, over 21348.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3388, pruned_loss=0.116, over 4274166.33 frames. 
], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:43:45,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-18 23:44:37,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-18 23:44:55,488 INFO [train.py:996] (3/4) Epoch 2, batch 24950, loss[loss=0.3645, simple_loss=0.396, pruned_loss=0.1665, over 21796.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3509, pruned_loss=0.1235, over 4279150.72 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:45:12,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=332670.0, ans=0.2 2023-06-18 23:45:15,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.427e+02 4.669e+02 5.544e+02 9.304e+02, threshold=9.338e+02, percent-clipped=1.0 2023-06-18 23:45:25,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=332730.0, ans=0.2 2023-06-18 23:45:29,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=332730.0, ans=0.0 2023-06-18 23:45:46,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-18 23:46:03,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=332850.0, ans=0.0 2023-06-18 23:46:40,055 INFO [train.py:996] (3/4) Epoch 2, batch 25000, loss[loss=0.357, simple_loss=0.3759, pruned_loss=0.1691, over 21282.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3584, pruned_loss=0.1256, over 4274183.15 frames. ], batch size: 507, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:47:13,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=333030.0, ans=0.125 2023-06-18 23:48:14,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=333210.0, ans=0.125 2023-06-18 23:48:16,994 INFO [train.py:996] (3/4) Epoch 2, batch 25050, loss[loss=0.29, simple_loss=0.3328, pruned_loss=0.1236, over 21711.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3501, pruned_loss=0.123, over 4263576.99 frames. ], batch size: 124, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:48:36,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.031e+02 3.621e+02 4.496e+02 7.145e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 23:49:20,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=333450.0, ans=0.2 2023-06-18 23:49:55,953 INFO [train.py:996] (3/4) Epoch 2, batch 25100, loss[loss=0.2817, simple_loss=0.328, pruned_loss=0.1177, over 21802.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3449, pruned_loss=0.122, over 4272816.52 frames. 
], batch size: 98, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:49:56,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=333570.0, ans=0.125 2023-06-18 23:49:58,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-06-18 23:50:05,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333570.0, ans=0.1 2023-06-18 23:50:15,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=333630.0, ans=0.125 2023-06-18 23:51:02,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-18 23:51:07,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0 2023-06-18 23:51:33,427 INFO [train.py:996] (3/4) Epoch 2, batch 25150, loss[loss=0.2921, simple_loss=0.3588, pruned_loss=0.1127, over 21336.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3472, pruned_loss=0.1189, over 4272642.61 frames. ], batch size: 176, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:51:39,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-18 23:51:48,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.003e+02 3.483e+02 4.487e+02 9.549e+02, threshold=6.965e+02, percent-clipped=3.0 2023-06-18 23:52:11,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-06-18 23:52:21,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0 2023-06-18 23:53:11,742 INFO [train.py:996] (3/4) Epoch 2, batch 25200, loss[loss=0.2417, simple_loss=0.3064, pruned_loss=0.08855, over 20010.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3461, pruned_loss=0.1166, over 4282066.94 frames. ], batch size: 702, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:53:24,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=334170.0, ans=0.0 2023-06-18 23:53:46,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=334290.0, ans=10.0 2023-06-18 23:54:02,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334350.0, ans=0.1 2023-06-18 23:54:16,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-18 23:54:18,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.58 vs. limit=10.0 2023-06-18 23:54:39,793 INFO [train.py:996] (3/4) Epoch 2, batch 25250, loss[loss=0.2075, simple_loss=0.2709, pruned_loss=0.07203, over 16838.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3426, pruned_loss=0.114, over 4276421.69 frames. 
], batch size: 62, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:55:04,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.756e+02 3.618e+02 4.524e+02 8.260e+02, threshold=7.237e+02, percent-clipped=4.0 2023-06-18 23:55:49,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=334650.0, ans=0.0 2023-06-18 23:55:50,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=334650.0, ans=0.2 2023-06-18 23:55:52,139 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:56:24,656 INFO [train.py:996] (3/4) Epoch 2, batch 25300, loss[loss=0.3745, simple_loss=0.4154, pruned_loss=0.1668, over 21726.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3411, pruned_loss=0.1139, over 4261745.05 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:56:45,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=334830.0, ans=0.0 2023-06-18 23:56:58,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=334830.0, ans=0.0 2023-06-18 23:57:23,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=334950.0, ans=0.125 2023-06-18 23:58:09,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=335070.0, ans=0.125 2023-06-18 23:58:09,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-18 23:58:10,260 INFO [train.py:996] (3/4) Epoch 2, batch 25350, loss[loss=0.2362, simple_loss=0.3106, pruned_loss=0.08089, over 21676.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3453, pruned_loss=0.1144, over 4258056.90 frames. ], batch size: 298, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:58:29,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.861e+02 3.471e+02 4.257e+02 9.448e+02, threshold=6.941e+02, percent-clipped=2.0 2023-06-18 23:58:58,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2023-06-18 23:59:27,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=335310.0, ans=15.0 2023-06-18 23:59:44,346 INFO [train.py:996] (3/4) Epoch 2, batch 25400, loss[loss=0.291, simple_loss=0.3426, pruned_loss=0.1197, over 21323.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.34, pruned_loss=0.1129, over 4251709.03 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:00:00,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. 
limit=10.0 2023-06-19 00:00:08,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=335430.0, ans=0.125 2023-06-19 00:00:18,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335430.0, ans=0.1 2023-06-19 00:01:19,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-19 00:01:22,819 INFO [train.py:996] (3/4) Epoch 2, batch 25450, loss[loss=0.2525, simple_loss=0.3389, pruned_loss=0.08299, over 21539.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3423, pruned_loss=0.1161, over 4252699.70 frames. ], batch size: 195, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:01:47,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.942e+02 3.491e+02 4.451e+02 7.396e+02, threshold=6.982e+02, percent-clipped=1.0 2023-06-19 00:02:25,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=335850.0, ans=10.0 2023-06-19 00:03:09,098 INFO [train.py:996] (3/4) Epoch 2, batch 25500, loss[loss=0.3013, simple_loss=0.3741, pruned_loss=0.1143, over 21667.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3427, pruned_loss=0.1123, over 4259909.25 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:03:22,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-19 00:03:27,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335970.0, ans=0.1 2023-06-19 00:03:53,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-19 00:04:54,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=336210.0, ans=0.2 2023-06-19 00:04:56,937 INFO [train.py:996] (3/4) Epoch 2, batch 25550, loss[loss=0.2939, simple_loss=0.3507, pruned_loss=0.1185, over 19989.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3483, pruned_loss=0.112, over 4248803.32 frames. ], batch size: 704, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:04:57,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=336270.0, ans=0.025 2023-06-19 00:05:07,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=336270.0, ans=0.125 2023-06-19 00:05:12,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.591e+02 3.118e+02 3.638e+02 5.445e+02, threshold=6.236e+02, percent-clipped=0.0 2023-06-19 00:05:54,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=336450.0, ans=0.2 2023-06-19 00:06:04,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336450.0, ans=0.1 2023-06-19 00:06:38,392 INFO [train.py:996] (3/4) Epoch 2, batch 25600, loss[loss=0.4149, simple_loss=0.4588, pruned_loss=0.1855, over 21788.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3522, pruned_loss=0.1135, over 4251170.28 frames. 
], batch size: 118, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:06:39,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=336570.0, ans=0.125 2023-06-19 00:07:25,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=336690.0, ans=0.1 2023-06-19 00:07:48,445 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:07:54,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=336750.0, ans=0.035 2023-06-19 00:08:17,760 INFO [train.py:996] (3/4) Epoch 2, batch 25650, loss[loss=0.2891, simple_loss=0.3344, pruned_loss=0.1219, over 21446.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3537, pruned_loss=0.1171, over 4260177.31 frames. ], batch size: 389, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:08:31,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.010e+02 3.647e+02 4.694e+02 1.135e+03, threshold=7.294e+02, percent-clipped=6.0 2023-06-19 00:08:40,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=336930.0, ans=0.0 2023-06-19 00:09:19,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=337050.0, ans=0.2 2023-06-19 00:09:36,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=337110.0, ans=0.0 2023-06-19 00:09:57,153 INFO [train.py:996] (3/4) Epoch 2, batch 25700, loss[loss=0.2896, simple_loss=0.3524, pruned_loss=0.1134, over 21166.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3514, pruned_loss=0.1185, over 4257142.05 frames. ], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:10:00,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=337170.0, ans=0.2 2023-06-19 00:10:35,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=337290.0, ans=0.125 2023-06-19 00:11:29,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=337410.0, ans=0.125 2023-06-19 00:11:38,888 INFO [train.py:996] (3/4) Epoch 2, batch 25750, loss[loss=0.4559, simple_loss=0.4662, pruned_loss=0.2228, over 21355.00 frames. ], tot_loss[loss=0.303, simple_loss=0.359, pruned_loss=0.1235, over 4265361.23 frames. ], batch size: 507, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:11:54,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.020e+02 3.881e+02 5.422e+02 1.342e+03, threshold=7.762e+02, percent-clipped=9.0 2023-06-19 00:12:09,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-19 00:13:12,010 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:13:14,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-06-19 00:13:26,843 INFO [train.py:996] (3/4) Epoch 2, batch 25800, loss[loss=0.2962, simple_loss=0.3594, pruned_loss=0.1165, over 21610.00 frames. 
], tot_loss[loss=0.3173, simple_loss=0.3746, pruned_loss=0.13, over 4268162.21 frames. ], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:13:30,556 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:13:32,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=337770.0, ans=0.2 2023-06-19 00:14:17,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=15.0 2023-06-19 00:14:30,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=337950.0, ans=0.125 2023-06-19 00:14:46,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=338010.0, ans=0.125 2023-06-19 00:15:02,344 INFO [train.py:996] (3/4) Epoch 2, batch 25850, loss[loss=0.324, simple_loss=0.3728, pruned_loss=0.1376, over 21872.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3748, pruned_loss=0.128, over 4276191.59 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:15:26,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=338070.0, ans=0.125 2023-06-19 00:15:26,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.253e+02 3.802e+02 4.832e+02 7.273e+02, threshold=7.603e+02, percent-clipped=0.0 2023-06-19 00:15:58,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=338190.0, ans=0.125 2023-06-19 00:16:03,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338190.0, ans=0.125 2023-06-19 00:16:07,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=338250.0, ans=0.04949747468305833 2023-06-19 00:16:18,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-19 00:16:43,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-06-19 00:16:53,751 INFO [train.py:996] (3/4) Epoch 2, batch 25900, loss[loss=0.3004, simple_loss=0.3655, pruned_loss=0.1177, over 21249.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3759, pruned_loss=0.129, over 4282797.62 frames. ], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:16:54,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-19 00:17:01,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. 
limit=15.0 2023-06-19 00:17:29,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=338490.0, ans=0.0 2023-06-19 00:17:38,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=338490.0, ans=0.0 2023-06-19 00:17:38,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-19 00:17:38,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=12.0 2023-06-19 00:18:39,796 INFO [train.py:996] (3/4) Epoch 2, batch 25950, loss[loss=0.3077, simple_loss=0.3598, pruned_loss=0.1278, over 21318.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3806, pruned_loss=0.1315, over 4287259.75 frames. ], batch size: 159, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:18:54,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.185e+02 3.771e+02 4.566e+02 7.877e+02, threshold=7.541e+02, percent-clipped=2.0 2023-06-19 00:18:56,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=338730.0, ans=0.1 2023-06-19 00:19:22,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338790.0, ans=0.125 2023-06-19 00:20:20,786 INFO [train.py:996] (3/4) Epoch 2, batch 26000, loss[loss=0.377, simple_loss=0.4243, pruned_loss=0.1648, over 21740.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3795, pruned_loss=0.1291, over 4288031.30 frames. ], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:20:35,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339030.0, ans=0.0 2023-06-19 00:20:43,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=339030.0, ans=0.125 2023-06-19 00:22:00,575 INFO [train.py:996] (3/4) Epoch 2, batch 26050, loss[loss=0.2941, simple_loss=0.3355, pruned_loss=0.1263, over 21058.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3797, pruned_loss=0.1308, over 4281175.83 frames. ], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:22:02,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-19 00:22:03,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=339270.0, ans=0.125 2023-06-19 00:22:14,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.293e+02 3.871e+02 4.573e+02 8.054e+02, threshold=7.741e+02, percent-clipped=1.0 2023-06-19 00:22:22,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-19 00:22:25,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-19 00:23:38,897 INFO [train.py:996] (3/4) Epoch 2, batch 26100, loss[loss=0.2772, simple_loss=0.3274, pruned_loss=0.1135, over 21366.00 frames. 
], tot_loss[loss=0.3171, simple_loss=0.3742, pruned_loss=0.13, over 4283117.80 frames. ], batch size: 144, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:23:39,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=339570.0, ans=0.125 2023-06-19 00:23:45,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339570.0, ans=0.0 2023-06-19 00:24:13,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=339690.0, ans=0.0 2023-06-19 00:25:03,928 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:25:20,009 INFO [train.py:996] (3/4) Epoch 2, batch 26150, loss[loss=0.3399, simple_loss=0.3856, pruned_loss=0.1471, over 21750.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3736, pruned_loss=0.1308, over 4285023.06 frames. ], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:25:26,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=339870.0, ans=0.125 2023-06-19 00:25:34,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.250e+02 3.966e+02 5.249e+02 8.349e+02, threshold=7.932e+02, percent-clipped=3.0 2023-06-19 00:25:43,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=339930.0, ans=0.125 2023-06-19 00:26:01,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=339990.0, ans=0.2 2023-06-19 00:26:11,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=339990.0, ans=0.0 2023-06-19 00:27:00,546 INFO [train.py:996] (3/4) Epoch 2, batch 26200, loss[loss=0.276, simple_loss=0.3649, pruned_loss=0.09349, over 21439.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3755, pruned_loss=0.1291, over 4285884.52 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:27:01,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=340170.0, ans=0.125 2023-06-19 00:27:04,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=340170.0, ans=0.0 2023-06-19 00:27:15,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=340230.0, ans=0.0 2023-06-19 00:27:20,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=340230.0, ans=0.025 2023-06-19 00:28:37,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-19 00:28:39,125 INFO [train.py:996] (3/4) Epoch 2, batch 26250, loss[loss=0.3237, simple_loss=0.369, pruned_loss=0.1392, over 21320.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.3758, pruned_loss=0.1268, over 4278205.08 frames. 
], batch size: 176, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:28:44,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340470.0, ans=0.1 2023-06-19 00:28:54,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.946e+02 3.629e+02 4.371e+02 7.049e+02, threshold=7.257e+02, percent-clipped=0.0 2023-06-19 00:28:59,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=340530.0, ans=0.125 2023-06-19 00:30:15,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=340710.0, ans=0.2 2023-06-19 00:30:20,173 INFO [train.py:996] (3/4) Epoch 2, batch 26300, loss[loss=0.3491, simple_loss=0.3914, pruned_loss=0.1534, over 21762.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3729, pruned_loss=0.1276, over 4283511.04 frames. ], batch size: 112, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:30:20,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=340770.0, ans=0.125 2023-06-19 00:31:15,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=340890.0, ans=0.125 2023-06-19 00:31:47,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-19 00:31:57,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=341010.0, ans=0.0 2023-06-19 00:32:00,709 INFO [train.py:996] (3/4) Epoch 2, batch 26350, loss[loss=0.2992, simple_loss=0.3568, pruned_loss=0.1208, over 21889.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3707, pruned_loss=0.1279, over 4283654.01 frames. ], batch size: 316, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:32:29,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 3.077e+02 3.703e+02 4.775e+02 7.605e+02, threshold=7.406e+02, percent-clipped=1.0 2023-06-19 00:32:53,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=341190.0, ans=0.125 2023-06-19 00:33:04,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-19 00:33:05,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=341250.0, ans=0.125 2023-06-19 00:33:07,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=341250.0, ans=0.035 2023-06-19 00:33:23,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=341310.0, ans=10.0 2023-06-19 00:33:34,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.83 vs. 
limit=22.5 2023-06-19 00:33:36,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=341310.0, ans=0.125 2023-06-19 00:33:38,848 INFO [train.py:996] (3/4) Epoch 2, batch 26400, loss[loss=0.2957, simple_loss=0.3298, pruned_loss=0.1308, over 21783.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3635, pruned_loss=0.1278, over 4284711.56 frames. ], batch size: 317, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:34:14,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-19 00:35:32,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=341670.0, ans=0.0 2023-06-19 00:35:33,177 INFO [train.py:996] (3/4) Epoch 2, batch 26450, loss[loss=0.3201, simple_loss=0.4044, pruned_loss=0.1179, over 21841.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3634, pruned_loss=0.1271, over 4272550.96 frames. ], batch size: 317, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:35:58,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.117e+02 3.740e+02 5.003e+02 1.177e+03, threshold=7.481e+02, percent-clipped=6.0 2023-06-19 00:36:23,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341790.0, ans=0.1 2023-06-19 00:37:20,182 INFO [train.py:996] (3/4) Epoch 2, batch 26500, loss[loss=0.2731, simple_loss=0.3354, pruned_loss=0.1054, over 21622.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3657, pruned_loss=0.1255, over 4267488.43 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:37:26,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=341970.0, ans=0.125 2023-06-19 00:38:02,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=342090.0, ans=0.125 2023-06-19 00:39:03,670 INFO [train.py:996] (3/4) Epoch 2, batch 26550, loss[loss=0.2376, simple_loss=0.3056, pruned_loss=0.08486, over 21542.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3621, pruned_loss=0.1203, over 4255251.71 frames. ], batch size: 212, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:39:09,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=342270.0, ans=0.2 2023-06-19 00:39:19,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.448e+02 4.337e+02 5.433e+02 9.319e+02, threshold=8.673e+02, percent-clipped=7.0 2023-06-19 00:40:42,944 INFO [train.py:996] (3/4) Epoch 2, batch 26600, loss[loss=0.248, simple_loss=0.3087, pruned_loss=0.09372, over 20720.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.359, pruned_loss=0.1159, over 4251420.50 frames. ], batch size: 608, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:41:17,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=342630.0, ans=0.95 2023-06-19 00:41:48,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=342690.0, ans=0.125 2023-06-19 00:42:22,717 INFO [train.py:996] (3/4) Epoch 2, batch 26650, loss[loss=0.1992, simple_loss=0.2815, pruned_loss=0.0584, over 21656.00 frames. 
], tot_loss[loss=0.2891, simple_loss=0.3506, pruned_loss=0.1138, over 4253755.57 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:42:38,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 3.189e+02 3.874e+02 5.287e+02 9.951e+02, threshold=7.747e+02, percent-clipped=1.0 2023-06-19 00:42:57,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=342990.0, ans=0.125 2023-06-19 00:43:03,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=342990.0, ans=0.125 2023-06-19 00:43:10,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=342990.0, ans=0.125 2023-06-19 00:43:37,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=343050.0, ans=0.125 2023-06-19 00:43:55,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343170.0, ans=0.1 2023-06-19 00:43:56,099 INFO [train.py:996] (3/4) Epoch 2, batch 26700, loss[loss=0.2551, simple_loss=0.3093, pruned_loss=0.1004, over 21213.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.343, pruned_loss=0.1109, over 4253486.11 frames. ], batch size: 608, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:44:14,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=343230.0, ans=0.125 2023-06-19 00:44:14,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=343230.0, ans=0.0 2023-06-19 00:44:29,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=343230.0, ans=0.125 2023-06-19 00:45:01,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-19 00:45:02,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=343290.0, ans=0.5 2023-06-19 00:45:30,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=343410.0, ans=0.125 2023-06-19 00:45:37,236 INFO [train.py:996] (3/4) Epoch 2, batch 26750, loss[loss=0.3433, simple_loss=0.396, pruned_loss=0.1453, over 21338.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3429, pruned_loss=0.1094, over 4262546.12 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:45:58,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.751e+02 3.226e+02 3.870e+02 9.468e+02, threshold=6.452e+02, percent-clipped=0.0 2023-06-19 00:46:31,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=343590.0, ans=0.125 2023-06-19 00:47:18,768 INFO [train.py:996] (3/4) Epoch 2, batch 26800, loss[loss=0.3479, simple_loss=0.3881, pruned_loss=0.1538, over 21490.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3528, pruned_loss=0.1155, over 4266701.52 frames. 
], batch size: 194, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:47:53,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=343830.0, ans=0.0 2023-06-19 00:48:03,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343830.0, ans=0.1 2023-06-19 00:48:38,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=344010.0, ans=0.2 2023-06-19 00:48:41,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=344010.0, ans=0.125 2023-06-19 00:48:54,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344010.0, ans=0.1 2023-06-19 00:48:58,641 INFO [train.py:996] (3/4) Epoch 2, batch 26850, loss[loss=0.2982, simple_loss=0.3337, pruned_loss=0.1314, over 15035.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3554, pruned_loss=0.1198, over 4261360.54 frames. ], batch size: 60, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:49:23,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=344070.0, ans=0.125 2023-06-19 00:49:24,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344070.0, ans=0.125 2023-06-19 00:49:29,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.578e+02 3.525e+02 4.180e+02 5.123e+02 1.126e+03, threshold=8.361e+02, percent-clipped=11.0 2023-06-19 00:49:37,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0 2023-06-19 00:50:10,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=344250.0, ans=0.125 2023-06-19 00:50:10,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344250.0, ans=0.1 2023-06-19 00:50:12,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=344250.0, ans=0.0 2023-06-19 00:50:37,551 INFO [train.py:996] (3/4) Epoch 2, batch 26900, loss[loss=0.2499, simple_loss=0.299, pruned_loss=0.1004, over 21657.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3463, pruned_loss=0.1184, over 4265615.45 frames. ], batch size: 282, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:51:05,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=344430.0, ans=0.0 2023-06-19 00:52:04,899 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:52:12,712 INFO [train.py:996] (3/4) Epoch 2, batch 26950, loss[loss=0.3234, simple_loss=0.3822, pruned_loss=0.1323, over 21507.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3449, pruned_loss=0.1175, over 4262914.12 frames. ], batch size: 212, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:52:41,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-06-19 00:52:43,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.137e+02 3.675e+02 4.908e+02 1.022e+03, threshold=7.351e+02, percent-clipped=1.0 2023-06-19 00:53:12,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=344790.0, ans=0.125 2023-06-19 00:53:17,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=344850.0, ans=0.125 2023-06-19 00:54:08,206 INFO [train.py:996] (3/4) Epoch 2, batch 27000, loss[loss=0.2505, simple_loss=0.3337, pruned_loss=0.08369, over 21705.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3459, pruned_loss=0.1146, over 4272109.39 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:54:08,206 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 00:54:22,053 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8122, 3.2060, 2.6320, 3.6943], device='cuda:3') 2023-06-19 00:54:25,471 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2623, simple_loss=0.361, pruned_loss=0.08186, over 1796401.00 frames. 2023-06-19 00:54:25,472 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 00:54:30,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=344970.0, ans=0.125 2023-06-19 00:55:25,280 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:55:33,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=345150.0, ans=0.0 2023-06-19 00:55:35,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=345150.0, ans=0.0 2023-06-19 00:56:01,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-19 00:56:06,864 INFO [train.py:996] (3/4) Epoch 2, batch 27050, loss[loss=0.2272, simple_loss=0.3357, pruned_loss=0.05933, over 19791.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3476, pruned_loss=0.1102, over 4273383.54 frames. ], batch size: 702, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:56:23,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.868e+02 3.454e+02 4.544e+02 1.088e+03, threshold=6.909e+02, percent-clipped=2.0 2023-06-19 00:56:30,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=345330.0, ans=0.125 2023-06-19 00:56:35,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-19 00:56:51,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=345390.0, ans=0.2 2023-06-19 00:57:32,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345510.0, ans=0.1 2023-06-19 00:57:43,854 INFO [train.py:996] (3/4) Epoch 2, batch 27100, loss[loss=0.2875, simple_loss=0.3594, pruned_loss=0.1078, over 21462.00 frames. 
], tot_loss[loss=0.289, simple_loss=0.3514, pruned_loss=0.1133, over 4281030.77 frames. ], batch size: 548, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:57:49,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=345570.0, ans=0.0 2023-06-19 00:58:19,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=345690.0, ans=0.125 2023-06-19 00:58:21,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-19 00:59:17,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-19 00:59:18,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=345870.0, ans=0.0 2023-06-19 00:59:20,175 INFO [train.py:996] (3/4) Epoch 2, batch 27150, loss[loss=0.3437, simple_loss=0.4155, pruned_loss=0.1359, over 21625.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.363, pruned_loss=0.1174, over 4281483.78 frames. ], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 00:59:26,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=345870.0, ans=0.0 2023-06-19 00:59:36,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.217e+02 5.291e+02 1.062e+03, threshold=8.433e+02, percent-clipped=9.0 2023-06-19 00:59:41,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=345930.0, ans=0.125 2023-06-19 00:59:54,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345990.0, ans=0.1 2023-06-19 01:00:02,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=345990.0, ans=0.2 2023-06-19 01:00:03,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-19 01:00:16,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-19 01:00:54,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=346170.0, ans=0.0 2023-06-19 01:00:55,725 INFO [train.py:996] (3/4) Epoch 2, batch 27200, loss[loss=0.3018, simple_loss=0.3743, pruned_loss=0.1146, over 21762.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3688, pruned_loss=0.1188, over 4282425.29 frames. ], batch size: 332, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:00:57,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.62 vs. 
limit=15.0 2023-06-19 01:01:01,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=346170.0, ans=0.125 2023-06-19 01:01:28,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=346230.0, ans=0.125 2023-06-19 01:01:59,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=346290.0, ans=0.125 2023-06-19 01:02:16,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=346350.0, ans=0.0 2023-06-19 01:02:32,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=346410.0, ans=0.125 2023-06-19 01:02:34,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=346410.0, ans=0.2 2023-06-19 01:02:34,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=346410.0, ans=0.125 2023-06-19 01:02:39,311 INFO [train.py:996] (3/4) Epoch 2, batch 27250, loss[loss=0.3841, simple_loss=0.4224, pruned_loss=0.1729, over 21576.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.374, pruned_loss=0.1255, over 4284992.00 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:02:49,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:02:59,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.070e+02 3.624e+02 4.371e+02 7.633e+02, threshold=7.247e+02, percent-clipped=0.0 2023-06-19 01:03:10,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-19 01:03:34,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=346590.0, ans=0.0 2023-06-19 01:03:37,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=346590.0, ans=0.125 2023-06-19 01:03:38,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=346590.0, ans=0.0 2023-06-19 01:03:39,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=346590.0, ans=0.0 2023-06-19 01:04:21,664 INFO [train.py:996] (3/4) Epoch 2, batch 27300, loss[loss=0.3278, simple_loss=0.406, pruned_loss=0.1248, over 21588.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3765, pruned_loss=0.127, over 4285552.49 frames. ], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:04:25,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=346770.0, ans=0.5 2023-06-19 01:05:25,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=346950.0, ans=0.0 2023-06-19 01:05:31,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-19 01:06:09,249 INFO [train.py:996] (3/4) Epoch 2, batch 27350, loss[loss=0.325, simple_loss=0.3756, pruned_loss=0.1371, over 21254.00 frames. 
], tot_loss[loss=0.3187, simple_loss=0.3794, pruned_loss=0.1291, over 4281192.71 frames. ], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:06:23,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=347070.0, ans=0.125 2023-06-19 01:06:31,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=347070.0, ans=0.0 2023-06-19 01:06:35,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 3.372e+02 3.927e+02 4.716e+02 8.245e+02, threshold=7.854e+02, percent-clipped=1.0 2023-06-19 01:06:51,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347190.0, ans=0.0 2023-06-19 01:07:48,533 INFO [train.py:996] (3/4) Epoch 2, batch 27400, loss[loss=0.246, simple_loss=0.2958, pruned_loss=0.0981, over 21598.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3752, pruned_loss=0.1286, over 4288874.17 frames. ], batch size: 231, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:07:55,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347370.0, ans=0.0 2023-06-19 01:08:32,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347490.0, ans=0.1 2023-06-19 01:09:00,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-19 01:09:10,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=12.0 2023-06-19 01:09:29,478 INFO [train.py:996] (3/4) Epoch 2, batch 27450, loss[loss=0.3481, simple_loss=0.3899, pruned_loss=0.1532, over 21293.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3673, pruned_loss=0.1253, over 4276868.38 frames. ], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:09:34,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=347670.0, ans=0.0 2023-06-19 01:09:45,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.415e+02 3.960e+02 5.232e+02 1.053e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-19 01:10:25,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=347850.0, ans=0.125 2023-06-19 01:10:35,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=347850.0, ans=0.125 2023-06-19 01:10:36,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347910.0, ans=0.0 2023-06-19 01:10:41,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=347910.0, ans=0.5 2023-06-19 01:11:08,115 INFO [train.py:996] (3/4) Epoch 2, batch 27500, loss[loss=0.2937, simple_loss=0.349, pruned_loss=0.1192, over 21337.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3657, pruned_loss=0.1254, over 4281314.74 frames. 
], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:11:22,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=348030.0, ans=0.125 2023-06-19 01:11:40,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=348090.0, ans=10.0 2023-06-19 01:11:44,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=348090.0, ans=0.125 2023-06-19 01:11:49,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=348090.0, ans=0.0 2023-06-19 01:12:06,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348150.0, ans=0.1 2023-06-19 01:12:44,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=348210.0, ans=0.125 2023-06-19 01:12:47,545 INFO [train.py:996] (3/4) Epoch 2, batch 27550, loss[loss=0.2306, simple_loss=0.3009, pruned_loss=0.08012, over 21662.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3582, pruned_loss=0.1204, over 4284002.82 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 01:12:58,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=348270.0, ans=0.0 2023-06-19 01:13:01,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=348330.0, ans=0.125 2023-06-19 01:13:05,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.210e+02 3.897e+02 4.749e+02 7.014e+02, threshold=7.795e+02, percent-clipped=0.0 2023-06-19 01:13:05,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=348330.0, ans=10.0 2023-06-19 01:13:19,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=348390.0, ans=0.0 2023-06-19 01:13:29,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=348390.0, ans=0.09899494936611666 2023-06-19 01:13:31,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=348390.0, ans=0.125 2023-06-19 01:13:57,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=348510.0, ans=0.125 2023-06-19 01:14:29,321 INFO [train.py:996] (3/4) Epoch 2, batch 27600, loss[loss=0.3013, simple_loss=0.3387, pruned_loss=0.132, over 21592.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3522, pruned_loss=0.1194, over 4279555.47 frames. 
], batch size: 415, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:14:38,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=348570.0, ans=0.1 2023-06-19 01:14:43,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=348630.0, ans=0.125 2023-06-19 01:14:51,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=348630.0, ans=0.125 2023-06-19 01:15:07,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=348690.0, ans=0.0 2023-06-19 01:15:11,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=348690.0, ans=0.0 2023-06-19 01:15:18,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=348750.0, ans=0.0 2023-06-19 01:15:23,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=348750.0, ans=0.2 2023-06-19 01:15:48,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=348810.0, ans=0.0 2023-06-19 01:16:03,369 INFO [train.py:996] (3/4) Epoch 2, batch 27650, loss[loss=0.2733, simple_loss=0.3375, pruned_loss=0.1046, over 21436.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3466, pruned_loss=0.1188, over 4271538.51 frames. ], batch size: 131, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:16:15,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.74 vs. limit=10.0 2023-06-19 01:16:25,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.734e+02 3.535e+02 4.526e+02 5.756e+02 1.207e+03, threshold=9.051e+02, percent-clipped=8.0 2023-06-19 01:16:35,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=348930.0, ans=0.2 2023-06-19 01:16:45,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348990.0, ans=0.125 2023-06-19 01:17:16,173 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:17:48,127 INFO [train.py:996] (3/4) Epoch 2, batch 27700, loss[loss=0.2994, simple_loss=0.366, pruned_loss=0.1164, over 21778.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3476, pruned_loss=0.1175, over 4271313.14 frames. 
], batch size: 332, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:18:11,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=349230.0, ans=0.125 2023-06-19 01:18:16,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=349230.0, ans=0.125 2023-06-19 01:18:21,810 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:18:29,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=349290.0, ans=0.2 2023-06-19 01:19:27,717 INFO [train.py:996] (3/4) Epoch 2, batch 27750, loss[loss=0.2271, simple_loss=0.3, pruned_loss=0.07712, over 21229.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3503, pruned_loss=0.1172, over 4279718.00 frames. ], batch size: 176, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:19:38,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-19 01:19:38,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-19 01:19:44,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-19 01:19:45,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.361e+02 3.964e+02 5.094e+02 9.268e+02, threshold=7.928e+02, percent-clipped=1.0 2023-06-19 01:20:01,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349590.0, ans=0.1 2023-06-19 01:20:18,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=349650.0, ans=0.125 2023-06-19 01:21:06,408 INFO [train.py:996] (3/4) Epoch 2, batch 27800, loss[loss=0.3542, simple_loss=0.3892, pruned_loss=0.1596, over 21770.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3494, pruned_loss=0.1177, over 4277521.24 frames. ], batch size: 441, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:21:15,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-19 01:21:17,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=349770.0, ans=0.0 2023-06-19 01:21:19,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=349770.0, ans=0.2 2023-06-19 01:21:37,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=349890.0, ans=0.125 2023-06-19 01:21:39,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=349890.0, ans=0.0 2023-06-19 01:22:00,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=349950.0, ans=0.125 2023-06-19 01:22:47,535 INFO [train.py:996] (3/4) Epoch 2, batch 27850, loss[loss=0.297, simple_loss=0.3574, pruned_loss=0.1183, over 21860.00 frames. 
], tot_loss[loss=0.2944, simple_loss=0.3498, pruned_loss=0.1195, over 4284375.91 frames. ], batch size: 371, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:23:06,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.389e+02 4.361e+02 6.049e+02 9.596e+02, threshold=8.723e+02, percent-clipped=7.0 2023-06-19 01:23:23,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=350190.0, ans=0.125 2023-06-19 01:23:25,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=350190.0, ans=15.0 2023-06-19 01:24:18,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=350310.0, ans=0.0 2023-06-19 01:24:23,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350310.0, ans=0.1 2023-06-19 01:24:30,943 INFO [train.py:996] (3/4) Epoch 2, batch 27900, loss[loss=0.2777, simple_loss=0.3592, pruned_loss=0.09809, over 21569.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3586, pruned_loss=0.1202, over 4288436.59 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:24:36,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=350370.0, ans=0.125 2023-06-19 01:24:39,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=350370.0, ans=0.125 2023-06-19 01:25:49,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=350550.0, ans=0.05 2023-06-19 01:26:07,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350610.0, ans=0.1 2023-06-19 01:26:13,832 INFO [train.py:996] (3/4) Epoch 2, batch 27950, loss[loss=0.2549, simple_loss=0.3423, pruned_loss=0.08375, over 21732.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3587, pruned_loss=0.1152, over 4284234.25 frames. ], batch size: 351, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:26:32,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.236e+02 3.896e+02 4.908e+02 8.483e+02, threshold=7.791e+02, percent-clipped=0.0 2023-06-19 01:27:09,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=15.0 2023-06-19 01:27:36,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=350910.0, ans=0.125 2023-06-19 01:27:53,647 INFO [train.py:996] (3/4) Epoch 2, batch 28000, loss[loss=0.2674, simple_loss=0.3174, pruned_loss=0.1087, over 21692.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3551, pruned_loss=0.1128, over 4287258.76 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:28:04,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-19 01:29:18,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. 
limit=15.0
2023-06-19 01:29:21,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=11.79 vs. limit=15.0
2023-06-19 01:29:35,433 INFO [train.py:996] (3/4) Epoch 2, batch 28050, loss[loss=0.2814, simple_loss=0.3549, pruned_loss=0.104, over 21703.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3539, pruned_loss=0.1153, over 4289185.24 frames. ], batch size: 389, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:29:37,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351270.0, ans=0.1
2023-06-19 01:29:57,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.799e+02 3.165e+02 3.817e+02 7.021e+02, threshold=6.330e+02, percent-clipped=0.0
2023-06-19 01:30:33,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=351390.0, ans=0.0
2023-06-19 01:30:41,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=351450.0, ans=0.125
2023-06-19 01:31:15,447 INFO [train.py:996] (3/4) Epoch 2, batch 28100, loss[loss=0.2548, simple_loss=0.3181, pruned_loss=0.09573, over 21830.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3503, pruned_loss=0.1148, over 4282797.12 frames. ], batch size: 107, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:31:16,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0
2023-06-19 01:31:25,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351570.0, ans=0.1
2023-06-19 01:31:52,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351630.0, ans=0.1
2023-06-19 01:32:05,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=351690.0, ans=0.125
2023-06-19 01:32:05,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=351690.0, ans=0.125
2023-06-19 01:32:26,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=351750.0, ans=0.025
2023-06-19 01:32:30,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=351750.0, ans=0.2
2023-06-19 01:32:54,277 INFO [train.py:996] (3/4) Epoch 2, batch 28150, loss[loss=0.2642, simple_loss=0.308, pruned_loss=0.1102, over 21618.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3436, pruned_loss=0.1156, over 4287709.37 frames.
], batch size: 282, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:33:04,128 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 01:33:05,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=351870.0, ans=0.125
2023-06-19 01:33:07,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=351870.0, ans=0.2
2023-06-19 01:33:11,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.356e+02 3.949e+02 5.361e+02 1.113e+03, threshold=7.898e+02, percent-clipped=11.0
2023-06-19 01:33:24,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=351930.0, ans=0.125
2023-06-19 01:33:46,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=351990.0, ans=0.015
2023-06-19 01:33:49,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=351990.0, ans=0.0
2023-06-19 01:34:29,889 INFO [train.py:996] (3/4) Epoch 2, batch 28200, loss[loss=0.3158, simple_loss=0.3576, pruned_loss=0.1369, over 21254.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3401, pruned_loss=0.117, over 4288892.52 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:34:33,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=352170.0, ans=0.0
2023-06-19 01:35:25,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352290.0, ans=0.1
2023-06-19 01:35:30,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=352290.0, ans=0.04949747468305833
2023-06-19 01:36:08,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=352410.0, ans=0.0
2023-06-19 01:36:10,823 INFO [train.py:996] (3/4) Epoch 2, batch 28250, loss[loss=0.2768, simple_loss=0.3289, pruned_loss=0.1123, over 21791.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3437, pruned_loss=0.1203, over 4288423.57 frames. ], batch size: 352, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:36:38,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.660e+02 4.283e+02 5.277e+02 9.711e+02, threshold=8.566e+02, percent-clipped=2.0
2023-06-19 01:37:37,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=352710.0, ans=0.2
2023-06-19 01:37:43,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=352710.0, ans=0.125
2023-06-19 01:37:47,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=352710.0, ans=0.2
2023-06-19 01:37:51,595 INFO [train.py:996] (3/4) Epoch 2, batch 28300, loss[loss=0.2311, simple_loss=0.3032, pruned_loss=0.07951, over 21376.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3426, pruned_loss=0.1171, over 4277675.42 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:39:12,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.31 vs.
limit=22.5
2023-06-19 01:39:44,098 INFO [train.py:996] (3/4) Epoch 2, batch 28350, loss[loss=0.2775, simple_loss=0.3256, pruned_loss=0.1147, over 21321.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3411, pruned_loss=0.1107, over 4270207.71 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:40:07,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.865e+02 3.652e+02 5.364e+02 1.153e+03, threshold=7.304e+02, percent-clipped=2.0
2023-06-19 01:40:17,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=353130.0, ans=0.125
2023-06-19 01:40:44,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=353250.0, ans=0.2
2023-06-19 01:40:58,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=353310.0, ans=0.2
2023-06-19 01:41:30,007 INFO [train.py:996] (3/4) Epoch 2, batch 28400, loss[loss=0.2748, simple_loss=0.3223, pruned_loss=0.1136, over 21363.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.337, pruned_loss=0.1099, over 4264392.04 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:41:59,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=353430.0, ans=0.0
2023-06-19 01:42:01,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=353490.0, ans=0.025
2023-06-19 01:42:05,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=22.5
2023-06-19 01:42:22,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0
2023-06-19 01:43:09,874 INFO [train.py:996] (3/4) Epoch 2, batch 28450, loss[loss=0.3251, simple_loss=0.3783, pruned_loss=0.136, over 21776.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3445, pruned_loss=0.1155, over 4267851.99 frames. ], batch size: 112, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:43:27,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.368e+02 4.115e+02 5.202e+02 1.060e+03, threshold=8.231e+02, percent-clipped=7.0
2023-06-19 01:44:38,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=353910.0, ans=0.2
2023-06-19 01:44:39,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0
2023-06-19 01:44:42,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=353910.0, ans=0.0
2023-06-19 01:44:50,963 INFO [train.py:996] (3/4) Epoch 2, batch 28500, loss[loss=0.3197, simple_loss=0.3743, pruned_loss=0.1326, over 21768.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3468, pruned_loss=0.1177, over 4271397.57 frames.
], batch size: 124, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:46:04,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=354150.0, ans=0.125
2023-06-19 01:46:15,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=354150.0, ans=0.125
2023-06-19 01:46:28,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=354210.0, ans=0.0
2023-06-19 01:46:32,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0
2023-06-19 01:46:34,616 INFO [train.py:996] (3/4) Epoch 2, batch 28550, loss[loss=0.3262, simple_loss=0.4003, pruned_loss=0.126, over 21472.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3569, pruned_loss=0.1216, over 4275930.16 frames. ], batch size: 211, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:46:39,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354270.0, ans=0.1
2023-06-19 01:46:45,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0
2023-06-19 01:46:46,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354270.0, ans=0.1
2023-06-19 01:46:52,892 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.021e+02 3.809e+02 4.877e+02 1.502e+03, threshold=7.617e+02, percent-clipped=6.0
2023-06-19 01:46:53,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
2023-06-19 01:47:16,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=354390.0, ans=0.125
2023-06-19 01:47:55,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=354450.0, ans=0.07
2023-06-19 01:48:11,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=354510.0, ans=0.0
2023-06-19 01:48:17,806 INFO [train.py:996] (3/4) Epoch 2, batch 28600, loss[loss=0.3081, simple_loss=0.3646, pruned_loss=0.1258, over 21573.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3639, pruned_loss=0.124, over 4275724.74 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:48:18,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=354570.0, ans=0.04949747468305833
2023-06-19 01:48:23,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=12.0
2023-06-19 01:48:47,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=354630.0, ans=0.125
2023-06-19 01:49:28,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=354750.0, ans=0.2
2023-06-19 01:49:58,540 INFO [train.py:996] (3/4) Epoch 2, batch 28650, loss[loss=0.2366, simple_loss=0.2816, pruned_loss=0.09579, over 21499.00 frames.
], tot_loss[loss=0.3009, simple_loss=0.3566, pruned_loss=0.1225, over 4262241.58 frames. ], batch size: 196, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:50:01,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.49 vs. limit=10.0
2023-06-19 01:50:21,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.376e+02 3.990e+02 4.916e+02 8.510e+02, threshold=7.981e+02, percent-clipped=3.0
2023-06-19 01:50:21,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=354930.0, ans=0.125
2023-06-19 01:51:24,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=355110.0, ans=0.125
2023-06-19 01:51:38,415 INFO [train.py:996] (3/4) Epoch 2, batch 28700, loss[loss=0.3264, simple_loss=0.3698, pruned_loss=0.1415, over 21222.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3562, pruned_loss=0.1237, over 4259500.91 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0
2023-06-19 01:51:53,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0
2023-06-19 01:51:57,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=355230.0, ans=0.0
2023-06-19 01:52:02,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355230.0, ans=0.1
2023-06-19 01:53:00,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0
2023-06-19 01:53:05,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2023-06-19 01:53:18,221 INFO [train.py:996] (3/4) Epoch 2, batch 28750, loss[loss=0.3286, simple_loss=0.3728, pruned_loss=0.1422, over 21250.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3563, pruned_loss=0.125, over 4265112.36 frames. ], batch size: 143, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 01:53:46,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.065e+02 3.653e+02 4.286e+02 6.736e+02, threshold=7.306e+02, percent-clipped=0.0
2023-06-19 01:53:49,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=355530.0, ans=0.2
2023-06-19 01:54:09,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=355590.0, ans=0.0
2023-06-19 01:54:33,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=355650.0, ans=0.125
2023-06-19 01:54:58,828 INFO [train.py:996] (3/4) Epoch 2, batch 28800, loss[loss=0.3656, simple_loss=0.4137, pruned_loss=0.1588, over 21474.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3606, pruned_loss=0.1257, over 4269401.76 frames.
], batch size: 471, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 01:55:04,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=355770.0, ans=0.125
2023-06-19 01:55:38,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=355830.0, ans=0.0
2023-06-19 01:55:48,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=355890.0, ans=0.1
2023-06-19 01:55:54,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=355890.0, ans=0.125
2023-06-19 01:56:17,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=356010.0, ans=0.0
2023-06-19 01:56:25,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=356010.0, ans=0.125
2023-06-19 01:56:40,222 INFO [train.py:996] (3/4) Epoch 2, batch 28850, loss[loss=0.3079, simple_loss=0.3458, pruned_loss=0.135, over 21628.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3627, pruned_loss=0.127, over 4266168.92 frames. ], batch size: 212, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 01:57:12,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.103e+02 3.762e+02 4.499e+02 8.286e+02, threshold=7.524e+02, percent-clipped=2.0
2023-06-19 01:57:13,007 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 01:57:19,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=356130.0, ans=6.0
2023-06-19 01:57:34,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356190.0, ans=0.125
2023-06-19 01:57:47,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=356250.0, ans=0.05
2023-06-19 01:58:02,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=356250.0, ans=0.125
2023-06-19 01:58:26,581 INFO [train.py:996] (3/4) Epoch 2, batch 28900, loss[loss=0.3439, simple_loss=0.3968, pruned_loss=0.1455, over 21422.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3646, pruned_loss=0.1286, over 4271542.98 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 01:58:39,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-06-19 01:58:54,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=356430.0, ans=0.0
2023-06-19 01:59:20,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=356490.0, ans=0.0
2023-06-19 02:00:19,027 INFO [train.py:996] (3/4) Epoch 2, batch 28950, loss[loss=0.3702, simple_loss=0.4267, pruned_loss=0.1569, over 21526.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3678, pruned_loss=0.1282, over 4265734.87 frames.
], batch size: 507, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:00:27,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356670.0, ans=0.1
2023-06-19 02:00:37,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.281e+02 4.028e+02 5.318e+02 1.006e+03, threshold=8.055e+02, percent-clipped=4.0
2023-06-19 02:01:00,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=356790.0, ans=0.125
2023-06-19 02:01:54,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0
2023-06-19 02:01:58,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=356910.0, ans=0.0
2023-06-19 02:02:01,645 INFO [train.py:996] (3/4) Epoch 2, batch 29000, loss[loss=0.3095, simple_loss=0.3722, pruned_loss=0.1234, over 21306.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3722, pruned_loss=0.1274, over 4263445.47 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:03:44,470 INFO [train.py:996] (3/4) Epoch 2, batch 29050, loss[loss=0.2711, simple_loss=0.3261, pruned_loss=0.108, over 21413.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3688, pruned_loss=0.1272, over 4268201.15 frames. ], batch size: 131, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:03:57,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=357270.0, ans=0.125
2023-06-19 02:04:02,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.182e+02 3.653e+02 4.375e+02 6.472e+02, threshold=7.306e+02, percent-clipped=0.0
2023-06-19 02:04:05,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=357330.0, ans=0.0
2023-06-19 02:05:00,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=357450.0, ans=0.05
2023-06-19 02:05:04,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=357450.0, ans=0.5
2023-06-19 02:05:25,361 INFO [train.py:996] (3/4) Epoch 2, batch 29100, loss[loss=0.2818, simple_loss=0.334, pruned_loss=0.1148, over 21819.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3581, pruned_loss=0.1239, over 4276863.08 frames.
], batch size: 98, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:05:48,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=357630.0, ans=0.0
2023-06-19 02:05:48,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357630.0, ans=0.1
2023-06-19 02:06:03,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=357690.0, ans=0.0
2023-06-19 02:06:25,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=357690.0, ans=10.0
2023-06-19 02:06:59,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=357810.0, ans=0.05
2023-06-19 02:07:04,286 INFO [train.py:996] (3/4) Epoch 2, batch 29150, loss[loss=0.314, simple_loss=0.3762, pruned_loss=0.1259, over 21666.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3576, pruned_loss=0.1228, over 4271885.69 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:07:04,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=357870.0, ans=0.125
2023-06-19 02:07:21,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 3.617e+02 4.258e+02 5.180e+02 9.047e+02, threshold=8.516e+02, percent-clipped=9.0
2023-06-19 02:07:57,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=357990.0, ans=0.125
2023-06-19 02:08:44,195 INFO [train.py:996] (3/4) Epoch 2, batch 29200, loss[loss=0.2922, simple_loss=0.3413, pruned_loss=0.1215, over 21802.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.352, pruned_loss=0.1219, over 4275601.66 frames. ], batch size: 102, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:09:14,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0
2023-06-19 02:10:23,990 INFO [train.py:996] (3/4) Epoch 2, batch 29250, loss[loss=0.2905, simple_loss=0.3666, pruned_loss=0.1072, over 21704.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3493, pruned_loss=0.1177, over 4275115.19 frames. ], batch size: 298, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:10:27,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=358470.0, ans=0.0
2023-06-19 02:10:36,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0
2023-06-19 02:10:39,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=358530.0, ans=0.125
2023-06-19 02:10:46,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.694e+02 3.473e+02 5.021e+02 8.866e+02, threshold=6.946e+02, percent-clipped=1.0
2023-06-19 02:12:04,056 INFO [train.py:996] (3/4) Epoch 2, batch 29300, loss[loss=0.2883, simple_loss=0.3299, pruned_loss=0.1233, over 21558.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3503, pruned_loss=0.1159, over 4272112.49 frames.
], batch size: 132, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:12:55,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.11 vs. limit=15.0
2023-06-19 02:13:07,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358890.0, ans=0.1
2023-06-19 02:13:44,854 INFO [train.py:996] (3/4) Epoch 2, batch 29350, loss[loss=0.305, simple_loss=0.3401, pruned_loss=0.1349, over 21843.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3445, pruned_loss=0.115, over 4256903.02 frames. ], batch size: 107, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:14:13,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 2.964e+02 3.404e+02 4.114e+02 7.296e+02, threshold=6.809e+02, percent-clipped=1.0
2023-06-19 02:14:54,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=359250.0, ans=0.0
2023-06-19 02:15:10,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0
2023-06-19 02:15:20,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=359370.0, ans=0.0
2023-06-19 02:15:21,990 INFO [train.py:996] (3/4) Epoch 2, batch 29400, loss[loss=0.3083, simple_loss=0.3689, pruned_loss=0.1238, over 21710.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3443, pruned_loss=0.112, over 4260707.88 frames. ], batch size: 298, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:15:26,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.90 vs. limit=15.0
2023-06-19 02:15:55,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=8.0
2023-06-19 02:16:24,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=359550.0, ans=0.04949747468305833
2023-06-19 02:17:03,073 INFO [train.py:996] (3/4) Epoch 2, batch 29450, loss[loss=0.2866, simple_loss=0.3535, pruned_loss=0.1098, over 20687.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3416, pruned_loss=0.1108, over 4259378.57 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:17:20,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=359670.0, ans=0.0
2023-06-19 02:17:24,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=359730.0, ans=0.125
2023-06-19 02:17:25,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.078e+02 3.690e+02 4.615e+02 7.103e+02, threshold=7.380e+02, percent-clipped=1.0
2023-06-19 02:17:49,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5
2023-06-19 02:17:50,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=359790.0, ans=0.05
2023-06-19 02:17:54,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.55 vs.
limit=12.0
2023-06-19 02:18:31,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=359910.0, ans=0.0
2023-06-19 02:18:37,418 INFO [train.py:996] (3/4) Epoch 2, batch 29500, loss[loss=0.1869, simple_loss=0.2327, pruned_loss=0.07055, over 21820.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3472, pruned_loss=0.115, over 4268479.55 frames. ], batch size: 102, lr: 1.45e-02, grad_scale: 32.0
2023-06-19 02:19:14,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=360030.0, ans=0.125
2023-06-19 02:19:15,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=360030.0, ans=0.0
2023-06-19 02:20:17,230 INFO [train.py:996] (3/4) Epoch 2, batch 29550, loss[loss=0.3075, simple_loss=0.3594, pruned_loss=0.1278, over 21936.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3474, pruned_loss=0.1179, over 4281303.07 frames. ], batch size: 113, lr: 1.45e-02, grad_scale: 64.0
2023-06-19 02:20:17,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=360270.0, ans=0.125
2023-06-19 02:20:33,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5
2023-06-19 02:20:35,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=360270.0, ans=0.125
2023-06-19 02:20:49,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.148e+02 3.536e+02 4.853e+02 9.360e+02, threshold=7.072e+02, percent-clipped=2.0
2023-06-19 02:21:12,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=360390.0, ans=0.125
2023-06-19 02:22:07,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360510.0, ans=0.125
2023-06-19 02:22:08,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360570.0, ans=0.125
2023-06-19 02:22:08,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=360570.0, ans=0.125
2023-06-19 02:22:10,003 INFO [train.py:996] (3/4) Epoch 2, batch 29600, loss[loss=0.3196, simple_loss=0.3928, pruned_loss=0.1232, over 21753.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3554, pruned_loss=0.1209, over 4287064.09 frames. ], batch size: 351, lr: 1.44e-02, grad_scale: 64.0
2023-06-19 02:22:14,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0
2023-06-19 02:22:30,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs.
limit=15.0
2023-06-19 02:22:40,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=360630.0, ans=0.125
2023-06-19 02:22:53,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=360690.0, ans=0.015
2023-06-19 02:23:06,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360750.0, ans=0.1
2023-06-19 02:23:43,623 INFO [train.py:996] (3/4) Epoch 2, batch 29650, loss[loss=0.2511, simple_loss=0.3195, pruned_loss=0.09137, over 21849.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3508, pruned_loss=0.116, over 4286445.17 frames. ], batch size: 332, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:23:43,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=360870.0, ans=0.0
2023-06-19 02:24:06,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0
2023-06-19 02:24:07,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.035e+02 3.587e+02 4.924e+02 8.544e+02, threshold=7.175e+02, percent-clipped=8.0
2023-06-19 02:25:04,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=361110.0, ans=0.125
2023-06-19 02:25:23,741 INFO [train.py:996] (3/4) Epoch 2, batch 29700, loss[loss=0.369, simple_loss=0.4559, pruned_loss=0.1411, over 21529.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3548, pruned_loss=0.1171, over 4277194.76 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:27:00,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=361410.0, ans=0.125
2023-06-19 02:27:03,531 INFO [train.py:996] (3/4) Epoch 2, batch 29750, loss[loss=0.2498, simple_loss=0.3148, pruned_loss=0.09238, over 21881.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3581, pruned_loss=0.1155, over 4264411.27 frames. ], batch size: 98, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:27:27,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 3.188e+02 3.972e+02 5.342e+02 1.059e+03, threshold=7.944e+02, percent-clipped=5.0
2023-06-19 02:27:42,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=361590.0, ans=0.125
2023-06-19 02:28:42,365 INFO [train.py:996] (3/4) Epoch 2, batch 29800, loss[loss=0.2913, simple_loss=0.3415, pruned_loss=0.1206, over 21251.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3593, pruned_loss=0.1168, over 4267350.33 frames. ], batch size: 608, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:29:36,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=361950.0, ans=0.0
2023-06-19 02:30:03,198 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.449e-02
2023-06-19 02:30:22,309 INFO [train.py:996] (3/4) Epoch 2, batch 29850, loss[loss=0.2395, simple_loss=0.3151, pruned_loss=0.08191, over 21748.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3549, pruned_loss=0.1143, over 4274671.24 frames.
], batch size: 332, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:30:25,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=362070.0, ans=0.2
2023-06-19 02:30:46,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.912e+02 3.664e+02 4.469e+02 7.842e+02, threshold=7.327e+02, percent-clipped=0.0
2023-06-19 02:30:53,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5
2023-06-19 02:30:55,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=362130.0, ans=0.125
2023-06-19 02:31:00,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=362190.0, ans=0.0
2023-06-19 02:31:15,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362250.0, ans=0.1
2023-06-19 02:31:31,154 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:31:48,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0
2023-06-19 02:32:06,351 INFO [train.py:996] (3/4) Epoch 2, batch 29900, loss[loss=0.3122, simple_loss=0.3604, pruned_loss=0.132, over 21368.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.355, pruned_loss=0.1163, over 4270503.25 frames. ], batch size: 176, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:32:08,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=362370.0, ans=0.125
2023-06-19 02:32:13,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=362370.0, ans=0.125
2023-06-19 02:32:27,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5
2023-06-19 02:32:31,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=362430.0, ans=0.0
2023-06-19 02:33:25,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0
2023-06-19 02:33:42,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0
2023-06-19 02:33:44,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362610.0, ans=0.0
2023-06-19 02:33:46,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=362610.0, ans=0.07
2023-06-19 02:33:48,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=362670.0, ans=0.125
2023-06-19 02:33:49,213 INFO [train.py:996] (3/4) Epoch 2, batch 29950, loss[loss=0.3079, simple_loss=0.3639, pruned_loss=0.126, over 21648.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3597, pruned_loss=0.1213, over 4272391.36 frames.
], batch size: 351, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:33:51,139 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:34:09,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.188e+02 4.013e+02 5.057e+02 1.029e+03, threshold=8.025e+02, percent-clipped=2.0
2023-06-19 02:34:36,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0
2023-06-19 02:35:20,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=362910.0, ans=0.125
2023-06-19 02:35:22,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0
2023-06-19 02:35:29,982 INFO [train.py:996] (3/4) Epoch 2, batch 30000, loss[loss=0.2606, simple_loss=0.3463, pruned_loss=0.0875, over 21878.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3632, pruned_loss=0.1222, over 4271778.04 frames. ], batch size: 316, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:35:29,982 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 02:35:47,463 INFO [train.py:1028] (3/4) Epoch 2, validation: loss=0.2693, simple_loss=0.3684, pruned_loss=0.08513, over 1796401.00 frames.
2023-06-19 02:35:47,464 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-19 02:36:55,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=363090.0, ans=0.2
2023-06-19 02:37:03,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=363150.0, ans=0.04949747468305833
2023-06-19 02:37:20,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363210.0, ans=0.125
2023-06-19 02:37:36,934 INFO [train.py:996] (3/4) Epoch 2, batch 30050, loss[loss=0.3496, simple_loss=0.4392, pruned_loss=0.13, over 21696.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3639, pruned_loss=0.1177, over 4261085.32 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:37:40,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=363270.0, ans=0.0
2023-06-19 02:37:50,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363270.0, ans=0.1
2023-06-19 02:38:06,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.812e+02 3.422e+02 4.683e+02 8.613e+02, threshold=6.845e+02, percent-clipped=2.0
2023-06-19 02:38:18,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=12.0
2023-06-19 02:38:46,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=12.0
2023-06-19 02:38:56,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=363510.0, ans=10.0
2023-06-19 02:39:15,581 INFO [train.py:996] (3/4) Epoch 2, batch 30100, loss[loss=0.2564, simple_loss=0.3017, pruned_loss=0.1056, over 21279.00 frames.
], tot_loss[loss=0.2974, simple_loss=0.361, pruned_loss=0.1169, over 4252804.20 frames. ], batch size: 176, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:40:12,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=363690.0, ans=0.0
2023-06-19 02:41:02,501 INFO [train.py:996] (3/4) Epoch 2, batch 30150, loss[loss=0.3124, simple_loss=0.3629, pruned_loss=0.1309, over 21598.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3593, pruned_loss=0.1204, over 4257439.61 frames. ], batch size: 230, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:41:11,909 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:41:28,248 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:41:30,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363930.0, ans=0.125
2023-06-19 02:41:32,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.181e+02 3.774e+02 4.610e+02 8.129e+02, threshold=7.548e+02, percent-clipped=2.0
2023-06-19 02:41:33,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363930.0, ans=0.1
2023-06-19 02:41:42,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363930.0, ans=0.125
2023-06-19 02:41:58,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0
2023-06-19 02:42:56,643 INFO [train.py:996] (3/4) Epoch 2, batch 30200, loss[loss=0.339, simple_loss=0.4109, pruned_loss=0.1336, over 20757.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3636, pruned_loss=0.1195, over 4258982.23 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:43:14,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.64 vs. limit=6.0
2023-06-19 02:43:57,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=364350.0, ans=0.125
2023-06-19 02:44:18,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=364410.0, ans=0.125
2023-06-19 02:44:34,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=364410.0, ans=0.025
2023-06-19 02:44:39,225 INFO [train.py:996] (3/4) Epoch 2, batch 30250, loss[loss=0.4145, simple_loss=0.4841, pruned_loss=0.1725, over 21533.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3712, pruned_loss=0.1221, over 4265470.13 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:44:51,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.11 vs.
limit=6.0
2023-06-19 02:44:58,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 3.107e+02 3.710e+02 5.079e+02 9.516e+02, threshold=7.420e+02, percent-clipped=5.0
2023-06-19 02:45:45,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=364650.0, ans=0.05
2023-06-19 02:46:19,608 INFO [train.py:996] (3/4) Epoch 2, batch 30300, loss[loss=0.288, simple_loss=0.3324, pruned_loss=0.1218, over 21755.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3687, pruned_loss=0.1223, over 4253601.94 frames. ], batch size: 318, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:47:21,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364890.0, ans=0.0
2023-06-19 02:47:47,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=365010.0, ans=0.125
2023-06-19 02:48:03,254 INFO [train.py:996] (3/4) Epoch 2, batch 30350, loss[loss=0.1917, simple_loss=0.2194, pruned_loss=0.08201, over 17262.00 frames. ], tot_loss[loss=0.309, simple_loss=0.37, pruned_loss=0.124, over 4251504.10 frames. ], batch size: 65, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:48:25,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.339e+02 3.934e+02 4.976e+02 9.196e+02, threshold=7.868e+02, percent-clipped=1.0
2023-06-19 02:49:31,180 INFO [train.py:996] (3/4) Epoch 2, batch 30400, loss[loss=0.2619, simple_loss=0.3065, pruned_loss=0.1087, over 20095.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3638, pruned_loss=0.1218, over 4245797.52 frames. ], batch size: 702, lr: 1.44e-02, grad_scale: 32.0
2023-06-19 02:50:11,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=365490.0, ans=0.125
2023-06-19 02:50:56,367 INFO [train.py:996] (3/4) Epoch 2, batch 30450, loss[loss=0.3582, simple_loss=0.4597, pruned_loss=0.1283, over 19819.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3672, pruned_loss=0.1229, over 4189880.66 frames.
], batch size: 702, lr: 1.43e-02, grad_scale: 32.0
2023-06-19 02:51:01,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365670.0, ans=0.1
2023-06-19 02:51:10,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=365670.0, ans=0.07
2023-06-19 02:51:15,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 4.343e+02 5.750e+02 8.532e+02 2.294e+03, threshold=1.150e+03, percent-clipped=29.0
2023-06-19 02:51:22,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=365730.0, ans=0.125
2023-06-19 02:51:25,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=365730.0, ans=10.0
2023-06-19 02:51:50,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=365850.0, ans=0.025
2023-06-19 02:51:59,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=365910.0, ans=0.0
2023-06-19 02:53:32,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=365934.0, ans=0.125
2023-06-19 02:53:41,468 INFO [train.py:996] (3/4) Epoch 3, batch 0, loss[loss=0.2832, simple_loss=0.3286, pruned_loss=0.1189, over 21279.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3286, pruned_loss=0.1189, over 21279.00 frames. ], batch size: 551, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 02:53:41,468 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 02:53:57,717 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2735, simple_loss=0.3782, pruned_loss=0.08435, over 1796401.00 frames.
2023-06-19 02:53:57,717 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-19 02:53:58,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=365934.0, ans=0.125
2023-06-19 02:54:38,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=366054.0, ans=0.125
2023-06-19 02:54:49,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=366054.0, ans=0.0
2023-06-19 02:54:50,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=366054.0, ans=0.2
2023-06-19 02:55:36,565 INFO [train.py:996] (3/4) Epoch 3, batch 50, loss[loss=0.3437, simple_loss=0.4044, pruned_loss=0.1415, over 21416.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3586, pruned_loss=0.1216, over 949405.61 frames.
], batch size: 471, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 02:56:10,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.611e+02 4.559e+02 6.599e+02 1.492e+03, threshold=9.117e+02, percent-clipped=9.0
2023-06-19 02:56:28,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=366354.0, ans=0.125
2023-06-19 02:56:46,088 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 02:56:55,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=366474.0, ans=0.0
2023-06-19 02:56:56,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=366474.0, ans=0.0
2023-06-19 02:57:15,637 INFO [train.py:996] (3/4) Epoch 3, batch 100, loss[loss=0.3157, simple_loss=0.3798, pruned_loss=0.1258, over 21733.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3815, pruned_loss=0.1264, over 1674879.44 frames. ], batch size: 298, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 02:57:46,871 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=5.081e-03
2023-06-19 02:57:51,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366654.0, ans=0.1
2023-06-19 02:57:53,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=366654.0, ans=0.0
2023-06-19 02:58:51,825 INFO [train.py:996] (3/4) Epoch 3, batch 150, loss[loss=0.3735, simple_loss=0.4444, pruned_loss=0.1513, over 21655.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3806, pruned_loss=0.1235, over 2255593.71 frames. ], batch size: 441, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 02:59:22,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=366894.0, ans=0.125
2023-06-19 02:59:24,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=366894.0, ans=0.2
2023-06-19 02:59:25,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.063e+02 3.532e+02 4.732e+02 9.517e+02, threshold=7.065e+02, percent-clipped=1.0
2023-06-19 02:59:46,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=366954.0, ans=0.125
2023-06-19 03:00:02,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=367014.0, ans=0.0
2023-06-19 03:00:30,814 INFO [train.py:996] (3/4) Epoch 3, batch 200, loss[loss=0.2848, simple_loss=0.3814, pruned_loss=0.09407, over 19844.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3754, pruned_loss=0.1207, over 2706236.65 frames. ], batch size: 702, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 03:00:46,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=367194.0, ans=0.125
2023-06-19 03:01:21,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367254.0, ans=0.125
2023-06-19 03:02:09,300 INFO [train.py:996] (3/4) Epoch 3, batch 250, loss[loss=0.3496, simple_loss=0.4063, pruned_loss=0.1465, over 21567.00 frames.
], tot_loss[loss=0.309, simple_loss=0.3741, pruned_loss=0.122, over 3043103.98 frames. ], batch size: 414, lr: 1.22e-02, grad_scale: 32.0
2023-06-19 03:02:41,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.28 vs. limit=22.5
2023-06-19 03:02:42,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.832e+02 3.615e+02 5.126e+02 8.493e+02, threshold=7.230e+02, percent-clipped=8.0
2023-06-19 03:02:52,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=367554.0, ans=0.0
2023-06-19 03:03:49,716 INFO [train.py:996] (3/4) Epoch 3, batch 300, loss[loss=0.2755, simple_loss=0.3292, pruned_loss=0.1109, over 21355.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3656, pruned_loss=0.1198, over 3309841.87 frames. ], batch size: 176, lr: 1.21e-02, grad_scale: 32.0
2023-06-19 03:04:04,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=367794.0, ans=0.2
2023-06-19 03:04:27,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=367854.0, ans=0.0
2023-06-19 03:04:31,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0
2023-06-19 03:04:54,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=367914.0, ans=0.125
2023-06-19 03:05:25,833 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 03:05:31,305 INFO [train.py:996] (3/4) Epoch 3, batch 350, loss[loss=0.3562, simple_loss=0.3947, pruned_loss=0.1589, over 21429.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3566, pruned_loss=0.1172, over 3527506.20 frames. ], batch size: 473, lr: 1.21e-02, grad_scale: 32.0
2023-06-19 03:05:59,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368094.0, ans=0.1
2023-06-19 03:06:01,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=368094.0, ans=0.0
2023-06-19 03:06:05,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.954e+02 3.445e+02 4.197e+02 6.448e+02, threshold=6.891e+02, percent-clipped=0.0
2023-06-19 03:06:46,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368274.0, ans=0.125
2023-06-19 03:06:48,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=368274.0, ans=0.0
2023-06-19 03:07:12,359 INFO [train.py:996] (3/4) Epoch 3, batch 400, loss[loss=0.3037, simple_loss=0.3841, pruned_loss=0.1117, over 21386.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3464, pruned_loss=0.1139, over 3685230.97 frames.
], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:07:24,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=368334.0, ans=0.125 2023-06-19 03:07:59,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=368454.0, ans=0.2 2023-06-19 03:08:53,080 INFO [train.py:996] (3/4) Epoch 3, batch 450, loss[loss=0.3301, simple_loss=0.3493, pruned_loss=0.1554, over 21420.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3428, pruned_loss=0.1127, over 3822957.19 frames. ], batch size: 509, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:08:58,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=368634.0, ans=0.2 2023-06-19 03:08:58,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=368634.0, ans=0.5 2023-06-19 03:09:27,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.891e+02 3.614e+02 4.402e+02 7.378e+02, threshold=7.228e+02, percent-clipped=3.0 2023-06-19 03:10:28,814 INFO [train.py:996] (3/4) Epoch 3, batch 500, loss[loss=0.2309, simple_loss=0.2899, pruned_loss=0.08593, over 21465.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3464, pruned_loss=0.1113, over 3918755.98 frames. ], batch size: 212, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:10:55,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=368994.0, ans=0.125 2023-06-19 03:10:58,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=368994.0, ans=0.125 2023-06-19 03:11:06,292 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:11:13,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=369054.0, ans=0.125 2023-06-19 03:11:30,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=369114.0, ans=0.125 2023-06-19 03:12:08,123 INFO [train.py:996] (3/4) Epoch 3, batch 550, loss[loss=0.3087, simple_loss=0.4031, pruned_loss=0.1071, over 21654.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3524, pruned_loss=0.112, over 3999663.24 frames. ], batch size: 414, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:12:46,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.117e+02 3.637e+02 4.984e+02 1.103e+03, threshold=7.274e+02, percent-clipped=1.0 2023-06-19 03:12:50,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=369354.0, ans=0.2 2023-06-19 03:12:57,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-19 03:13:47,826 INFO [train.py:996] (3/4) Epoch 3, batch 600, loss[loss=0.2532, simple_loss=0.2958, pruned_loss=0.1053, over 20805.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.354, pruned_loss=0.1123, over 4063105.40 frames. 
], batch size: 609, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:14:22,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369594.0, ans=0.1 2023-06-19 03:15:19,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=369774.0, ans=0.0 2023-06-19 03:15:28,832 INFO [train.py:996] (3/4) Epoch 3, batch 650, loss[loss=0.3083, simple_loss=0.3625, pruned_loss=0.1271, over 21889.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3555, pruned_loss=0.1131, over 4113034.58 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:15:44,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369834.0, ans=0.1 2023-06-19 03:15:50,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=369894.0, ans=0.125 2023-06-19 03:16:02,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.260e+02 4.172e+02 5.495e+02 8.347e+02, threshold=8.343e+02, percent-clipped=4.0 2023-06-19 03:16:22,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=369954.0, ans=0.5 2023-06-19 03:16:33,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=370014.0, ans=0.0 2023-06-19 03:17:09,751 INFO [train.py:996] (3/4) Epoch 3, batch 700, loss[loss=0.3755, simple_loss=0.4434, pruned_loss=0.1538, over 21689.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3599, pruned_loss=0.1146, over 4153355.65 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:17:28,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=370194.0, ans=0.05 2023-06-19 03:18:11,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=370314.0, ans=15.0 2023-06-19 03:18:40,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-19 03:18:49,213 INFO [train.py:996] (3/4) Epoch 3, batch 750, loss[loss=0.2813, simple_loss=0.3306, pruned_loss=0.116, over 21592.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3581, pruned_loss=0.1159, over 4190732.03 frames. ], batch size: 391, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:19:17,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=370494.0, ans=0.125 2023-06-19 03:19:27,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=370494.0, ans=10.0 2023-06-19 03:19:28,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.022e+02 3.507e+02 4.070e+02 7.167e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-19 03:19:40,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370554.0, ans=0.1 2023-06-19 03:19:49,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.09 vs. 
limit=6.0 2023-06-19 03:20:31,091 INFO [train.py:996] (3/4) Epoch 3, batch 800, loss[loss=0.3595, simple_loss=0.3913, pruned_loss=0.1639, over 21584.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3548, pruned_loss=0.1154, over 4210940.84 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:21:00,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370794.0, ans=0.1 2023-06-19 03:21:22,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=370854.0, ans=0.0 2023-06-19 03:21:25,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=370854.0, ans=0.07 2023-06-19 03:21:27,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=370914.0, ans=0.125 2023-06-19 03:22:06,187 INFO [train.py:996] (3/4) Epoch 3, batch 850, loss[loss=0.284, simple_loss=0.3299, pruned_loss=0.1191, over 21573.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3522, pruned_loss=0.1149, over 4225474.93 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:22:08,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=371034.0, ans=0.125 2023-06-19 03:22:24,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=371034.0, ans=0.125 2023-06-19 03:22:33,654 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:22:46,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.110e+02 3.682e+02 5.059e+02 8.553e+02, threshold=7.364e+02, percent-clipped=4.0 2023-06-19 03:23:21,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371214.0, ans=0.125 2023-06-19 03:23:22,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2023-06-19 03:23:29,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=371274.0, ans=0.125 2023-06-19 03:23:37,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=371274.0, ans=0.125 2023-06-19 03:23:43,032 INFO [train.py:996] (3/4) Epoch 3, batch 900, loss[loss=0.2301, simple_loss=0.2987, pruned_loss=0.08068, over 21243.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3483, pruned_loss=0.1132, over 4245003.75 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:24:03,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=371394.0, ans=0.0 2023-06-19 03:24:31,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-19 03:25:00,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.07 vs. 
limit=15.0 2023-06-19 03:25:02,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=371574.0, ans=0.0 2023-06-19 03:25:07,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-19 03:25:10,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371574.0, ans=0.125 2023-06-19 03:25:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=371634.0, ans=0.2 2023-06-19 03:25:24,167 INFO [train.py:996] (3/4) Epoch 3, batch 950, loss[loss=0.2586, simple_loss=0.3464, pruned_loss=0.08536, over 21746.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3471, pruned_loss=0.1122, over 4258229.86 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:25:48,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=371694.0, ans=0.125 2023-06-19 03:25:59,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.854e+02 3.566e+02 4.630e+02 9.213e+02, threshold=7.133e+02, percent-clipped=4.0 2023-06-19 03:26:01,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 03:26:04,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=371754.0, ans=0.05 2023-06-19 03:26:12,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 03:26:12,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=371754.0, ans=0.125 2023-06-19 03:26:50,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.93 vs. limit=12.0 2023-06-19 03:27:03,896 INFO [train.py:996] (3/4) Epoch 3, batch 1000, loss[loss=0.2919, simple_loss=0.3437, pruned_loss=0.1201, over 21770.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.347, pruned_loss=0.1127, over 4272612.74 frames. 
], batch size: 247, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:27:13,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=371934.0, ans=0.0 2023-06-19 03:27:23,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=371994.0, ans=0.125 2023-06-19 03:27:27,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=371994.0, ans=0.125 2023-06-19 03:27:27,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=371994.0, ans=0.125 2023-06-19 03:27:37,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371994.0, ans=0.1 2023-06-19 03:27:40,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=372054.0, ans=0.125 2023-06-19 03:28:03,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=372114.0, ans=0.2 2023-06-19 03:28:20,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=372114.0, ans=0.125 2023-06-19 03:28:25,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=372174.0, ans=0.2 2023-06-19 03:28:48,331 INFO [train.py:996] (3/4) Epoch 3, batch 1050, loss[loss=0.3146, simple_loss=0.3569, pruned_loss=0.1361, over 21570.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3469, pruned_loss=0.1127, over 4281410.39 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:29:11,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=372294.0, ans=0.0 2023-06-19 03:29:15,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=372294.0, ans=0.125 2023-06-19 03:29:24,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.077e+02 3.761e+02 4.435e+02 8.515e+02, threshold=7.523e+02, percent-clipped=2.0 2023-06-19 03:30:22,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=372474.0, ans=0.125 2023-06-19 03:30:31,851 INFO [train.py:996] (3/4) Epoch 3, batch 1100, loss[loss=0.2795, simple_loss=0.3469, pruned_loss=0.106, over 21428.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3463, pruned_loss=0.1121, over 4274128.85 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:30:47,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=372534.0, ans=0.125 2023-06-19 03:30:50,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=372534.0, ans=0.0 2023-06-19 03:30:52,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=372594.0, ans=0.125 2023-06-19 03:31:11,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.86 vs. 
limit=22.5 2023-06-19 03:31:52,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=372714.0, ans=0.2 2023-06-19 03:31:56,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2023-06-19 03:32:17,140 INFO [train.py:996] (3/4) Epoch 3, batch 1150, loss[loss=0.2627, simple_loss=0.3276, pruned_loss=0.09894, over 21367.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3464, pruned_loss=0.113, over 4273222.34 frames. ], batch size: 131, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:32:32,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=372834.0, ans=0.0 2023-06-19 03:33:03,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.941e+02 3.564e+02 4.361e+02 9.852e+02, threshold=7.128e+02, percent-clipped=2.0 2023-06-19 03:33:36,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=373014.0, ans=0.125 2023-06-19 03:34:05,564 INFO [train.py:996] (3/4) Epoch 3, batch 1200, loss[loss=0.2419, simple_loss=0.3183, pruned_loss=0.0828, over 21272.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3471, pruned_loss=0.1125, over 4278424.10 frames. ], batch size: 176, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:34:13,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=373134.0, ans=15.0 2023-06-19 03:34:29,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373194.0, ans=0.1 2023-06-19 03:34:55,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=373254.0, ans=0.0 2023-06-19 03:35:49,161 INFO [train.py:996] (3/4) Epoch 3, batch 1250, loss[loss=0.3067, simple_loss=0.3571, pruned_loss=0.1282, over 21848.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3494, pruned_loss=0.1131, over 4283719.14 frames. ], batch size: 107, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:36:08,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-19 03:36:30,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 3.067e+02 3.657e+02 4.609e+02 8.051e+02, threshold=7.314e+02, percent-clipped=2.0 2023-06-19 03:36:48,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=373554.0, ans=0.125 2023-06-19 03:36:55,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. 
limit=22.5 2023-06-19 03:36:56,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373614.0, ans=0.1 2023-06-19 03:37:00,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373614.0, ans=0.1 2023-06-19 03:37:00,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=373614.0, ans=0.125 2023-06-19 03:37:14,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-19 03:37:32,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373734.0, ans=0.1 2023-06-19 03:37:33,724 INFO [train.py:996] (3/4) Epoch 3, batch 1300, loss[loss=0.3357, simple_loss=0.3882, pruned_loss=0.1416, over 21732.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3522, pruned_loss=0.114, over 4282886.69 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:38:31,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-19 03:38:34,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373854.0, ans=0.1 2023-06-19 03:38:38,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=373854.0, ans=0.95 2023-06-19 03:39:08,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=373974.0, ans=0.125 2023-06-19 03:39:18,826 INFO [train.py:996] (3/4) Epoch 3, batch 1350, loss[loss=0.2582, simple_loss=0.3251, pruned_loss=0.09563, over 21438.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3538, pruned_loss=0.1153, over 4289270.80 frames. ], batch size: 194, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:40:01,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.516e+02 4.679e+02 5.899e+02 9.616e+02, threshold=9.359e+02, percent-clipped=8.0 2023-06-19 03:40:26,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-19 03:41:00,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-19 03:41:03,055 INFO [train.py:996] (3/4) Epoch 3, batch 1400, loss[loss=0.2544, simple_loss=0.3115, pruned_loss=0.09869, over 21753.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.354, pruned_loss=0.1155, over 4295080.50 frames. ], batch size: 124, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:42:32,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374574.0, ans=0.125 2023-06-19 03:42:37,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=374574.0, ans=0.125 2023-06-19 03:42:47,131 INFO [train.py:996] (3/4) Epoch 3, batch 1450, loss[loss=0.2681, simple_loss=0.3146, pruned_loss=0.1107, over 21636.00 frames. 
], tot_loss[loss=0.2936, simple_loss=0.3543, pruned_loss=0.1165, over 4291973.70 frames. ], batch size: 415, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:43:28,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.105e+02 3.604e+02 4.454e+02 7.120e+02, threshold=7.209e+02, percent-clipped=0.0 2023-06-19 03:43:31,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=374754.0, ans=0.2 2023-06-19 03:44:29,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=374874.0, ans=0.0 2023-06-19 03:44:32,076 INFO [train.py:996] (3/4) Epoch 3, batch 1500, loss[loss=0.3389, simple_loss=0.3826, pruned_loss=0.1476, over 21738.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3531, pruned_loss=0.1171, over 4299771.52 frames. ], batch size: 112, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:45:52,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-19 03:46:17,197 INFO [train.py:996] (3/4) Epoch 3, batch 1550, loss[loss=0.2335, simple_loss=0.3043, pruned_loss=0.08134, over 21137.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3504, pruned_loss=0.1153, over 4300898.85 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:47:05,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.746e+02 3.313e+02 3.948e+02 6.762e+02, threshold=6.626e+02, percent-clipped=0.0 2023-06-19 03:47:52,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=375474.0, ans=0.0 2023-06-19 03:48:14,551 INFO [train.py:996] (3/4) Epoch 3, batch 1600, loss[loss=0.3586, simple_loss=0.4206, pruned_loss=0.1483, over 21635.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.347, pruned_loss=0.1128, over 4296868.15 frames. ], batch size: 414, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:48:16,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375534.0, ans=0.1 2023-06-19 03:48:52,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=375654.0, ans=0.0 2023-06-19 03:49:01,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375654.0, ans=0.1 2023-06-19 03:49:24,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.58 vs. limit=22.5 2023-06-19 03:49:44,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-19 03:50:00,854 INFO [train.py:996] (3/4) Epoch 3, batch 1650, loss[loss=0.2623, simple_loss=0.31, pruned_loss=0.1073, over 21472.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3458, pruned_loss=0.1122, over 4292279.80 frames. 
], batch size: 212, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:50:02,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=375834.0, ans=10.0 2023-06-19 03:50:32,096 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:50:38,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.775e+02 3.357e+02 4.211e+02 7.088e+02, threshold=6.714e+02, percent-clipped=2.0 2023-06-19 03:51:13,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-19 03:51:45,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=376074.0, ans=0.125 2023-06-19 03:51:49,328 INFO [train.py:996] (3/4) Epoch 3, batch 1700, loss[loss=0.3321, simple_loss=0.3766, pruned_loss=0.1438, over 21593.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3521, pruned_loss=0.1145, over 4290361.24 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:52:09,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-19 03:52:41,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376254.0, ans=0.1 2023-06-19 03:52:49,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=376254.0, ans=0.125 2023-06-19 03:53:43,156 INFO [train.py:996] (3/4) Epoch 3, batch 1750, loss[loss=0.2742, simple_loss=0.3292, pruned_loss=0.1096, over 20144.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3516, pruned_loss=0.1127, over 4289506.81 frames. ], batch size: 704, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:53:45,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=376434.0, ans=0.0 2023-06-19 03:54:23,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.144e+02 4.448e+02 5.330e+02 9.147e+02, threshold=8.897e+02, percent-clipped=12.0 2023-06-19 03:54:41,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=376554.0, ans=0.2 2023-06-19 03:55:16,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376674.0, ans=0.1 2023-06-19 03:55:20,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376674.0, ans=0.1 2023-06-19 03:55:31,661 INFO [train.py:996] (3/4) Epoch 3, batch 1800, loss[loss=0.2887, simple_loss=0.3512, pruned_loss=0.1131, over 21047.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3462, pruned_loss=0.1089, over 4277481.34 frames. 
], batch size: 607, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:55:32,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=376734.0, ans=0.125 2023-06-19 03:56:17,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=376854.0, ans=0.2 2023-06-19 03:56:43,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-19 03:56:55,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=376974.0, ans=0.2 2023-06-19 03:57:11,728 INFO [train.py:996] (3/4) Epoch 3, batch 1850, loss[loss=0.2981, simple_loss=0.3559, pruned_loss=0.1201, over 21425.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3466, pruned_loss=0.1063, over 4276231.99 frames. ], batch size: 144, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:57:22,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=377034.0, ans=0.2 2023-06-19 03:57:24,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=377034.0, ans=0.125 2023-06-19 03:57:39,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=377094.0, ans=0.125 2023-06-19 03:57:41,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-19 03:57:44,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377094.0, ans=0.125 2023-06-19 03:57:44,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=377094.0, ans=0.0 2023-06-19 03:58:00,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.940e+02 3.521e+02 4.849e+02 8.658e+02, threshold=7.043e+02, percent-clipped=0.0 2023-06-19 03:58:34,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-19 03:59:02,446 INFO [train.py:996] (3/4) Epoch 3, batch 1900, loss[loss=0.3282, simple_loss=0.3608, pruned_loss=0.1478, over 21801.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3463, pruned_loss=0.107, over 4271480.24 frames. ], batch size: 508, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:59:15,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=377334.0, ans=0.125 2023-06-19 04:00:17,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=377514.0, ans=0.125 2023-06-19 04:00:46,507 INFO [train.py:996] (3/4) Epoch 3, batch 1950, loss[loss=0.2406, simple_loss=0.3165, pruned_loss=0.08239, over 21611.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3419, pruned_loss=0.107, over 4278170.08 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:00:57,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.34 vs. 
limit=15.0 2023-06-19 04:01:09,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377694.0, ans=0.125 2023-06-19 04:01:30,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 3.084e+02 3.765e+02 4.629e+02 7.601e+02, threshold=7.530e+02, percent-clipped=2.0 2023-06-19 04:01:39,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=377754.0, ans=0.125 2023-06-19 04:01:41,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=12.0 2023-06-19 04:01:51,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=377814.0, ans=0.125 2023-06-19 04:02:32,669 INFO [train.py:996] (3/4) Epoch 3, batch 2000, loss[loss=0.2144, simple_loss=0.2749, pruned_loss=0.07698, over 21224.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.336, pruned_loss=0.1048, over 4265317.65 frames. ], batch size: 159, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:02:50,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-19 04:03:03,805 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.605e-03 2023-06-19 04:03:17,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-19 04:03:20,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=378054.0, ans=0.125 2023-06-19 04:04:11,040 INFO [train.py:996] (3/4) Epoch 3, batch 2050, loss[loss=0.316, simple_loss=0.3725, pruned_loss=0.1298, over 21716.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3383, pruned_loss=0.1056, over 4276204.34 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:04:21,632 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2023-06-19 04:04:51,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=378294.0, ans=0.0 2023-06-19 04:04:52,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=378354.0, ans=0.0 2023-06-19 04:04:54,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.993e+02 3.653e+02 4.561e+02 8.702e+02, threshold=7.306e+02, percent-clipped=1.0 2023-06-19 04:05:16,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=378414.0, ans=0.0 2023-06-19 04:05:29,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378414.0, ans=0.1 2023-06-19 04:05:35,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=378474.0, ans=0.02 2023-06-19 04:05:52,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=378534.0, ans=0.125 2023-06-19 04:05:54,036 INFO [train.py:996] (3/4) Epoch 3, batch 2100, loss[loss=0.331, simple_loss=0.3811, pruned_loss=0.1405, over 21865.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3424, pruned_loss=0.108, over 4282275.35 frames. ], batch size: 98, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:05:54,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=378534.0, ans=0.125 2023-06-19 04:05:55,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378534.0, ans=0.125 2023-06-19 04:05:57,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=378534.0, ans=0.125 2023-06-19 04:06:23,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-19 04:06:45,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378654.0, ans=0.1 2023-06-19 04:07:32,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-06-19 04:07:39,930 INFO [train.py:996] (3/4) Epoch 3, batch 2150, loss[loss=0.2182, simple_loss=0.3081, pruned_loss=0.06417, over 20827.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3405, pruned_loss=0.1079, over 4274414.48 frames. ], batch size: 608, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:08:17,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-19 04:08:30,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.207e+02 3.919e+02 5.012e+02 8.780e+02, threshold=7.837e+02, percent-clipped=4.0 2023-06-19 04:08:33,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=378954.0, ans=10.0 2023-06-19 04:09:24,826 INFO [train.py:996] (3/4) Epoch 3, batch 2200, loss[loss=0.2662, simple_loss=0.3117, pruned_loss=0.1104, over 21446.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3456, pruned_loss=0.1089, over 4273595.85 frames. 
], batch size: 177, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:09:53,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=379194.0, ans=0.125 2023-06-19 04:10:25,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-19 04:10:30,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379314.0, ans=0.125 2023-06-19 04:11:09,104 INFO [train.py:996] (3/4) Epoch 3, batch 2250, loss[loss=0.2729, simple_loss=0.3263, pruned_loss=0.1097, over 21587.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3433, pruned_loss=0.1074, over 4266985.58 frames. ], batch size: 414, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:11:33,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379434.0, ans=0.125 2023-06-19 04:11:39,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=379494.0, ans=0.95 2023-06-19 04:11:56,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.971e+02 3.785e+02 4.786e+02 8.748e+02, threshold=7.570e+02, percent-clipped=4.0 2023-06-19 04:11:57,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=379554.0, ans=0.2 2023-06-19 04:12:18,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=379614.0, ans=0.125 2023-06-19 04:12:52,017 INFO [train.py:996] (3/4) Epoch 3, batch 2300, loss[loss=0.3216, simple_loss=0.3929, pruned_loss=0.1252, over 20687.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3401, pruned_loss=0.1075, over 4273188.56 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:13:33,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379794.0, ans=0.1 2023-06-19 04:13:47,111 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:13:49,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-06-19 04:14:14,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2023-06-19 04:14:42,064 INFO [train.py:996] (3/4) Epoch 3, batch 2350, loss[loss=0.3044, simple_loss=0.3534, pruned_loss=0.1277, over 21536.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3394, pruned_loss=0.1086, over 4262837.57 frames. ], batch size: 230, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:15:27,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.227e+02 3.663e+02 5.028e+02 9.666e+02, threshold=7.327e+02, percent-clipped=5.0 2023-06-19 04:16:05,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2023-06-19 04:16:24,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. 
limit=15.0 2023-06-19 04:16:35,282 INFO [train.py:996] (3/4) Epoch 3, batch 2400, loss[loss=0.2636, simple_loss=0.3358, pruned_loss=0.09569, over 21470.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3443, pruned_loss=0.1123, over 4267387.44 frames. ], batch size: 131, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:16:40,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=380334.0, ans=0.2 2023-06-19 04:16:41,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-19 04:16:47,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=380334.0, ans=0.125 2023-06-19 04:16:56,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=380394.0, ans=0.0 2023-06-19 04:16:58,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5 2023-06-19 04:17:11,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=380394.0, ans=0.125 2023-06-19 04:17:43,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=380514.0, ans=0.125 2023-06-19 04:18:10,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=380574.0, ans=0.2 2023-06-19 04:18:12,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=8.0 2023-06-19 04:18:13,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=380574.0, ans=0.125 2023-06-19 04:18:21,200 INFO [train.py:996] (3/4) Epoch 3, batch 2450, loss[loss=0.3036, simple_loss=0.3688, pruned_loss=0.1192, over 21190.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3501, pruned_loss=0.1147, over 4268288.91 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:18:35,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-19 04:19:00,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.007e+02 3.779e+02 4.461e+02 8.893e+02, threshold=7.558e+02, percent-clipped=3.0 2023-06-19 04:19:04,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=380754.0, ans=0.2 2023-06-19 04:19:59,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.03 vs. limit=22.5 2023-06-19 04:20:04,638 INFO [train.py:996] (3/4) Epoch 3, batch 2500, loss[loss=0.2886, simple_loss=0.3737, pruned_loss=0.1018, over 21688.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3482, pruned_loss=0.113, over 4266029.13 frames. ], batch size: 332, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:20:30,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-19 04:20:42,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381054.0, ans=0.125 2023-06-19 04:20:52,124 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:20:58,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=381114.0, ans=0.2 2023-06-19 04:21:11,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-19 04:21:34,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-19 04:21:50,560 INFO [train.py:996] (3/4) Epoch 3, batch 2550, loss[loss=0.2523, simple_loss=0.3691, pruned_loss=0.06769, over 19683.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3463, pruned_loss=0.1115, over 4263505.35 frames. ], batch size: 702, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:22:03,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=381234.0, ans=0.0 2023-06-19 04:22:23,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=381294.0, ans=0.125 2023-06-19 04:22:31,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.967e+02 3.517e+02 4.789e+02 7.584e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-19 04:22:37,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=381354.0, ans=0.0 2023-06-19 04:23:12,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-19 04:23:26,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=381474.0, ans=0.2 2023-06-19 04:23:36,600 INFO [train.py:996] (3/4) Epoch 3, batch 2600, loss[loss=0.3495, simple_loss=0.3903, pruned_loss=0.1543, over 21580.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3479, pruned_loss=0.1131, over 4262898.31 frames. ], batch size: 415, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:23:57,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=381594.0, ans=0.125 2023-06-19 04:24:54,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=381774.0, ans=0.125 2023-06-19 04:25:24,296 INFO [train.py:996] (3/4) Epoch 3, batch 2650, loss[loss=0.3213, simple_loss=0.3825, pruned_loss=0.13, over 21593.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3496, pruned_loss=0.1141, over 4267725.83 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:25:32,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. 
limit=15.0 2023-06-19 04:26:05,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.151e+02 3.898e+02 4.845e+02 8.708e+02, threshold=7.796e+02, percent-clipped=4.0 2023-06-19 04:26:45,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382014.0, ans=0.1 2023-06-19 04:26:55,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=382074.0, ans=0.125 2023-06-19 04:27:09,884 INFO [train.py:996] (3/4) Epoch 3, batch 2700, loss[loss=0.2954, simple_loss=0.3567, pruned_loss=0.117, over 21803.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3457, pruned_loss=0.1121, over 4263228.80 frames. ], batch size: 351, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:27:42,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=382194.0, ans=0.2 2023-06-19 04:28:30,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=382314.0, ans=0.125 2023-06-19 04:28:55,299 INFO [train.py:996] (3/4) Epoch 3, batch 2750, loss[loss=0.301, simple_loss=0.3578, pruned_loss=0.122, over 21719.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3441, pruned_loss=0.1119, over 4266806.70 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:29:25,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-19 04:29:31,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=382494.0, ans=0.2 2023-06-19 04:29:37,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.438e+02 4.314e+02 5.827e+02 1.229e+03, threshold=8.627e+02, percent-clipped=3.0 2023-06-19 04:29:51,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=382554.0, ans=0.125 2023-06-19 04:29:56,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=382614.0, ans=0.0 2023-06-19 04:30:19,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0 2023-06-19 04:30:22,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=382674.0, ans=0.125 2023-06-19 04:30:31,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=382674.0, ans=0.0 2023-06-19 04:30:41,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=382674.0, ans=0.125 2023-06-19 04:30:45,968 INFO [train.py:996] (3/4) Epoch 3, batch 2800, loss[loss=0.3246, simple_loss=0.3914, pruned_loss=0.1289, over 21849.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.348, pruned_loss=0.1137, over 4272940.32 frames. ], batch size: 316, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:30:48,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=382734.0, ans=0.0 2023-06-19 04:31:25,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=15.0 2023-06-19 04:31:31,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=382854.0, ans=0.0 2023-06-19 04:31:42,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=382854.0, ans=0.125 2023-06-19 04:32:07,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=382914.0, ans=0.125 2023-06-19 04:32:32,059 INFO [train.py:996] (3/4) Epoch 3, batch 2850, loss[loss=0.2578, simple_loss=0.316, pruned_loss=0.09978, over 21558.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3485, pruned_loss=0.1142, over 4276587.75 frames. ], batch size: 212, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:33:04,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=383094.0, ans=0.07 2023-06-19 04:33:19,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.317e+02 3.934e+02 4.710e+02 8.134e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-19 04:33:44,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-19 04:34:13,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=383274.0, ans=0.125 2023-06-19 04:34:17,097 INFO [train.py:996] (3/4) Epoch 3, batch 2900, loss[loss=0.2846, simple_loss=0.3351, pruned_loss=0.1171, over 21891.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3477, pruned_loss=0.1148, over 4274322.02 frames. ], batch size: 351, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:34:46,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=383394.0, ans=10.0 2023-06-19 04:35:12,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383454.0, ans=0.0 2023-06-19 04:36:02,554 INFO [train.py:996] (3/4) Epoch 3, batch 2950, loss[loss=0.2605, simple_loss=0.3437, pruned_loss=0.08859, over 21799.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.3504, pruned_loss=0.1161, over 4279095.31 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:36:07,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=383634.0, ans=0.2 2023-06-19 04:36:14,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=383634.0, ans=0.2 2023-06-19 04:36:50,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.976e+02 3.392e+02 4.326e+02 8.351e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-19 04:37:11,476 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:37:13,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383814.0, ans=0.1 2023-06-19 04:37:28,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=383814.0, ans=0.0 2023-06-19 04:37:48,106 INFO [train.py:996] (3/4) Epoch 3, batch 3000, loss[loss=0.285, simple_loss=0.3512, pruned_loss=0.1093, over 20650.00 frames. 
2023-06-19 04:37:48,106 INFO [train.py:996] (3/4) Epoch 3, batch 3000, loss[loss=0.285, simple_loss=0.3512, pruned_loss=0.1093, over 20650.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3533, pruned_loss=0.1167, over 4281018.44 frames. ], batch size: 607, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:37:48,107 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 04:38:05,906 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2668, simple_loss=0.3633, pruned_loss=0.08521, over 1796401.00 frames.
2023-06-19 04:38:05,907 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-19 04:38:06,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=383934.0, ans=0.2
2023-06-19 04:38:06,427 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 04:38:27,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0
2023-06-19 04:38:51,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0
2023-06-19 04:39:10,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384054.0, ans=0.1
2023-06-19 04:39:16,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=384054.0, ans=0.125
2023-06-19 04:39:37,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0
2023-06-19 04:39:43,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.03 vs. limit=22.5
2023-06-19 04:39:46,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=384174.0, ans=0.0
2023-06-19 04:39:52,627 INFO [train.py:996] (3/4) Epoch 3, batch 3050, loss[loss=0.207, simple_loss=0.2788, pruned_loss=0.06765, over 21319.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3532, pruned_loss=0.1145, over 4280652.14 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:40:00,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0
2023-06-19 04:40:44,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 3.108e+02 3.737e+02 4.686e+02 8.351e+02, threshold=7.474e+02, percent-clipped=4.0
2023-06-19 04:41:03,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384414.0, ans=0.125
2023-06-19 04:41:09,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=384414.0, ans=0.1
2023-06-19 04:41:37,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0
2023-06-19 04:41:41,699 INFO [train.py:996] (3/4) Epoch 3, batch 3100, loss[loss=0.2407, simple_loss=0.3227, pruned_loss=0.07931, over 21694.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3517, pruned_loss=0.1117, over 4290938.93 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:42:43,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0
2023-06-19 04:42:56,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=384714.0, ans=0.125
2023-06-19 04:43:31,995 INFO [train.py:996] (3/4) Epoch 3, batch 3150, loss[loss=0.2378, simple_loss=0.3232, pruned_loss=0.07617, over 21590.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3542, pruned_loss=0.1138, over 4284757.75 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:43:44,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=384834.0, ans=0.0
2023-06-19 04:43:51,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=384834.0, ans=0.125
2023-06-19 04:43:59,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384894.0, ans=0.0
2023-06-19 04:44:08,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384894.0, ans=0.1
2023-06-19 04:44:19,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.177e+02 3.933e+02 4.816e+02 8.908e+02, threshold=7.865e+02, percent-clipped=2.0
2023-06-19 04:44:42,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0
2023-06-19 04:44:48,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=385014.0, ans=0.0
2023-06-19 04:44:48,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=385014.0, ans=0.0
2023-06-19 04:45:04,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5
2023-06-19 04:45:23,739 INFO [train.py:996] (3/4) Epoch 3, batch 3200, loss[loss=0.2595, simple_loss=0.3345, pruned_loss=0.09221, over 21673.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3538, pruned_loss=0.1133, over 4280905.16 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:45:51,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=385194.0, ans=0.0
2023-06-19 04:47:08,033 INFO [train.py:996] (3/4) Epoch 3, batch 3250, loss[loss=0.2921, simple_loss=0.3535, pruned_loss=0.1153, over 21466.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.354, pruned_loss=0.1153, over 4274040.66 frames. ], batch size: 211, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:47:50,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.327e+02 4.160e+02 5.584e+02 8.725e+02, threshold=8.319e+02, percent-clipped=2.0
2023-06-19 04:48:58,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=385734.0, ans=10.0
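Note: in every "train.py:996" record above, the printed loss equals pruned_loss plus half of simple_loss, e.g. 0.5 * 0.3512 + 0.1093 = 0.2849 ~ loss=0.285 at batch 3000. The 0.5 weight is inferred from the printed numbers themselves; a sketch of that bookkeeping:

    def displayed_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # Combination weight inferred from the log, not read from the recipe.
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(displayed_loss(0.3512, 0.1093) - 0.285) < 1e-3   # batch 3000
    assert abs(displayed_loss(0.316, 0.09978) - 0.2578) < 1e-3  # batch 2850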
2023-06-19 04:48:59,514 INFO [train.py:996] (3/4) Epoch 3, batch 3300, loss[loss=0.2685, simple_loss=0.339, pruned_loss=0.09906, over 21490.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3481, pruned_loss=0.114, over 4270372.43 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 04:49:15,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=385794.0, ans=0.0
2023-06-19 04:49:42,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385854.0, ans=0.0
2023-06-19 04:50:24,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=385974.0, ans=0.0
2023-06-19 04:50:44,348 INFO [train.py:996] (3/4) Epoch 3, batch 3350, loss[loss=0.2765, simple_loss=0.3306, pruned_loss=0.1112, over 21829.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3493, pruned_loss=0.1141, over 4258806.75 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:51:20,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.290e+02 3.790e+02 4.247e+02 7.031e+02, threshold=7.579e+02, percent-clipped=0.0
2023-06-19 04:51:54,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=386214.0, ans=0.125
2023-06-19 04:51:56,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=386214.0, ans=0.125
2023-06-19 04:52:27,716 INFO [train.py:996] (3/4) Epoch 3, batch 3400, loss[loss=0.2999, simple_loss=0.3776, pruned_loss=0.1111, over 21479.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3511, pruned_loss=0.1145, over 4265374.26 frames. ], batch size: 211, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:52:47,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0
2023-06-19 04:53:40,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386514.0, ans=0.1
2023-06-19 04:53:44,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=386514.0, ans=0.125
2023-06-19 04:53:44,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=386514.0, ans=0.0
2023-06-19 04:54:13,031 INFO [train.py:996] (3/4) Epoch 3, batch 3450, loss[loss=0.3518, simple_loss=0.3814, pruned_loss=0.1611, over 21381.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3483, pruned_loss=0.1147, over 4258030.01 frames. ], batch size: 507, lr: 1.19e-02, grad_scale: 32.0
2023-06-19 04:54:13,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=386634.0, ans=0.125
2023-06-19 04:54:32,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386694.0, ans=0.1
2023-06-19 04:54:39,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=386694.0, ans=0.125
2023-06-19 04:55:06,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.192e+02 3.930e+02 4.779e+02 8.558e+02, threshold=7.861e+02, percent-clipped=2.0
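Note: the optim.py:471 records print five quartile values (min, 25%, median, 75%, max) of recent gradient norms; the reported threshold is consistently Clipping_scale times the median, e.g. 2 * 3.934e+02 = 7.868e+02 vs. threshold=7.867e+02 at batch ~2850, and percent-clipped is the share of updates that exceeded it. A sketch of that report (a reconstruction from the printed numbers, not icefall's optim.py itself):

    import torch

    def clipping_report(recent_grad_norms, clipping_scale=2.0):
        # Quartiles over a window of recent per-update gradient norms.
        norms = torch.tensor(recent_grad_norms, dtype=torch.float32)
        q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]          # 2.0 x median, as in the log
        percent_clipped = 100.0 * (norms > threshold).float().mean()
        return q.tolist(), threshold.item(), percent_clipped.item()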
2023-06-19 04:55:07,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0
2023-06-19 04:55:12,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0
2023-06-19 04:55:48,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=386874.0, ans=0.125
2023-06-19 04:55:57,930 INFO [train.py:996] (3/4) Epoch 3, batch 3500, loss[loss=0.2907, simple_loss=0.381, pruned_loss=0.1002, over 19823.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3575, pruned_loss=0.1179, over 4252115.55 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 04:56:00,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386934.0, ans=0.1
2023-06-19 04:56:23,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=386994.0, ans=0.0
2023-06-19 04:57:37,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=387174.0, ans=0.0
2023-06-19 04:57:44,016 INFO [train.py:996] (3/4) Epoch 3, batch 3550, loss[loss=0.2335, simple_loss=0.2816, pruned_loss=0.09271, over 20221.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3586, pruned_loss=0.1184, over 4251322.13 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 04:58:23,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0
2023-06-19 04:58:38,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.279e+02 3.936e+02 4.776e+02 8.299e+02, threshold=7.873e+02, percent-clipped=2.0
2023-06-19 04:59:17,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=387474.0, ans=0.0
2023-06-19 04:59:31,867 INFO [train.py:996] (3/4) Epoch 3, batch 3600, loss[loss=0.3343, simple_loss=0.3801, pruned_loss=0.1442, over 21602.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3555, pruned_loss=0.1182, over 4252506.25 frames. ], batch size: 263, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 04:59:48,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=387534.0, ans=0.125
2023-06-19 05:00:17,291 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 05:00:18,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=387594.0, ans=0.125
2023-06-19 05:00:23,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387654.0, ans=0.1
2023-06-19 05:01:16,063 INFO [train.py:996] (3/4) Epoch 3, batch 3650, loss[loss=0.2501, simple_loss=0.3114, pruned_loss=0.09442, over 21466.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3578, pruned_loss=0.1179, over 4253817.19 frames. ], batch size: 194, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:01:36,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387834.0, ans=0.1
2023-06-19 05:02:02,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=387894.0, ans=0.125
2023-06-19 05:02:05,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=387954.0, ans=0.125
2023-06-19 05:02:08,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.313e+02 3.848e+02 4.708e+02 1.033e+03, threshold=7.696e+02, percent-clipped=4.0
2023-06-19 05:02:32,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=388014.0, ans=0.07
2023-06-19 05:02:59,721 INFO [train.py:996] (3/4) Epoch 3, batch 3700, loss[loss=0.2732, simple_loss=0.3402, pruned_loss=0.1031, over 21877.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.357, pruned_loss=0.1182, over 4257139.79 frames. ], batch size: 371, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:04:04,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388254.0, ans=0.1
2023-06-19 05:04:08,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=388254.0, ans=0.125
2023-06-19 05:04:12,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=388314.0, ans=0.0
2023-06-19 05:04:22,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=388314.0, ans=0.05
2023-06-19 05:04:56,326 INFO [train.py:996] (3/4) Epoch 3, batch 3750, loss[loss=0.1899, simple_loss=0.2347, pruned_loss=0.0725, over 17107.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3546, pruned_loss=0.1169, over 4254569.91 frames. ], batch size: 63, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:05:30,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=388494.0, ans=0.125
2023-06-19 05:05:43,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 3.137e+02 4.357e+02 5.330e+02 7.776e+02, threshold=8.713e+02, percent-clipped=1.0
2023-06-19 05:05:50,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=388554.0, ans=0.0
2023-06-19 05:05:51,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.0
2023-06-19 05:05:54,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0
2023-06-19 05:06:46,815 INFO [train.py:996] (3/4) Epoch 3, batch 3800, loss[loss=0.3043, simple_loss=0.3557, pruned_loss=0.1265, over 21254.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3533, pruned_loss=0.116, over 4263087.26 frames. ], batch size: 159, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:06:59,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0
2023-06-19 05:07:03,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=388794.0, ans=0.0
2023-06-19 05:07:34,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=388854.0, ans=0.07
2023-06-19 05:08:20,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=388974.0, ans=0.125
2023-06-19 05:08:23,796 INFO [train.py:996] (3/4) Epoch 3, batch 3850, loss[loss=0.2546, simple_loss=0.3045, pruned_loss=0.1024, over 21668.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3494, pruned_loss=0.1165, over 4269274.51 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:08:49,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=389094.0, ans=0.0
2023-06-19 05:09:10,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.065e+02 3.544e+02 4.567e+02 7.617e+02, threshold=7.087e+02, percent-clipped=0.0
2023-06-19 05:09:22,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=389214.0, ans=0.125
2023-06-19 05:09:47,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=389274.0, ans=0.0
2023-06-19 05:09:49,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5
2023-06-19 05:10:01,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=389274.0, ans=0.125
2023-06-19 05:10:07,276 INFO [train.py:996] (3/4) Epoch 3, batch 3900, loss[loss=0.3084, simple_loss=0.3549, pruned_loss=0.1309, over 21746.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3464, pruned_loss=0.1169, over 4275606.38 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:10:43,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=389394.0, ans=0.2
2023-06-19 05:10:51,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389454.0, ans=0.1
2023-06-19 05:11:11,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=389514.0, ans=0.0
2023-06-19 05:11:51,750 INFO [train.py:996] (3/4) Epoch 3, batch 3950, loss[loss=0.191, simple_loss=0.2683, pruned_loss=0.05685, over 21761.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3453, pruned_loss=0.1149, over 4276281.61 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:12:06,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=389634.0, ans=0.125
2023-06-19 05:12:11,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=389694.0, ans=10.0
2023-06-19 05:12:38,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.040e+02 3.554e+02 4.206e+02 5.675e+02, threshold=7.109e+02, percent-clipped=0.0
2023-06-19 05:13:28,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=389874.0, ans=0.2
2023-06-19 05:13:36,453 INFO [train.py:996] (3/4) Epoch 3, batch 4000, loss[loss=0.2679, simple_loss=0.3134, pruned_loss=0.1113, over 21829.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3397, pruned_loss=0.111, over 4271122.36 frames. ], batch size: 107, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:14:36,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390054.0, ans=0.1
2023-06-19 05:14:38,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=390114.0, ans=0.05
2023-06-19 05:15:20,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390174.0, ans=0.1
2023-06-19 05:15:23,365 INFO [train.py:996] (3/4) Epoch 3, batch 4050, loss[loss=0.265, simple_loss=0.3123, pruned_loss=0.1088, over 21474.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3393, pruned_loss=0.1087, over 4265788.24 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:15:47,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=390294.0, ans=0.125
2023-06-19 05:16:04,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=390294.0, ans=0.0
2023-06-19 05:16:06,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390354.0, ans=0.1
2023-06-19 05:16:10,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.094e+02 3.976e+02 4.759e+02 9.787e+02, threshold=7.952e+02, percent-clipped=5.0
2023-06-19 05:16:28,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=390414.0, ans=0.1
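Note: the scaling.py:962 "Whitening: ... metric=X vs. limit=Y" records report how far a module's feature covariance is from a multiple of the identity (1.0 would be perfectly white), per channel group, against the limit above which the Whiten module starts pushing back. A sketch that mirrors the idea, not icefall's exact code:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels); channels are split into num_groups.
        num_channels = x.shape[-1]
        cpg = num_channels // num_groups                    # channels per group
        g = x.reshape(-1, num_groups, cpg).transpose(0, 1)  # (groups, frames, cpg)
        g = g - g.mean(dim=1, keepdim=True)
        cov = g.transpose(1, 2) @ g / g.shape[1]            # per-group covariance
        diag_mean = cov.diagonal(dim1=1, dim2=2).mean()
        cov_energy = (cov ** 2).mean() * cpg
        # Equals 1.0 when cov is a multiple of the identity; grows as the
        # covariance becomes less isotropic (by Cauchy-Schwarz it is >= 1).
        return (cov_energy / (diag_mean ** 2 + 1e-20)).item()

    x = torch.randn(1000, 256)                # near-white input
    print(whitening_metric(x, num_groups=1))  # ~1.0; correlated features score higher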
2023-06-19 05:17:13,601 INFO [train.py:996] (3/4) Epoch 3, batch 4100, loss[loss=0.249, simple_loss=0.322, pruned_loss=0.08802, over 21627.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3432, pruned_loss=0.1097, over 4270089.06 frames. ], batch size: 263, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:17:19,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390534.0, ans=0.125
2023-06-19 05:18:08,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390714.0, ans=0.1
2023-06-19 05:18:11,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390714.0, ans=0.125
2023-06-19 05:18:51,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=390774.0, ans=0.2
2023-06-19 05:18:58,803 INFO [train.py:996] (3/4) Epoch 3, batch 4150, loss[loss=0.2816, simple_loss=0.3522, pruned_loss=0.1055, over 21580.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3408, pruned_loss=0.1049, over 4277625.13 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:19:02,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=390834.0, ans=0.2
2023-06-19 05:19:10,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0
2023-06-19 05:19:11,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=390834.0, ans=0.125
2023-06-19 05:19:30,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=390894.0, ans=0.125
2023-06-19 05:19:41,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 3.003e+02 3.732e+02 5.110e+02 9.922e+02, threshold=7.464e+02, percent-clipped=2.0
2023-06-19 05:20:50,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=391134.0, ans=0.2
2023-06-19 05:20:51,277 INFO [train.py:996] (3/4) Epoch 3, batch 4200, loss[loss=0.2684, simple_loss=0.3392, pruned_loss=0.09878, over 21684.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3418, pruned_loss=0.1059, over 4276553.45 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 16.0
2023-06-19 05:20:58,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=391134.0, ans=0.015
2023-06-19 05:21:10,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0
2023-06-19 05:21:13,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391194.0, ans=0.125
2023-06-19 05:21:17,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=391194.0, ans=0.05
2023-06-19 05:22:37,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=391374.0, ans=10.0
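Note: training runs with mixed precision, and the "grad_scale" field follows the usual dynamic-loss-scaling pattern: it halves when a step overflows (32.0 down to 16.0 around batch 4200 above) and grows back after a run of stable steps (back to 32.0 by batch 4400 below). A minimal sketch using PyTorch's own scaler; the constructor parameters and the compute_loss helper are illustrative, not the recipe's actual settings:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, backoff_factor=0.5,
                                       growth_factor=2.0, growth_interval=200)

    def fp16_step(model, optimizer, compute_loss, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)  # hypothetical helper
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skipped, and the scale halved, on overflow
        scaler.update()         # scale grows again after enough clean steps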
2023-06-19 05:22:40,214 INFO [train.py:996] (3/4) Epoch 3, batch 4250, loss[loss=0.2859, simple_loss=0.3512, pruned_loss=0.1103, over 21520.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3491, pruned_loss=0.1084, over 4280939.80 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 16.0
2023-06-19 05:22:56,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0
2023-06-19 05:22:57,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391494.0, ans=0.1
2023-06-19 05:23:30,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.309e+02 4.046e+02 4.889e+02 9.500e+02, threshold=8.092e+02, percent-clipped=4.0
2023-06-19 05:23:34,660 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 05:24:23,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=391674.0, ans=0.125
2023-06-19 05:24:27,988 INFO [train.py:996] (3/4) Epoch 3, batch 4300, loss[loss=0.2655, simple_loss=0.334, pruned_loss=0.09849, over 21216.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3538, pruned_loss=0.1102, over 4277117.85 frames. ], batch size: 548, lr: 1.18e-02, grad_scale: 16.0
2023-06-19 05:25:04,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=391794.0, ans=0.0
2023-06-19 05:25:41,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391914.0, ans=0.1
2023-06-19 05:26:13,014 INFO [train.py:996] (3/4) Epoch 3, batch 4350, loss[loss=0.2992, simple_loss=0.3472, pruned_loss=0.1256, over 21773.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3512, pruned_loss=0.1096, over 4272699.53 frames. ], batch size: 351, lr: 1.18e-02, grad_scale: 16.0
2023-06-19 05:27:03,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=392154.0, ans=0.125
2023-06-19 05:27:08,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.136e+02 3.673e+02 4.293e+02 1.094e+03, threshold=7.346e+02, percent-clipped=4.0
2023-06-19 05:27:33,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0
2023-06-19 05:27:59,229 INFO [train.py:996] (3/4) Epoch 3, batch 4400, loss[loss=0.3174, simple_loss=0.39, pruned_loss=0.1224, over 21619.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3489, pruned_loss=0.11, over 4273976.22 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:28:04,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392334.0, ans=0.1
2023-06-19 05:28:06,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=392334.0, ans=0.2
2023-06-19 05:28:30,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392394.0, ans=0.1
2023-06-19 05:29:18,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=392514.0, ans=0.04949747468305833
2023-06-19 05:29:19,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2023-06-19 05:29:30,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392574.0, ans=0.125
2023-06-19 05:29:33,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392574.0, ans=0.1
2023-06-19 05:29:36,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=392574.0, ans=0.125
2023-06-19 05:29:49,755 INFO [train.py:996] (3/4) Epoch 3, batch 4450, loss[loss=0.3065, simple_loss=0.3715, pruned_loss=0.1208, over 21740.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3559, pruned_loss=0.1116, over 4276861.24 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:29:51,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=392634.0, ans=0.2
2023-06-19 05:30:20,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=392694.0, ans=0.125
2023-06-19 05:30:33,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=392754.0, ans=10.0
2023-06-19 05:30:40,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.016e+02 3.680e+02 4.427e+02 7.679e+02, threshold=7.360e+02, percent-clipped=2.0
2023-06-19 05:30:51,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=392814.0, ans=0.04949747468305833
2023-06-19 05:31:36,807 INFO [train.py:996] (3/4) Epoch 3, batch 4500, loss[loss=0.3329, simple_loss=0.4009, pruned_loss=0.1324, over 21748.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3566, pruned_loss=0.1135, over 4283503.40 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:31:37,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=392934.0, ans=0.125
2023-06-19 05:33:13,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=393174.0, ans=0.125
2023-06-19 05:33:25,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=393174.0, ans=0.2
2023-06-19 05:33:28,833 INFO [train.py:996] (3/4) Epoch 3, batch 4550, loss[loss=0.3196, simple_loss=0.3811, pruned_loss=0.1291, over 21572.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3627, pruned_loss=0.115, over 4280210.30 frames. ], batch size: 230, lr: 1.18e-02, grad_scale: 32.0
2023-06-19 05:34:13,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.526e+02 4.619e+02 6.028e+02 1.155e+03, threshold=9.238e+02, percent-clipped=14.0
2023-06-19 05:34:54,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=393474.0, ans=0.125
2023-06-19 05:35:14,600 INFO [train.py:996] (3/4) Epoch 3, batch 4600, loss[loss=0.2608, simple_loss=0.3253, pruned_loss=0.09815, over 21357.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3646, pruned_loss=0.1164, over 4276313.34 frames. ], batch size: 143, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:35:52,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393654.0, ans=0.1
2023-06-19 05:35:57,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0
2023-06-19 05:36:14,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=393654.0, ans=0.0
2023-06-19 05:36:17,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0
2023-06-19 05:36:28,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=393714.0, ans=0.125
2023-06-19 05:36:30,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=393714.0, ans=0.0
2023-06-19 05:37:01,733 INFO [train.py:996] (3/4) Epoch 3, batch 4650, loss[loss=0.2137, simple_loss=0.2819, pruned_loss=0.07278, over 21754.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3565, pruned_loss=0.1135, over 4285178.73 frames. ], batch size: 298, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:37:08,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=393834.0, ans=0.0
2023-06-19 05:37:26,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0
2023-06-19 05:37:44,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.775e+02 3.310e+02 3.723e+02 7.638e+02, threshold=6.620e+02, percent-clipped=0.0
2023-06-19 05:38:11,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0
2023-06-19 05:38:26,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=394074.0, ans=0.2
2023-06-19 05:38:27,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394074.0, ans=0.1
2023-06-19 05:38:38,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394074.0, ans=0.125
2023-06-19 05:38:53,903 INFO [train.py:996] (3/4) Epoch 3, batch 4700, loss[loss=0.2306, simple_loss=0.2861, pruned_loss=0.08752, over 21551.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.347, pruned_loss=0.1108, over 4277199.44 frames. ], batch size: 263, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:39:28,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=394254.0, ans=6.0
2023-06-19 05:39:58,285 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 05:40:32,932 INFO [train.py:996] (3/4) Epoch 3, batch 4750, loss[loss=0.3263, simple_loss=0.3646, pruned_loss=0.144, over 21623.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3421, pruned_loss=0.111, over 4280461.94 frames. ], batch size: 473, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:40:35,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=394434.0, ans=0.0
2023-06-19 05:41:21,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.052e+02 3.874e+02 5.001e+02 1.083e+03, threshold=7.748e+02, percent-clipped=9.0
2023-06-19 05:42:04,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5
2023-06-19 05:42:12,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394674.0, ans=0.125
2023-06-19 05:42:25,157 INFO [train.py:996] (3/4) Epoch 3, batch 4800, loss[loss=0.2444, simple_loss=0.3286, pruned_loss=0.08008, over 21406.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.343, pruned_loss=0.1111, over 4285133.14 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:43:34,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0
2023-06-19 05:43:42,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5
2023-06-19 05:44:10,683 INFO [train.py:996] (3/4) Epoch 3, batch 4850, loss[loss=0.3288, simple_loss=0.3788, pruned_loss=0.1394, over 21531.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3422, pruned_loss=0.1104, over 4293050.76 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:44:12,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=395034.0, ans=0.125
2023-06-19 05:44:39,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=395094.0, ans=0.125
2023-06-19 05:44:49,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=15.0
2023-06-19 05:44:54,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.516e+02 4.493e+02 6.112e+02 1.101e+03, threshold=8.986e+02, percent-clipped=11.0
2023-06-19 05:44:58,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=395154.0, ans=0.125
2023-06-19 05:45:13,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395214.0, ans=0.1
2023-06-19 05:45:55,753 INFO [train.py:996] (3/4) Epoch 3, batch 4900, loss[loss=0.2949, simple_loss=0.377, pruned_loss=0.1065, over 21740.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3429, pruned_loss=0.111, over 4294074.93 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:46:40,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=395454.0, ans=0.2
2023-06-19 05:47:06,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=395514.0, ans=0.025
2023-06-19 05:47:28,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=395574.0, ans=0.2
2023-06-19 05:47:41,158 INFO [train.py:996] (3/4) Epoch 3, batch 4950, loss[loss=0.278, simple_loss=0.3933, pruned_loss=0.08136, over 20744.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3471, pruned_loss=0.1095, over 4280428.40 frames. ], batch size: 608, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:47:41,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=395634.0, ans=0.0
2023-06-19 05:47:41,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=395634.0, ans=0.125
2023-06-19 05:48:31,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.805e+02 3.354e+02 4.068e+02 9.306e+02, threshold=6.708e+02, percent-clipped=1.0
2023-06-19 05:49:21,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0
2023-06-19 05:49:27,324 INFO [train.py:996] (3/4) Epoch 3, batch 5000, loss[loss=0.3343, simple_loss=0.3839, pruned_loss=0.1424, over 21633.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3456, pruned_loss=0.1057, over 4284753.33 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:49:48,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5
2023-06-19 05:50:28,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0
2023-06-19 05:50:58,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=396174.0, ans=0.125
2023-06-19 05:51:12,144 INFO [train.py:996] (3/4) Epoch 3, batch 5050, loss[loss=0.2828, simple_loss=0.3439, pruned_loss=0.1108, over 21849.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3473, pruned_loss=0.1083, over 4288933.45 frames. ], batch size: 391, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:51:35,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396294.0, ans=0.125
2023-06-19 05:51:37,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396294.0, ans=0.1
2023-06-19 05:51:56,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.428e+02 4.062e+02 4.972e+02 8.550e+02, threshold=8.125e+02, percent-clipped=7.0
2023-06-19 05:52:13,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=396414.0, ans=0.2
2023-06-19 05:52:25,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0
2023-06-19 05:52:48,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.54 vs. limit=10.0
2023-06-19 05:52:52,317 INFO [train.py:996] (3/4) Epoch 3, batch 5100, loss[loss=0.2479, simple_loss=0.3098, pruned_loss=0.09299, over 21930.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3465, pruned_loss=0.1089, over 4284568.89 frames. ], batch size: 316, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:53:01,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396534.0, ans=0.125
2023-06-19 05:53:01,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=396534.0, ans=0.125
2023-06-19 05:53:40,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0
2023-06-19 05:53:46,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=396654.0, ans=0.0
2023-06-19 05:53:55,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5
2023-06-19 05:54:36,998 INFO [train.py:996] (3/4) Epoch 3, batch 5150, loss[loss=0.287, simple_loss=0.3343, pruned_loss=0.1199, over 21367.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3449, pruned_loss=0.1096, over 4282958.35 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0
2023-06-19 05:54:38,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0
2023-06-19 05:55:07,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=396894.0, ans=0.125
2023-06-19 05:55:08,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=396894.0, ans=10.0
2023-06-19 05:55:18,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0
2023-06-19 05:55:27,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.247e+02 3.957e+02 4.711e+02 9.896e+02, threshold=7.915e+02, percent-clipped=1.0
2023-06-19 05:55:32,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=396954.0, ans=0.0
2023-06-19 05:55:56,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=397014.0, ans=0.0
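Note: the "lr:" field decays smoothly rather than in steps, creeping from 1.19e-02 at the top of this stretch to 1.16e-02 further below. This is consistent with an Eden-style schedule that applies a power-law decay in both the batch and epoch counters; the formula and the lr_batches/lr_epochs constants below are a sketch from memory of icefall's optim.py, not read from this log:

    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
        # Smooth decay in both counters; near the start both factors are ~1,
        # and each tends to (counter / constant) ** -0.5 for large counters.
        return (base_lr
                * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
                * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)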
], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:56:24,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=397134.0, ans=0.0 2023-06-19 05:56:36,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397134.0, ans=0.1 2023-06-19 05:56:38,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-19 05:57:05,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=397194.0, ans=0.5 2023-06-19 05:57:25,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=397314.0, ans=0.125 2023-06-19 05:58:05,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397374.0, ans=0.125 2023-06-19 05:58:09,827 INFO [train.py:996] (3/4) Epoch 3, batch 5250, loss[loss=0.3142, simple_loss=0.4405, pruned_loss=0.09391, over 19640.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3475, pruned_loss=0.1068, over 4280825.13 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:58:59,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.349e+02 3.883e+02 5.144e+02 8.715e+02, threshold=7.765e+02, percent-clipped=1.0 2023-06-19 05:59:53,201 INFO [train.py:996] (3/4) Epoch 3, batch 5300, loss[loss=0.2928, simple_loss=0.3372, pruned_loss=0.1242, over 21468.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3474, pruned_loss=0.108, over 4289282.19 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:00:00,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-19 06:00:53,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-19 06:01:09,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-19 06:01:33,542 INFO [train.py:996] (3/4) Epoch 3, batch 5350, loss[loss=0.2877, simple_loss=0.3407, pruned_loss=0.1174, over 21990.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3484, pruned_loss=0.1104, over 4290328.46 frames. ], batch size: 113, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:01:38,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-19 06:01:56,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.26 vs. 
limit=15.0 2023-06-19 06:01:58,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=398094.0, ans=0.0 2023-06-19 06:02:17,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398154.0, ans=0.1 2023-06-19 06:02:23,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.150e+02 3.600e+02 4.539e+02 9.021e+02, threshold=7.200e+02, percent-clipped=2.0 2023-06-19 06:02:27,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398154.0, ans=0.125 2023-06-19 06:03:09,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=398274.0, ans=0.025 2023-06-19 06:03:18,164 INFO [train.py:996] (3/4) Epoch 3, batch 5400, loss[loss=0.3316, simple_loss=0.3821, pruned_loss=0.1405, over 20003.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3479, pruned_loss=0.1116, over 4294425.87 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:04:06,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=398454.0, ans=0.5 2023-06-19 06:04:32,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=398514.0, ans=0.0 2023-06-19 06:04:56,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398574.0, ans=0.1 2023-06-19 06:05:02,974 INFO [train.py:996] (3/4) Epoch 3, batch 5450, loss[loss=0.27, simple_loss=0.3745, pruned_loss=0.08275, over 21693.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3476, pruned_loss=0.109, over 4291072.74 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:05:07,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-19 06:05:24,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=398634.0, ans=0.125 2023-06-19 06:05:47,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398754.0, ans=0.1 2023-06-19 06:06:00,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.869e+02 3.395e+02 4.566e+02 8.866e+02, threshold=6.789e+02, percent-clipped=3.0 2023-06-19 06:06:59,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=398874.0, ans=0.125 2023-06-19 06:07:02,445 INFO [train.py:996] (3/4) Epoch 3, batch 5500, loss[loss=0.3219, simple_loss=0.4029, pruned_loss=0.1204, over 21630.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3525, pruned_loss=0.1062, over 4286146.25 frames. 
], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:07:22,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398994.0, ans=0.1 2023-06-19 06:07:54,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=399054.0, ans=0.125 2023-06-19 06:08:10,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=399114.0, ans=0.125 2023-06-19 06:08:39,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-19 06:08:46,532 INFO [train.py:996] (3/4) Epoch 3, batch 5550, loss[loss=0.2436, simple_loss=0.3378, pruned_loss=0.07473, over 21574.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3512, pruned_loss=0.1032, over 4277397.82 frames. ], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:08:49,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-19 06:09:20,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-19 06:09:32,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=399354.0, ans=0.0 2023-06-19 06:09:38,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.755e+02 3.259e+02 4.197e+02 7.319e+02, threshold=6.518e+02, percent-clipped=2.0 2023-06-19 06:09:44,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=399354.0, ans=0.125 2023-06-19 06:09:59,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=399414.0, ans=0.2 2023-06-19 06:10:31,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-19 06:10:34,000 INFO [train.py:996] (3/4) Epoch 3, batch 5600, loss[loss=0.3438, simple_loss=0.4567, pruned_loss=0.1155, over 19789.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3486, pruned_loss=0.1001, over 4274946.41 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:11:06,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399594.0, ans=0.1 2023-06-19 06:11:16,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=399654.0, ans=0.0 2023-06-19 06:11:47,994 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:12:01,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=399774.0, ans=0.0 2023-06-19 06:12:08,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. 
limit=15.0 2023-06-19 06:12:11,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=399774.0, ans=0.04949747468305833 2023-06-19 06:12:17,385 INFO [train.py:996] (3/4) Epoch 3, batch 5650, loss[loss=0.344, simple_loss=0.3989, pruned_loss=0.1446, over 20164.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3538, pruned_loss=0.1039, over 4283889.31 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:13:13,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.102e+02 3.958e+02 5.147e+02 8.863e+02, threshold=7.916e+02, percent-clipped=12.0 2023-06-19 06:13:27,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=399954.0, ans=0.035 2023-06-19 06:13:36,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-19 06:13:45,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-19 06:13:52,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=400074.0, ans=0.0 2023-06-19 06:14:16,669 INFO [train.py:996] (3/4) Epoch 3, batch 5700, loss[loss=0.2712, simple_loss=0.3371, pruned_loss=0.1026, over 21655.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3523, pruned_loss=0.1058, over 4291226.99 frames. ], batch size: 263, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:14:49,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=400194.0, ans=0.0 2023-06-19 06:15:24,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-19 06:15:35,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=400314.0, ans=0.2 2023-06-19 06:15:54,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400374.0, ans=0.125 2023-06-19 06:15:55,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400374.0, ans=0.1 2023-06-19 06:16:03,947 INFO [train.py:996] (3/4) Epoch 3, batch 5750, loss[loss=0.2358, simple_loss=0.3219, pruned_loss=0.07483, over 21747.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3493, pruned_loss=0.1028, over 4288120.00 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:16:41,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=400494.0, ans=0.0 2023-06-19 06:16:54,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.901e+02 3.353e+02 4.192e+02 8.562e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-19 06:16:55,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. 
limit=6.0 2023-06-19 06:16:56,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400554.0, ans=0.1 2023-06-19 06:17:48,969 INFO [train.py:996] (3/4) Epoch 3, batch 5800, loss[loss=0.2939, simple_loss=0.3705, pruned_loss=0.1087, over 21659.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3464, pruned_loss=0.1011, over 4283104.01 frames. ], batch size: 263, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:18:05,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-19 06:18:45,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400854.0, ans=0.125 2023-06-19 06:18:55,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=400914.0, ans=0.0 2023-06-19 06:19:41,059 INFO [train.py:996] (3/4) Epoch 3, batch 5850, loss[loss=0.2184, simple_loss=0.3196, pruned_loss=0.05864, over 21591.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3428, pruned_loss=0.09528, over 4282807.54 frames. ], batch size: 263, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:20:12,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-19 06:20:32,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.443e+02 2.875e+02 3.533e+02 5.012e+02, threshold=5.751e+02, percent-clipped=0.0 2023-06-19 06:21:08,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=401274.0, ans=0.125 2023-06-19 06:21:31,744 INFO [train.py:996] (3/4) Epoch 3, batch 5900, loss[loss=0.2889, simple_loss=0.397, pruned_loss=0.0904, over 20774.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3334, pruned_loss=0.0881, over 4272916.32 frames. ], batch size: 607, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:21:53,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=401394.0, ans=0.09899494936611666 2023-06-19 06:22:06,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=401454.0, ans=0.05 2023-06-19 06:22:39,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=401514.0, ans=0.125 2023-06-19 06:22:51,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=401574.0, ans=0.025 2023-06-19 06:23:09,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401574.0, ans=0.0 2023-06-19 06:23:13,883 INFO [train.py:996] (3/4) Epoch 3, batch 5950, loss[loss=0.3114, simple_loss=0.3514, pruned_loss=0.1357, over 21638.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3343, pruned_loss=0.09383, over 4278698.67 frames. 
2023-06-19 06:23:22,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401634.0, ans=0.0
2023-06-19 06:23:26,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401634.0, ans=0.1
2023-06-19 06:23:58,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.851e+02 3.351e+02 4.142e+02 6.067e+02, threshold=6.702e+02, percent-clipped=3.0
2023-06-19 06:25:00,252 INFO [train.py:996] (3/4) Epoch 3, batch 6000, loss[loss=0.2695, simple_loss=0.315, pruned_loss=0.112, over 15041.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3308, pruned_loss=0.09811, over 4271020.34 frames. ], batch size: 60, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:25:00,252 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 06:25:17,465 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2818, simple_loss=0.374, pruned_loss=0.0948, over 1796401.00 frames.
2023-06-19 06:25:17,466 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-19 06:25:18,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0
2023-06-19 06:25:34,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=401994.0, ans=0.125
2023-06-19 06:26:03,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=402054.0, ans=0.0
2023-06-19 06:26:06,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0
2023-06-19 06:26:39,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=402114.0, ans=0.0
2023-06-19 06:26:58,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=402234.0, ans=0.0
2023-06-19 06:26:59,504 INFO [train.py:996] (3/4) Epoch 3, batch 6050, loss[loss=0.2835, simple_loss=0.3342, pruned_loss=0.1164, over 21360.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3256, pruned_loss=0.09882, over 4272915.33 frames. ], batch size: 507, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:27:04,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=402234.0, ans=0.025
2023-06-19 06:27:43,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=402354.0, ans=0.2
2023-06-19 06:27:49,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.918e+02 3.547e+02 4.372e+02 9.416e+02, threshold=7.093e+02, percent-clipped=6.0
2023-06-19 06:28:13,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=402414.0, ans=0.125
2023-06-19 06:28:43,513 INFO [train.py:996] (3/4) Epoch 3, batch 6100, loss[loss=0.2311, simple_loss=0.2998, pruned_loss=0.08113, over 21706.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3239, pruned_loss=0.09729, over 4278051.72 frames. ], batch size: 230, lr: 1.16e-02, grad_scale: 32.0
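The trio of records at batch 6000 ("Computing validation loss", "Epoch 3, validation: ...", "Maximum memory allocated ...") is the periodic validation pass; it always covers the same 1796401.00 frames, so validation losses are directly comparable across the run. A minimal sketch of that pass is below; the names (including the model_forward helper) are illustrative, not the recipe's exact train.py code.

import torch

def compute_validation_loss(model, valid_loader, device):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            # model_forward is a hypothetical helper returning (loss, num_frames)
            loss, num_frames = model_forward(model, batch, device)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()
    return tot_loss / tot_frames  # average per-frame loss over the fixed dev set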
2023-06-19 06:30:13,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=402774.0, ans=0.125
2023-06-19 06:30:30,053 INFO [train.py:996] (3/4) Epoch 3, batch 6150, loss[loss=0.2364, simple_loss=0.3081, pruned_loss=0.08239, over 21504.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3295, pruned_loss=0.1004, over 4277963.77 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:30:47,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=402834.0, ans=0.125
2023-06-19 06:31:15,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402954.0, ans=0.1
2023-06-19 06:31:21,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 3.076e+02 3.570e+02 4.379e+02 8.300e+02, threshold=7.140e+02, percent-clipped=3.0
2023-06-19 06:31:43,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=403014.0, ans=0.125
2023-06-19 06:32:15,744 INFO [train.py:996] (3/4) Epoch 3, batch 6200, loss[loss=0.2709, simple_loss=0.3454, pruned_loss=0.09816, over 21441.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3339, pruned_loss=0.1009, over 4287197.06 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:32:32,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=403134.0, ans=0.125
2023-06-19 06:32:55,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0
2023-06-19 06:33:00,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=403254.0, ans=0.125
2023-06-19 06:33:13,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=403254.0, ans=0.0
2023-06-19 06:33:53,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=403374.0, ans=0.0
2023-06-19 06:34:06,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=403434.0, ans=0.2
2023-06-19 06:34:08,033 INFO [train.py:996] (3/4) Epoch 3, batch 6250, loss[loss=0.2732, simple_loss=0.3633, pruned_loss=0.09156, over 21791.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3378, pruned_loss=0.1001, over 4279893.75 frames. ], batch size: 332, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:34:23,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=403434.0, ans=0.125
2023-06-19 06:34:24,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0
2023-06-19 06:35:08,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.015e+02 3.767e+02 4.898e+02 1.129e+03, threshold=7.534e+02, percent-clipped=8.0
2023-06-19 06:35:13,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0
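The optim.py:471 records print five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms. In every record in this section the logged threshold equals Clipping_scale times the median, e.g. 7.140e+02 = 2.0 * 3.570e+02 just above, and percent-clipped is the share of recent batches whose norm exceeded that threshold. A sketch of that bookkeeping, illustrative rather than icefall's actual optim.py:

import torch

class GradNormTracker:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []          # recent gradient norms
        self.num_clipped = 0

    def update(self, grad_norm: float) -> float:
        self.norms.append(grad_norm)
        self.norms = self.norms[-self.window:]
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()   # 2.0 * median
        if grad_norm > threshold:
            self.num_clipped += 1
        # caller would scale gradients down to this norm when exceeded
        return threshold

Because the threshold tracks the median, it rises and falls with the run itself, which is why the logged threshold drifts between roughly 5.7e+02 and 9e+02 across this section.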
2023-06-19 06:35:21,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0
2023-06-19 06:36:01,198 INFO [train.py:996] (3/4) Epoch 3, batch 6300, loss[loss=0.3138, simple_loss=0.3649, pruned_loss=0.1313, over 21878.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3401, pruned_loss=0.0988, over 4274343.28 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:36:06,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=403734.0, ans=0.025
2023-06-19 06:36:07,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0
2023-06-19 06:36:19,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0
2023-06-19 06:36:52,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=403854.0, ans=0.0
2023-06-19 06:37:45,950 INFO [train.py:996] (3/4) Epoch 3, batch 6350, loss[loss=0.3655, simple_loss=0.408, pruned_loss=0.1615, over 21527.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3448, pruned_loss=0.1042, over 4280768.48 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:37:59,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=404034.0, ans=0.125
2023-06-19 06:38:38,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.096e+02 3.646e+02 4.304e+02 8.936e+02, threshold=7.293e+02, percent-clipped=1.0
2023-06-19 06:38:39,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=404154.0, ans=0.125
2023-06-19 06:38:52,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=404214.0, ans=0.125
2023-06-19 06:39:23,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404274.0, ans=0.1
2023-06-19 06:39:23,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5
2023-06-19 06:39:31,458 INFO [train.py:996] (3/4) Epoch 3, batch 6400, loss[loss=0.2973, simple_loss=0.3636, pruned_loss=0.1155, over 21677.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3524, pruned_loss=0.1097, over 4282355.26 frames. ], batch size: 351, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:40:23,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=404454.0, ans=0.125
2023-06-19 06:40:35,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=404454.0, ans=0.0
2023-06-19 06:40:53,070 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 06:41:09,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=404574.0, ans=0.0
2023-06-19 06:41:17,044 INFO [train.py:996] (3/4) Epoch 3, batch 6450, loss[loss=0.2701, simple_loss=0.3377, pruned_loss=0.1012, over 21696.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3544, pruned_loss=0.1087, over 4284890.94 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0
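The grad_scale field flipping between 32.0 and 16.0 in the batch records above (16.0 at batch 6200, back to 32.0 by batch 6400) is consistent with dynamic loss scaling for fp16 training: the scale is halved when a step produces inf/nan gradients and grown again after a stretch of clean steps. A hedged sketch of that loop using torch.cuda.amp, with illustrative rather than the recipe's actual defaults:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

def training_step(model, optimizer, batch_to_loss, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = batch_to_loss(model, batch)   # hypothetical loss helper
    scaler.scale(loss).backward()
    scaler.step(optimizer)                   # skips the update on inf/nan grads
    scaler.update()                          # halves or grows the scale
    return scaler.get_scale()                # the "grad_scale" a log would show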
2023-06-19 06:41:17,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0
2023-06-19 06:42:01,795 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 06:42:09,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.087e+02 4.205e+02 5.976e+02 1.329e+03, threshold=8.410e+02, percent-clipped=11.0
2023-06-19 06:42:49,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=404874.0, ans=0.125
2023-06-19 06:43:02,077 INFO [train.py:996] (3/4) Epoch 3, batch 6500, loss[loss=0.273, simple_loss=0.3437, pruned_loss=0.1012, over 21739.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3474, pruned_loss=0.1076, over 4281639.65 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:43:55,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=405054.0, ans=0.0
2023-06-19 06:44:27,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=405174.0, ans=10.0
2023-06-19 06:44:28,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0
2023-06-19 06:44:30,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=405174.0, ans=0.125
2023-06-19 06:44:54,395 INFO [train.py:996] (3/4) Epoch 3, batch 6550, loss[loss=0.3014, simple_loss=0.3531, pruned_loss=0.1248, over 21649.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3463, pruned_loss=0.1071, over 4278696.44 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:45:19,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=405294.0, ans=0.0
2023-06-19 06:45:36,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=405354.0, ans=0.0
2023-06-19 06:45:41,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.916e+02 3.526e+02 4.372e+02 9.339e+02, threshold=7.052e+02, percent-clipped=1.0
2023-06-19 06:46:06,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=405414.0, ans=0.0
2023-06-19 06:46:38,351 INFO [train.py:996] (3/4) Epoch 3, batch 6600, loss[loss=0.2504, simple_loss=0.304, pruned_loss=0.09841, over 21661.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3393, pruned_loss=0.1062, over 4270214.68 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:46:47,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0
2023-06-19 06:47:46,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=405714.0, ans=0.04949747468305833
2023-06-19 06:47:54,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0
2023-06-19 06:48:17,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=405774.0, ans=0.125
2023-06-19 06:48:23,550 INFO [train.py:996] (3/4) Epoch 3, batch 6650, loss[loss=0.2591, simple_loss=0.3151, pruned_loss=0.1016, over 21695.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3298, pruned_loss=0.1023, over 4272611.45 frames. ], batch size: 282, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:49:17,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.970e+02 3.451e+02 4.323e+02 7.420e+02, threshold=6.902e+02, percent-clipped=1.0
2023-06-19 06:50:07,768 INFO [train.py:996] (3/4) Epoch 3, batch 6700, loss[loss=0.2173, simple_loss=0.2754, pruned_loss=0.07958, over 21796.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3247, pruned_loss=0.1025, over 4281298.14 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:51:26,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=406374.0, ans=0.125
2023-06-19 06:51:52,461 INFO [train.py:996] (3/4) Epoch 3, batch 6750, loss[loss=0.2842, simple_loss=0.3403, pruned_loss=0.1141, over 21424.00 frames. ], tot_loss[loss=0.264, simple_loss=0.323, pruned_loss=0.1025, over 4277846.61 frames. ], batch size: 194, lr: 1.16e-02, grad_scale: 16.0
2023-06-19 06:52:40,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.875e+02 3.278e+02 4.228e+02 8.254e+02, threshold=6.556e+02, percent-clipped=2.0
2023-06-19 06:52:55,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=406614.0, ans=0.125
2023-06-19 06:53:04,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=406614.0, ans=0.0
2023-06-19 06:53:32,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0
2023-06-19 06:53:35,007 INFO [train.py:996] (3/4) Epoch 3, batch 6800, loss[loss=0.2827, simple_loss=0.3373, pruned_loss=0.1141, over 21612.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3261, pruned_loss=0.1057, over 4286175.19 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:53:50,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=406794.0, ans=0.0
2023-06-19 06:54:21,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0
2023-06-19 06:55:19,153 INFO [train.py:996] (3/4) Epoch 3, batch 6850, loss[loss=0.2781, simple_loss=0.3268, pruned_loss=0.1147, over 21847.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3255, pruned_loss=0.1076, over 4292717.83 frames. ], batch size: 98, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:55:29,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=407034.0, ans=0.125
2023-06-19 06:55:38,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=407094.0, ans=0.0
2023-06-19 06:55:51,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=407154.0, ans=0.04949747468305833
2023-06-19 06:56:08,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.177e+02 3.620e+02 4.749e+02 9.271e+02, threshold=7.240e+02, percent-clipped=3.0
2023-06-19 06:56:38,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=407274.0, ans=0.0
2023-06-19 06:56:59,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=407334.0, ans=0.125
2023-06-19 06:57:00,788 INFO [train.py:996] (3/4) Epoch 3, batch 6900, loss[loss=0.2259, simple_loss=0.3253, pruned_loss=0.06322, over 21841.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3286, pruned_loss=0.107, over 4284795.29 frames. ], batch size: 371, lr: 1.16e-02, grad_scale: 32.0
2023-06-19 06:57:09,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=407334.0, ans=0.125
2023-06-19 06:57:14,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=407334.0, ans=0.125
2023-06-19 06:57:39,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=407454.0, ans=0.5
2023-06-19 06:58:46,759 INFO [train.py:996] (3/4) Epoch 3, batch 6950, loss[loss=0.2871, simple_loss=0.3598, pruned_loss=0.1072, over 21446.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3291, pruned_loss=0.1026, over 4285471.38 frames. ], batch size: 131, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 06:58:47,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.27 vs. limit=15.0
2023-06-19 06:59:25,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=407754.0, ans=0.0
2023-06-19 06:59:26,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0
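The scaling.py:182 records throughout this section print the current value (ans=...) of named schedulable hyperparameters (skip rates, balancer probabilities, dropout_p, scale_min) at a given batch_count. A minimal sketch of one plausible implementation is below: a float defined by (batch, value) breakpoints with linear interpolation between them and clamping outside. The class name matches the log but the code and the breakpoint numbers are illustrative, not icefall's scaling.py.

class ScheduledFloat:
    def __init__(self, *points):
        # points: (batch0, value0), (batch1, value1), ... sorted by batch
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# e.g. a toy skip-rate decaying from 0.5 to 0.0 over the first 20k batches
# (made-up breakpoints; by batch ~400k it would sit at its final value):
toy_skip_rate = ScheduledFloat((0.0, 0.5), (20000.0, 0.0))
print(toy_skip_rate.value(407154.0))  # 0.0

This explains why most logged skip rates are 0.0 or small constants this deep into training: the schedules have long since reached their endpoints.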
2023-06-19 06:59:42,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.993e+02 3.659e+02 4.526e+02 7.412e+02, threshold=7.319e+02, percent-clipped=1.0
2023-06-19 06:59:48,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=407814.0, ans=0.125
2023-06-19 07:00:10,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407814.0, ans=0.1
2023-06-19 07:00:10,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=407814.0, ans=0.2
2023-06-19 07:00:25,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407874.0, ans=0.1
2023-06-19 07:00:32,082 INFO [train.py:996] (3/4) Epoch 3, batch 7000, loss[loss=0.2745, simple_loss=0.3189, pruned_loss=0.115, over 21329.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3336, pruned_loss=0.1066, over 4286489.42 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:00:45,872 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 07:01:10,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5
2023-06-19 07:01:21,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=408054.0, ans=0.125
2023-06-19 07:01:25,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0
2023-06-19 07:01:28,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408054.0, ans=0.125
2023-06-19 07:02:11,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5
2023-06-19 07:02:19,505 INFO [train.py:996] (3/4) Epoch 3, batch 7050, loss[loss=0.2597, simple_loss=0.3245, pruned_loss=0.09744, over 21478.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3323, pruned_loss=0.1058, over 4284583.35 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:02:28,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0
2023-06-19 07:03:19,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.081e+02 3.762e+02 4.566e+02 1.137e+03, threshold=7.524e+02, percent-clipped=2.0
2023-06-19 07:03:39,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=408414.0, ans=0.0
2023-06-19 07:03:55,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2023-06-19 07:04:10,020 INFO [train.py:996] (3/4) Epoch 3, batch 7100, loss[loss=0.3442, simple_loss=0.3937, pruned_loss=0.1474, over 21426.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3357, pruned_loss=0.1073, over 4286206.93 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:04:20,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=408534.0, ans=0.025
2023-06-19 07:04:36,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=408594.0, ans=0.0
2023-06-19 07:05:17,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=408714.0, ans=0.0
2023-06-19 07:05:55,539 INFO [train.py:996] (3/4) Epoch 3, batch 7150, loss[loss=0.366, simple_loss=0.4055, pruned_loss=0.1632, over 21455.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3323, pruned_loss=0.1036, over 4278972.09 frames. ], batch size: 510, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:05:59,481 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 07:06:03,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=408834.0, ans=0.0
2023-06-19 07:06:28,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=408894.0, ans=0.0
2023-06-19 07:06:37,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=408894.0, ans=0.125
2023-06-19 07:06:54,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0
2023-06-19 07:06:57,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.041e+02 3.413e+02 3.887e+02 5.883e+02, threshold=6.826e+02, percent-clipped=0.0
2023-06-19 07:07:31,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=409074.0, ans=0.125
2023-06-19 07:07:40,870 INFO [train.py:996] (3/4) Epoch 3, batch 7200, loss[loss=0.2393, simple_loss=0.2917, pruned_loss=0.09352, over 21377.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3358, pruned_loss=0.1069, over 4283242.28 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:08:04,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=409194.0, ans=0.07
2023-06-19 07:08:32,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=409254.0, ans=0.125
2023-06-19 07:09:02,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0
2023-06-19 07:09:32,310 INFO [train.py:996] (3/4) Epoch 3, batch 7250, loss[loss=0.2729, simple_loss=0.3196, pruned_loss=0.1132, over 21768.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3328, pruned_loss=0.1068, over 4270085.47 frames. ], batch size: 352, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:09:34,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=409434.0, ans=22.5
2023-06-19 07:09:46,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=409434.0, ans=0.0
2023-06-19 07:10:24,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.049e+02 3.903e+02 5.201e+02 1.242e+03, threshold=7.806e+02, percent-clipped=6.0
2023-06-19 07:11:06,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=409674.0, ans=0.125
2023-06-19 07:11:10,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=409674.0, ans=0.125
2023-06-19 07:11:13,529 INFO [train.py:996] (3/4) Epoch 3, batch 7300, loss[loss=0.2348, simple_loss=0.2818, pruned_loss=0.09392, over 21207.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3263, pruned_loss=0.106, over 4267830.64 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:11:44,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=15.0
2023-06-19 07:11:44,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=409794.0, ans=0.125
2023-06-19 07:11:55,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=409794.0, ans=0.0
2023-06-19 07:12:24,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=409914.0, ans=0.0
2023-06-19 07:12:27,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0
2023-06-19 07:12:54,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409974.0, ans=0.1
2023-06-19 07:12:58,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410034.0, ans=0.1
2023-06-19 07:12:59,459 INFO [train.py:996] (3/4) Epoch 3, batch 7350, loss[loss=0.3206, simple_loss=0.3887, pruned_loss=0.1263, over 21802.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3227, pruned_loss=0.1056, over 4271149.92 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:13:59,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.138e+02 3.671e+02 4.789e+02 1.075e+03, threshold=7.343e+02, percent-clipped=3.0
2023-06-19 07:13:59,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=410154.0, ans=0.0
2023-06-19 07:14:55,694 INFO [train.py:996] (3/4) Epoch 3, batch 7400, loss[loss=0.2446, simple_loss=0.3237, pruned_loss=0.0827, over 21688.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.329, pruned_loss=0.1081, over 4268850.57 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:15:35,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=410454.0, ans=0.2
2023-06-19 07:16:27,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=410574.0, ans=0.0
2023-06-19 07:16:38,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410574.0, ans=0.1
2023-06-19 07:16:48,336 INFO [train.py:996] (3/4) Epoch 3, batch 7450, loss[loss=0.2546, simple_loss=0.3081, pruned_loss=0.1005, over 21589.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.329, pruned_loss=0.1073, over 4260455.88 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:17:02,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=410634.0, ans=0.125
2023-06-19 07:17:09,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=410694.0, ans=0.0
2023-06-19 07:17:09,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=410694.0, ans=0.0
2023-06-19 07:17:41,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.990e+02 3.597e+02 4.537e+02 7.540e+02, threshold=7.195e+02, percent-clipped=1.0
2023-06-19 07:17:42,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=410754.0, ans=0.07
2023-06-19 07:18:35,615 INFO [train.py:996] (3/4) Epoch 3, batch 7500, loss[loss=0.299, simple_loss=0.394, pruned_loss=0.102, over 21786.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3342, pruned_loss=0.108, over 4270349.86 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:19:06,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0
2023-06-19 07:19:46,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5
2023-06-19 07:19:58,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=411114.0, ans=0.125
2023-06-19 07:20:13,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0
2023-06-19 07:20:24,674 INFO [train.py:996] (3/4) Epoch 3, batch 7550, loss[loss=0.3743, simple_loss=0.4326, pruned_loss=0.158, over 21469.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3419, pruned_loss=0.1067, over 4269244.62 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:20:27,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=411234.0, ans=0.125
2023-06-19 07:20:50,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=411294.0, ans=0.09899494936611666
2023-06-19 07:20:56,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0
2023-06-19 07:21:22,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.181e+02 3.735e+02 4.565e+02 8.412e+02, threshold=7.470e+02, percent-clipped=4.0
2023-06-19 07:21:41,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=411414.0, ans=0.04949747468305833
2023-06-19 07:22:09,821 INFO [train.py:996] (3/4) Epoch 3, batch 7600, loss[loss=0.2523, simple_loss=0.3125, pruned_loss=0.09608, over 21309.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3418, pruned_loss=0.1063, over 4271495.73 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:22:11,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=411534.0, ans=0.2
2023-06-19 07:22:33,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=411594.0, ans=0.125
2023-06-19 07:23:09,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=411714.0, ans=0.125
2023-06-19 07:23:14,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=411714.0, ans=0.0
2023-06-19 07:23:20,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=411714.0, ans=0.125
2023-06-19 07:23:35,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411774.0, ans=0.1
2023-06-19 07:23:51,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0
2023-06-19 07:23:55,464 INFO [train.py:996] (3/4) Epoch 3, batch 7650, loss[loss=0.2728, simple_loss=0.334, pruned_loss=0.1058, over 21844.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3414, pruned_loss=0.1084, over 4279111.68 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:24:54,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.811e+02 3.352e+02 3.855e+02 5.541e+02, threshold=6.704e+02, percent-clipped=0.0
2023-06-19 07:25:09,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2023-06-19 07:25:44,615 INFO [train.py:996] (3/4) Epoch 3, batch 7700, loss[loss=0.3347, simple_loss=0.3853, pruned_loss=0.1421, over 21868.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3458, pruned_loss=0.1124, over 4285016.52 frames. ], batch size: 371, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:26:31,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=412254.0, ans=0.125
2023-06-19 07:26:31,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=412254.0, ans=0.04949747468305833
2023-06-19 07:27:08,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=412314.0, ans=0.5
2023-06-19 07:27:08,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412314.0, ans=0.125
2023-06-19 07:27:15,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412374.0, ans=0.125
2023-06-19 07:27:25,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0
2023-06-19 07:27:32,038 INFO [train.py:996] (3/4) Epoch 3, batch 7750, loss[loss=0.332, simple_loss=0.3857, pruned_loss=0.1391, over 21421.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3528, pruned_loss=0.1131, over 4280234.27 frames. ], batch size: 548, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:27:48,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.23 vs. limit=15.0
2023-06-19 07:28:22,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=412554.0, ans=0.125
2023-06-19 07:28:46,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.582e+02 4.541e+02 5.903e+02 1.038e+03, threshold=9.082e+02, percent-clipped=9.0
2023-06-19 07:28:47,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=412554.0, ans=0.05
2023-06-19 07:29:10,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.07 vs. limit=15.0
2023-06-19 07:29:24,305 INFO [train.py:996] (3/4) Epoch 3, batch 7800, loss[loss=0.312, simple_loss=0.3747, pruned_loss=0.1247, over 21833.00 frames. ], tot_loss[loss=0.287, simple_loss=0.351, pruned_loss=0.1116, over 4276871.17 frames. ], batch size: 372, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:31:13,605 INFO [train.py:996] (3/4) Epoch 3, batch 7850, loss[loss=0.3294, simple_loss=0.3614, pruned_loss=0.1487, over 21341.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3449, pruned_loss=0.1114, over 4270849.66 frames. ], batch size: 473, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:31:24,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413034.0, ans=0.1
2023-06-19 07:31:46,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0
2023-06-19 07:32:11,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=413154.0, ans=0.0
2023-06-19 07:32:22,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=413154.0, ans=0.125
2023-06-19 07:32:23,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.181e+02 3.685e+02 4.397e+02 7.326e+02, threshold=7.370e+02, percent-clipped=0.0
2023-06-19 07:32:41,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413214.0, ans=0.1
2023-06-19 07:33:06,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=413334.0, ans=0.125
2023-06-19 07:33:08,237 INFO [train.py:996] (3/4) Epoch 3, batch 7900, loss[loss=0.2907, simple_loss=0.3771, pruned_loss=0.1021, over 21785.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3406, pruned_loss=0.1102, over 4264284.34 frames. ], batch size: 371, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:33:23,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=413334.0, ans=0.125
2023-06-19 07:34:56,177 INFO [train.py:996] (3/4) Epoch 3, batch 7950, loss[loss=0.2882, simple_loss=0.4016, pruned_loss=0.08746, over 19793.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3461, pruned_loss=0.1093, over 4265353.72 frames. ], batch size: 702, lr: 1.15e-02, grad_scale: 16.0
2023-06-19 07:35:36,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=413694.0, ans=0.0
2023-06-19 07:35:48,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0
2023-06-19 07:35:56,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 2.859e+02 3.738e+02 4.773e+02 1.037e+03, threshold=7.477e+02, percent-clipped=3.0
2023-06-19 07:36:44,338 INFO [train.py:996] (3/4) Epoch 3, batch 8000, loss[loss=0.307, simple_loss=0.4254, pruned_loss=0.09433, over 20798.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3507, pruned_loss=0.1122, over 4270554.91 frames. ], batch size: 607, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:37:10,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.74 vs. limit=22.5
2023-06-19 07:37:38,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=414054.0, ans=0.95
2023-06-19 07:38:11,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=414114.0, ans=22.5
2023-06-19 07:38:20,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0
2023-06-19 07:38:22,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=414174.0, ans=0.025
2023-06-19 07:38:46,660 INFO [train.py:996] (3/4) Epoch 3, batch 8050, loss[loss=0.2643, simple_loss=0.3367, pruned_loss=0.09598, over 21817.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3532, pruned_loss=0.1115, over 4268140.07 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:39:04,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414294.0, ans=0.1
2023-06-19 07:39:14,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=414294.0, ans=0.0
2023-06-19 07:39:47,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.438e+02 3.985e+02 5.129e+02 7.856e+02, threshold=7.969e+02, percent-clipped=2.0
2023-06-19 07:40:08,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0
2023-06-19 07:40:10,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=414474.0, ans=0.0
2023-06-19 07:40:35,173 INFO [train.py:996] (3/4) Epoch 3, batch 8100, loss[loss=0.3105, simple_loss=0.3609, pruned_loss=0.1301, over 20882.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3492, pruned_loss=0.1109, over 4264032.81 frames. ], batch size: 608, lr: 1.15e-02, grad_scale: 32.0
2023-06-19 07:40:44,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=414534.0, ans=0.2
2023-06-19 07:42:24,518 INFO [train.py:996] (3/4) Epoch 3, batch 8150, loss[loss=0.2682, simple_loss=0.3595, pruned_loss=0.08843, over 21704.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3608, pruned_loss=0.1137, over 4269514.83 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:43:38,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.875e+02 3.410e+02 4.043e+02 9.100e+02, threshold=6.821e+02, percent-clipped=2.0
2023-06-19 07:44:13,617 INFO [train.py:996] (3/4) Epoch 3, batch 8200, loss[loss=0.2478, simple_loss=0.3044, pruned_loss=0.0956, over 20801.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3512, pruned_loss=0.1107, over 4268584.24 frames. ], batch size: 609, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:44:27,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=415134.0, ans=0.1
2023-06-19 07:45:45,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=6.0
2023-06-19 07:45:58,711 INFO [train.py:996] (3/4) Epoch 3, batch 8250, loss[loss=0.2672, simple_loss=0.3437, pruned_loss=0.09529, over 21673.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.349, pruned_loss=0.1101, over 4263701.66 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:46:47,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=415494.0, ans=0.125
2023-06-19 07:46:49,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=415554.0, ans=0.125
2023-06-19 07:46:52,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415554.0, ans=0.1
2023-06-19 07:47:12,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.026e+02 3.791e+02 5.480e+02 8.265e+02, threshold=7.583e+02, percent-clipped=10.0
2023-06-19 07:47:30,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=12.0
2023-06-19 07:47:52,037 INFO [train.py:996] (3/4) Epoch 3, batch 8300, loss[loss=0.2992, simple_loss=0.4352, pruned_loss=0.0816, over 20737.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3467, pruned_loss=0.1062, over 4273184.28 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:48:29,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=415794.0, ans=0.0
2023-06-19 07:49:08,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=415914.0, ans=0.0
2023-06-19 07:49:38,484 INFO [train.py:996] (3/4) Epoch 3, batch 8350, loss[loss=0.2721, simple_loss=0.3369, pruned_loss=0.1037, over 21726.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3438, pruned_loss=0.1032, over 4274715.28 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:49:41,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=22.5
2023-06-19 07:49:54,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=416034.0, ans=0.04949747468305833
2023-06-19 07:50:23,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=416154.0, ans=0.125
2023-06-19 07:50:46,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.962e+02 3.800e+02 4.795e+02 8.641e+02, threshold=7.601e+02, percent-clipped=3.0
2023-06-19 07:51:09,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416274.0, ans=0.1
2023-06-19 07:51:23,695 INFO [train.py:996] (3/4) Epoch 3, batch 8400, loss[loss=0.2133, simple_loss=0.2887, pruned_loss=0.0689, over 21199.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3414, pruned_loss=0.1013, over 4270414.48 frames. ], batch size: 143, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 07:51:37,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=416334.0, ans=0.125
2023-06-19 07:51:42,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=416334.0, ans=0.125
2023-06-19 07:51:47,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5
2023-06-19 07:52:27,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=416514.0, ans=0.0
2023-06-19 07:53:04,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=416574.0, ans=10.0
2023-06-19 07:53:07,335 INFO [train.py:996] (3/4) Epoch 3, batch 8450, loss[loss=0.2463, simple_loss=0.308, pruned_loss=0.09229, over 21852.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3383, pruned_loss=0.0997, over 4279423.52 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 07:54:05,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416754.0, ans=0.1
2023-06-19 07:54:15,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.539e+02 3.236e+02 3.912e+02 6.365e+02, threshold=6.471e+02, percent-clipped=0.0
2023-06-19 07:54:30,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416874.0, ans=0.125
2023-06-19 07:54:45,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416874.0, ans=0.1
2023-06-19 07:54:59,966 INFO [train.py:996] (3/4) Epoch 3, batch 8500, loss[loss=0.2484, simple_loss=0.2963, pruned_loss=0.1003, over 21245.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3362, pruned_loss=0.1027, over 4272386.21 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 07:55:09,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5
2023-06-19 07:55:14,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5
2023-06-19 07:55:15,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.81 vs. limit=22.5
2023-06-19 07:55:22,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=416994.0, ans=10.0
2023-06-19 07:55:51,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.08 vs. limit=10.0
2023-06-19 07:55:57,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=417054.0, ans=0.125
2023-06-19 07:55:59,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=417054.0, ans=0.125
2023-06-19 07:56:13,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=417114.0, ans=0.125
2023-06-19 07:56:48,417 INFO [train.py:996] (3/4) Epoch 3, batch 8550, loss[loss=0.2795, simple_loss=0.3595, pruned_loss=0.09972, over 21810.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.341, pruned_loss=0.1064, over 4282346.10 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:56:55,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=417234.0, ans=0.0
2023-06-19 07:57:03,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0
2023-06-19 07:57:29,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0
2023-06-19 07:57:32,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=417354.0, ans=0.05
2023-06-19 07:57:44,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=417414.0, ans=0.125
2023-06-19 07:57:47,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.294e+02 4.189e+02 5.052e+02 1.014e+03, threshold=8.378e+02, percent-clipped=9.0
2023-06-19 07:58:03,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=417414.0, ans=0.05
2023-06-19 07:58:10,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0
2023-06-19 07:58:32,100 INFO [train.py:996] (3/4) Epoch 3, batch 8600, loss[loss=0.2795, simple_loss=0.3475, pruned_loss=0.1057, over 21705.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3506, pruned_loss=0.11, over 4279900.46 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 07:58:48,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=417534.0, ans=0.0
2023-06-19 08:00:19,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=417834.0, ans=0.125
2023-06-19 08:00:20,156 INFO [train.py:996] (3/4) Epoch 3, batch 8650, loss[loss=0.2391, simple_loss=0.3311, pruned_loss=0.07352, over 21839.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.359, pruned_loss=0.113, over 4281947.26 frames. ], batch size: 316, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 08:00:28,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=417834.0, ans=0.0
2023-06-19 08:00:30,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=417834.0, ans=0.125
2023-06-19 08:00:39,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=417834.0, ans=0.125
2023-06-19 08:01:16,821 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 08:01:26,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=418014.0, ans=0.125
2023-06-19 08:01:28,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.885e+02 3.395e+02 4.112e+02 7.467e+02, threshold=6.789e+02, percent-clipped=0.0
2023-06-19 08:01:34,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=418014.0, ans=0.09899494936611666
2023-06-19 08:02:05,235 INFO [train.py:996] (3/4) Epoch 3, batch 8700, loss[loss=0.2456, simple_loss=0.3021, pruned_loss=0.0945, over 21535.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3485, pruned_loss=0.1076, over 4282795.66 frames. ], batch size: 196, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 08:02:37,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0
2023-06-19 08:03:11,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418314.0, ans=0.1
2023-06-19 08:03:43,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=418374.0, ans=0.2
2023-06-19 08:03:57,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0
2023-06-19 08:03:57,797 INFO [train.py:996] (3/4) Epoch 3, batch 8750, loss[loss=0.2529, simple_loss=0.3188, pruned_loss=0.09357, over 21820.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3448, pruned_loss=0.1082, over 4286724.75 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 16.0
2023-06-19 08:04:27,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=418494.0, ans=0.125
2023-06-19 08:05:06,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 3.018e+02 3.630e+02 4.545e+02 8.299e+02, threshold=7.260e+02, percent-clipped=2.0
2023-06-19 08:05:07,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=418614.0, ans=0.0
2023-06-19 08:05:11,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=418614.0, ans=0.0
2023-06-19 08:05:44,847 INFO [train.py:996] (3/4) Epoch 3, batch 8800, loss[loss=0.2482, simple_loss=0.3566, pruned_loss=0.06988, over 20827.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3531, pruned_loss=0.1111, over 4285905.39 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 08:05:49,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0
2023-06-19 08:05:53,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=418734.0, ans=0.0
2023-06-19 08:06:28,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
2023-06-19 08:07:22,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=418974.0, ans=0.125
2023-06-19 08:07:41,990 INFO [train.py:996] (3/4) Epoch 3, batch 8850, loss[loss=0.3119, simple_loss=0.3846, pruned_loss=0.1196, over 21718.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3605, pruned_loss=0.1142, over 4288540.95 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 32.0
2023-06-19 08:08:30,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.47 vs. limit=10.0
2023-06-19 08:08:46,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.554e+02 4.291e+02 5.667e+02 9.091e+02, threshold=8.581e+02, percent-clipped=5.0
2023-06-19 08:09:01,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0
limit=15.0 2023-06-19 08:09:02,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419214.0, ans=0.125 2023-06-19 08:09:09,895 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-19 08:09:29,172 INFO [train.py:996] (3/4) Epoch 3, batch 8900, loss[loss=0.2819, simple_loss=0.3663, pruned_loss=0.0987, over 21602.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3562, pruned_loss=0.1136, over 4274336.69 frames. ], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:09:46,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419394.0, ans=0.125 2023-06-19 08:10:17,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=419454.0, ans=0.04949747468305833 2023-06-19 08:10:33,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=419454.0, ans=0.125 2023-06-19 08:10:44,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=419514.0, ans=0.0 2023-06-19 08:11:10,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419574.0, ans=0.125 2023-06-19 08:11:13,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419574.0, ans=0.1 2023-06-19 08:11:18,144 INFO [train.py:996] (3/4) Epoch 3, batch 8950, loss[loss=0.292, simple_loss=0.345, pruned_loss=0.1195, over 21644.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3563, pruned_loss=0.1129, over 4278822.93 frames. ], batch size: 263, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:11:24,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-19 08:11:32,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=419634.0, ans=0.95 2023-06-19 08:11:57,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=419694.0, ans=0.125 2023-06-19 08:12:01,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 08:12:27,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 3.236e+02 4.208e+02 5.168e+02 9.134e+02, threshold=8.417e+02, percent-clipped=1.0 2023-06-19 08:12:28,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-19 08:13:04,995 INFO [train.py:996] (3/4) Epoch 3, batch 9000, loss[loss=0.235, simple_loss=0.2885, pruned_loss=0.0908, over 21831.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.351, pruned_loss=0.1127, over 4279919.07 frames. 
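A consistency check on the three loss figures reported in these records: every (loss, simple_loss, pruned_loss) triple satisfies loss = 0.5 * simple_loss + pruned_loss to logging precision (batch 9000 above: 0.5 * 0.2885 + 0.0908 = 0.23505 ~ 0.235), i.e. a pruned-transducer objective with the smoothed "simple" term down-weighted by half. A minimal sketch of that combination, with the 0.5 weight inferred from the logged numbers rather than read out of the training code:

def transducer_loss(simple_loss: float, pruned_loss: float,
                    simple_loss_scale: float = 0.5) -> float:
    # loss = scale * simple_loss + pruned_loss reproduces every
    # (loss, simple_loss, pruned_loss) triple in these records;
    # the 0.5 default is inferred, not taken from the recipe.
    return simple_loss_scale * simple_loss + pruned_loss

# batch-9000 record above: 0.5 * 0.2885 + 0.0908 = 0.23505 ~= 0.235
assert abs(transducer_loss(0.2885, 0.0908) - 0.235) < 5e-4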
], batch size: 118, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:13:04,996 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 08:13:24,323 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3793, pruned_loss=0.08906, over 1796401.00 frames. 2023-06-19 08:13:24,324 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 08:13:35,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=419934.0, ans=0.125 2023-06-19 08:13:38,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=419934.0, ans=0.125 2023-06-19 08:13:48,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-19 08:14:25,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=420054.0, ans=0.05 2023-06-19 08:14:43,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-19 08:15:16,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=420174.0, ans=0.04949747468305833 2023-06-19 08:15:19,187 INFO [train.py:996] (3/4) Epoch 3, batch 9050, loss[loss=0.2839, simple_loss=0.3509, pruned_loss=0.1084, over 21698.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3459, pruned_loss=0.1094, over 4278367.77 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:15:39,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=420294.0, ans=0.125 2023-06-19 08:15:53,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=420294.0, ans=0.0 2023-06-19 08:15:54,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-19 08:16:23,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 3.182e+02 3.849e+02 4.740e+02 7.257e+02, threshold=7.697e+02, percent-clipped=0.0 2023-06-19 08:16:39,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-19 08:16:47,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=420474.0, ans=0.125 2023-06-19 08:17:05,677 INFO [train.py:996] (3/4) Epoch 3, batch 9100, loss[loss=0.2926, simple_loss=0.3582, pruned_loss=0.1135, over 21777.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3527, pruned_loss=0.1122, over 4275279.48 frames. ], batch size: 118, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:17:17,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=420534.0, ans=0.0 2023-06-19 08:17:34,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. 
limit=15.0 2023-06-19 08:17:42,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=420594.0, ans=0.0 2023-06-19 08:17:42,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=420594.0, ans=0.0 2023-06-19 08:18:04,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=420654.0, ans=0.1 2023-06-19 08:18:20,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=420714.0, ans=0.125 2023-06-19 08:18:52,689 INFO [train.py:996] (3/4) Epoch 3, batch 9150, loss[loss=0.3353, simple_loss=0.4148, pruned_loss=0.1279, over 21512.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.353, pruned_loss=0.1079, over 4275971.25 frames. ], batch size: 471, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:19:09,670 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:20:00,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.911e+02 3.357e+02 4.018e+02 6.144e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-19 08:20:45,074 INFO [train.py:996] (3/4) Epoch 3, batch 9200, loss[loss=0.3199, simple_loss=0.3845, pruned_loss=0.1277, over 21751.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3535, pruned_loss=0.1055, over 4275313.02 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:21:36,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=421254.0, ans=0.0 2023-06-19 08:22:04,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-19 08:22:31,098 INFO [train.py:996] (3/4) Epoch 3, batch 9250, loss[loss=0.2689, simple_loss=0.3246, pruned_loss=0.1066, over 21258.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3564, pruned_loss=0.1093, over 4277966.23 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:22:43,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421434.0, ans=0.0 2023-06-19 08:23:03,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=421494.0, ans=0.125 2023-06-19 08:23:14,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=421554.0, ans=0.0 2023-06-19 08:23:39,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.086e+02 3.800e+02 4.447e+02 7.339e+02, threshold=7.599e+02, percent-clipped=2.0 2023-06-19 08:23:39,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=421614.0, ans=10.0 2023-06-19 08:24:17,004 INFO [train.py:996] (3/4) Epoch 3, batch 9300, loss[loss=0.2429, simple_loss=0.2947, pruned_loss=0.0956, over 21873.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3503, pruned_loss=0.1088, over 4267265.87 frames. 
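grad_scale is the dynamic loss-scaling factor of the fp16 training loop: it doubles after a long overflow-free stretch (16.0 -> 32.0 at batch 8800 above) and is halved when infs appear in the gradients (it drops back to 16.0 near batch 10050 later in this log). A generic torch.cuda.amp sketch of that loop; the model and loss are placeholders, and this shows the stock GradScaler behavior rather than anything specific to this recipe:

import torch

model = torch.nn.Linear(80, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # grows/shrinks the scale automatically

def train_step(feats: torch.Tensor, targets: torch.Tensor) -> float:
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():             # fp16 forward pass
        loss = torch.nn.functional.mse_loss(model(feats), targets)
    scaler.scale(loss).backward()               # backward on the scaled loss
    scaler.step(opt)                            # skipped if grads overflowed
    scaler.update()                             # double or halve the scale
    return scaler.get_scale()                   # the logged "grad_scale"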
], batch size: 98, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:24:37,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=421734.0, ans=0.125 2023-06-19 08:24:45,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=421794.0, ans=0.125 2023-06-19 08:24:49,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=421794.0, ans=0.015 2023-06-19 08:25:38,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421914.0, ans=0.1 2023-06-19 08:25:43,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421914.0, ans=0.0 2023-06-19 08:25:45,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-19 08:26:03,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=422034.0, ans=0.2 2023-06-19 08:26:11,696 INFO [train.py:996] (3/4) Epoch 3, batch 9350, loss[loss=0.3217, simple_loss=0.3878, pruned_loss=0.1278, over 21611.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3571, pruned_loss=0.1106, over 4270567.27 frames. ], batch size: 230, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:26:18,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=422034.0, ans=0.125 2023-06-19 08:26:56,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-19 08:27:16,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422154.0, ans=0.1 2023-06-19 08:27:20,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.071e+02 3.690e+02 4.644e+02 6.944e+02, threshold=7.381e+02, percent-clipped=0.0 2023-06-19 08:27:50,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=422274.0, ans=0.0 2023-06-19 08:27:59,150 INFO [train.py:996] (3/4) Epoch 3, batch 9400, loss[loss=0.2758, simple_loss=0.3237, pruned_loss=0.114, over 21740.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3595, pruned_loss=0.1125, over 4278373.84 frames. ], batch size: 112, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:29:32,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=422574.0, ans=0.125 2023-06-19 08:29:40,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=422574.0, ans=0.09899494936611666 2023-06-19 08:29:42,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422634.0, ans=0.1 2023-06-19 08:29:43,988 INFO [train.py:996] (3/4) Epoch 3, batch 9450, loss[loss=0.2375, simple_loss=0.2848, pruned_loss=0.09511, over 21300.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3497, pruned_loss=0.1101, over 4268650.16 frames. 
], batch size: 551, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:30:51,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 3.147e+02 3.732e+02 4.957e+02 8.626e+02, threshold=7.464e+02, percent-clipped=5.0 2023-06-19 08:31:02,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422814.0, ans=0.1 2023-06-19 08:31:28,556 INFO [train.py:996] (3/4) Epoch 3, batch 9500, loss[loss=0.2765, simple_loss=0.3315, pruned_loss=0.1108, over 21509.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3413, pruned_loss=0.1076, over 4251503.75 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:32:23,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-19 08:33:12,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=423234.0, ans=0.125 2023-06-19 08:33:14,242 INFO [train.py:996] (3/4) Epoch 3, batch 9550, loss[loss=0.2971, simple_loss=0.3779, pruned_loss=0.1081, over 21477.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3451, pruned_loss=0.1093, over 4259384.51 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:34:20,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.824e+02 3.274e+02 3.853e+02 7.090e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:34:58,269 INFO [train.py:996] (3/4) Epoch 3, batch 9600, loss[loss=0.279, simple_loss=0.3302, pruned_loss=0.1139, over 21396.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3475, pruned_loss=0.1113, over 4262863.69 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:35:04,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=423534.0, ans=0.0 2023-06-19 08:35:05,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-19 08:35:32,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=423594.0, ans=0.0 2023-06-19 08:35:35,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-19 08:36:01,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=423654.0, ans=0.02 2023-06-19 08:36:02,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=423714.0, ans=0.0 2023-06-19 08:36:45,581 INFO [train.py:996] (3/4) Epoch 3, batch 9650, loss[loss=0.2795, simple_loss=0.3459, pruned_loss=0.1065, over 21761.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3502, pruned_loss=0.1131, over 4268335.73 frames. 
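Each Clipping_scale record above summarizes the recent gradient-norm distribution as five quantiles (min, 25%, median, 75%, max) plus the active clipping threshold, and in every record the threshold equals clipping_scale times the median: 2.0 * 3.732e+02 = 7.464e+02 in the first entry above. A sketch of that rule and of how percent-clipped would be tallied; this illustrates the logged relation rather than reproducing the optimizer's source:

import torch

def clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # grad_norms: 1-D tensor of recently observed per-step gradient norms.
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]            # 2.0 * median, as logged
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold.item(), percent_clipped.item()

norms = torch.tensor([198.3, 314.7, 373.2, 495.7, 862.6])  # quartiles from above
print(clip_stats(norms))   # threshold = 746.4; only the 862.6 step is clipped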
], batch size: 332, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:36:46,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=423834.0, ans=0.125 2023-06-19 08:36:54,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423834.0, ans=0.1 2023-06-19 08:37:03,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=423834.0, ans=0.95 2023-06-19 08:37:16,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423894.0, ans=0.1 2023-06-19 08:37:53,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=423954.0, ans=0.125 2023-06-19 08:37:58,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.232e+02 3.864e+02 5.587e+02 9.927e+02, threshold=7.728e+02, percent-clipped=9.0 2023-06-19 08:38:01,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=424014.0, ans=0.0 2023-06-19 08:38:07,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=424014.0, ans=0.125 2023-06-19 08:38:40,947 INFO [train.py:996] (3/4) Epoch 3, batch 9700, loss[loss=0.2798, simple_loss=0.3764, pruned_loss=0.0916, over 20728.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3528, pruned_loss=0.1128, over 4273159.90 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:38:49,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-19 08:39:33,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=424254.0, ans=0.125 2023-06-19 08:39:36,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=424254.0, ans=0.125 2023-06-19 08:39:39,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=424314.0, ans=0.125 2023-06-19 08:39:46,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=424314.0, ans=0.2 2023-06-19 08:40:18,519 INFO [train.py:996] (3/4) Epoch 3, batch 9750, loss[loss=0.2716, simple_loss=0.3139, pruned_loss=0.1146, over 21502.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3473, pruned_loss=0.1117, over 4268772.69 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:40:39,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=424494.0, ans=0.125 2023-06-19 08:40:55,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-19 08:41:00,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. 
limit=12.0 2023-06-19 08:41:18,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.037e+02 3.853e+02 4.411e+02 7.266e+02, threshold=7.707e+02, percent-clipped=0.0 2023-06-19 08:41:55,542 INFO [train.py:996] (3/4) Epoch 3, batch 9800, loss[loss=0.3055, simple_loss=0.3657, pruned_loss=0.1226, over 21845.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3477, pruned_loss=0.1131, over 4267393.93 frames. ], batch size: 124, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:42:02,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=424734.0, ans=0.0 2023-06-19 08:43:29,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424974.0, ans=0.1 2023-06-19 08:43:39,038 INFO [train.py:996] (3/4) Epoch 3, batch 9850, loss[loss=0.2493, simple_loss=0.3276, pruned_loss=0.08557, over 20744.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3444, pruned_loss=0.1124, over 4249773.01 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:43:46,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425034.0, ans=0.1 2023-06-19 08:44:39,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=425154.0, ans=0.125 2023-06-19 08:44:50,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.794e+02 3.273e+02 4.007e+02 7.022e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:45:07,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=425274.0, ans=0.125 2023-06-19 08:45:15,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=425274.0, ans=0.0 2023-06-19 08:45:21,538 INFO [train.py:996] (3/4) Epoch 3, batch 9900, loss[loss=0.2787, simple_loss=0.347, pruned_loss=0.1052, over 21444.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3395, pruned_loss=0.1111, over 4244445.88 frames. ], batch size: 194, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:45:51,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=425394.0, ans=0.0 2023-06-19 08:47:04,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=425634.0, ans=0.125 2023-06-19 08:47:10,943 INFO [train.py:996] (3/4) Epoch 3, batch 9950, loss[loss=0.2525, simple_loss=0.3103, pruned_loss=0.09738, over 21601.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3409, pruned_loss=0.1127, over 4255190.91 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:47:26,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=425634.0, ans=0.2 2023-06-19 08:47:56,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=425754.0, ans=0.125 2023-06-19 08:48:02,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.00 vs. 
limit=12.0 2023-06-19 08:48:18,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.027e+02 3.494e+02 4.278e+02 9.586e+02, threshold=6.989e+02, percent-clipped=3.0 2023-06-19 08:49:03,418 INFO [train.py:996] (3/4) Epoch 3, batch 10000, loss[loss=0.2131, simple_loss=0.2674, pruned_loss=0.07937, over 21240.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3344, pruned_loss=0.1107, over 4252791.54 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:49:33,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=425994.0, ans=0.0 2023-06-19 08:50:01,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426054.0, ans=0.125 2023-06-19 08:50:45,779 INFO [train.py:996] (3/4) Epoch 3, batch 10050, loss[loss=0.2857, simple_loss=0.339, pruned_loss=0.1162, over 21773.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3374, pruned_loss=0.1115, over 4255796.35 frames. ], batch size: 124, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:50:46,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=426234.0, ans=0.0 2023-06-19 08:51:21,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-19 08:51:22,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=426354.0, ans=0.1 2023-06-19 08:51:47,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=426414.0, ans=0.025 2023-06-19 08:51:50,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.885e+02 3.512e+02 4.083e+02 6.660e+02, threshold=7.024e+02, percent-clipped=0.0 2023-06-19 08:52:23,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=426474.0, ans=0.125 2023-06-19 08:52:37,731 INFO [train.py:996] (3/4) Epoch 3, batch 10100, loss[loss=0.3136, simple_loss=0.3783, pruned_loss=0.1244, over 19886.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3362, pruned_loss=0.1088, over 4238857.09 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:53:47,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-19 08:54:23,735 INFO [train.py:996] (3/4) Epoch 3, batch 10150, loss[loss=0.2786, simple_loss=0.3271, pruned_loss=0.1151, over 15906.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.342, pruned_loss=0.1114, over 4237422.34 frames. 
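The ScheduledFloat records track per-module hyperparameters (skip rates, balancer probabilities, dropout, bypass scale floors) that are annealed as a function of batch_count, with ans the value in force at that batch. A plausible minimal implementation is piecewise-linear interpolation over (batch_count, value) breakpoints; the schedule in the demo is invented for illustration, not taken from the recipe:

from typing import List, Tuple

def scheduled_float(batch_count: float,
                    points: List[Tuple[float, float]]) -> float:
    # Piecewise-linear schedule over sorted (batch, value) breakpoints,
    # held constant outside the covered range.
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return points[-1][1]

# Hypothetical conv_skip_rate schedule: 0.5 until batch 4k, decayed to 0 by 20k,
# which would report ans=0.0 at the batch counts seen here (~4.27e5).
print(scheduled_float(427134.0, [(4000.0, 0.5), (20000.0, 0.0)]))  # -> 0.0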
], batch size: 60, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:54:49,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=426894.0, ans=0.2 2023-06-19 08:54:59,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=426894.0, ans=0.5 2023-06-19 08:55:01,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=426894.0, ans=0.0 2023-06-19 08:55:35,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.064e+02 3.671e+02 4.442e+02 6.348e+02, threshold=7.343e+02, percent-clipped=0.0 2023-06-19 08:56:10,008 INFO [train.py:996] (3/4) Epoch 3, batch 10200, loss[loss=0.2085, simple_loss=0.2776, pruned_loss=0.06971, over 21172.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3383, pruned_loss=0.1077, over 4241112.64 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:56:10,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427134.0, ans=0.1 2023-06-19 08:56:15,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=427134.0, ans=0.125 2023-06-19 08:56:16,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=427134.0, ans=0.2 2023-06-19 08:56:18,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=427134.0, ans=0.0 2023-06-19 08:56:42,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-19 08:56:42,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-19 08:56:45,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-19 08:56:54,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=427254.0, ans=0.125 2023-06-19 08:57:14,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427314.0, ans=0.1 2023-06-19 08:57:37,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=427374.0, ans=0.5 2023-06-19 08:57:57,543 INFO [train.py:996] (3/4) Epoch 3, batch 10250, loss[loss=0.1976, simple_loss=0.2812, pruned_loss=0.05695, over 21391.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3315, pruned_loss=0.1007, over 4235176.16 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:58:56,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=427554.0, ans=0.125 2023-06-19 08:59:07,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.442e+02 2.730e+02 3.285e+02 6.537e+02, threshold=5.460e+02, percent-clipped=0.0 2023-06-19 08:59:43,911 INFO [train.py:996] (3/4) Epoch 3, batch 10300, loss[loss=0.2995, simple_loss=0.3661, pruned_loss=0.1164, over 20045.00 frames. 
], tot_loss[loss=0.2729, simple_loss=0.3374, pruned_loss=0.1042, over 4238509.43 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:00:07,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-06-19 09:00:24,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=427794.0, ans=0.125 2023-06-19 09:01:11,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=427914.0, ans=0.125 2023-06-19 09:01:37,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=428034.0, ans=0.0 2023-06-19 09:01:38,588 INFO [train.py:996] (3/4) Epoch 3, batch 10350, loss[loss=0.2157, simple_loss=0.2751, pruned_loss=0.07813, over 21643.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.336, pruned_loss=0.1023, over 4244112.74 frames. ], batch size: 247, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:02:08,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-19 09:02:50,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.191e+02 3.696e+02 4.662e+02 9.387e+02, threshold=7.392e+02, percent-clipped=8.0 2023-06-19 09:03:26,637 INFO [train.py:996] (3/4) Epoch 3, batch 10400, loss[loss=0.2703, simple_loss=0.3401, pruned_loss=0.1002, over 21916.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3283, pruned_loss=0.1001, over 4247944.28 frames. ], batch size: 373, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:03:42,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=428334.0, ans=0.125 2023-06-19 09:04:15,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428454.0, ans=0.1 2023-06-19 09:04:15,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-19 09:04:32,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=428514.0, ans=0.0 2023-06-19 09:05:19,057 INFO [train.py:996] (3/4) Epoch 3, batch 10450, loss[loss=0.3105, simple_loss=0.3571, pruned_loss=0.1319, over 21409.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3348, pruned_loss=0.1044, over 4249773.22 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:06:00,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=428694.0, ans=0.125 2023-06-19 09:06:28,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.522e+02 4.182e+02 5.744e+02 1.036e+03, threshold=8.363e+02, percent-clipped=11.0 2023-06-19 09:06:30,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=428814.0, ans=0.0 2023-06-19 09:06:44,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. 
limit=15.0 2023-06-19 09:07:03,320 INFO [train.py:996] (3/4) Epoch 3, batch 10500, loss[loss=0.2542, simple_loss=0.3112, pruned_loss=0.09856, over 21717.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.334, pruned_loss=0.1034, over 4251515.25 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:07:43,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=428994.0, ans=0.02 2023-06-19 09:08:49,580 INFO [train.py:996] (3/4) Epoch 3, batch 10550, loss[loss=0.2849, simple_loss=0.3312, pruned_loss=0.1192, over 21849.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3284, pruned_loss=0.103, over 4251008.68 frames. ], batch size: 373, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:08:49,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=429234.0, ans=0.95 2023-06-19 09:09:20,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=429294.0, ans=0.2 2023-06-19 09:09:34,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=429354.0, ans=0.125 2023-06-19 09:10:00,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.936e+02 3.551e+02 4.371e+02 5.985e+02, threshold=7.102e+02, percent-clipped=0.0 2023-06-19 09:10:03,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=429414.0, ans=0.025 2023-06-19 09:10:04,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=429414.0, ans=0.125 2023-06-19 09:10:13,525 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:10:15,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=429474.0, ans=0.0 2023-06-19 09:10:21,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=22.5 2023-06-19 09:10:36,933 INFO [train.py:996] (3/4) Epoch 3, batch 10600, loss[loss=0.2237, simple_loss=0.3068, pruned_loss=0.07024, over 21758.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3242, pruned_loss=0.1011, over 4248726.40 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:10:39,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=429534.0, ans=0.2 2023-06-19 09:11:41,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=429654.0, ans=0.07 2023-06-19 09:12:02,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=429714.0, ans=0.2 2023-06-19 09:12:31,296 INFO [train.py:996] (3/4) Epoch 3, batch 10650, loss[loss=0.2178, simple_loss=0.2897, pruned_loss=0.07293, over 21583.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3296, pruned_loss=0.1002, over 4252606.89 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:12:32,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-19 09:13:08,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=429894.0, ans=0.0 2023-06-19 09:13:41,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.491e+02 4.438e+02 6.074e+02 1.034e+03, threshold=8.876e+02, percent-clipped=13.0 2023-06-19 09:13:50,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=430014.0, ans=0.05 2023-06-19 09:14:17,989 INFO [train.py:996] (3/4) Epoch 3, batch 10700, loss[loss=0.2762, simple_loss=0.3341, pruned_loss=0.1091, over 21324.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3295, pruned_loss=0.1012, over 4254664.69 frames. ], batch size: 159, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:14:37,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=430134.0, ans=0.2 2023-06-19 09:15:28,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=430314.0, ans=0.125 2023-06-19 09:15:42,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=430314.0, ans=0.04949747468305833 2023-06-19 09:15:49,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430374.0, ans=0.125 2023-06-19 09:16:09,154 INFO [train.py:996] (3/4) Epoch 3, batch 10750, loss[loss=0.3321, simple_loss=0.4153, pruned_loss=0.1244, over 21672.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3411, pruned_loss=0.1067, over 4254982.98 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:16:24,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-19 09:17:19,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.011e+02 3.555e+02 4.505e+02 9.587e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-19 09:17:28,338 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:17:38,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=430674.0, ans=0.0 2023-06-19 09:17:45,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=430674.0, ans=0.125 2023-06-19 09:18:01,331 INFO [train.py:996] (3/4) Epoch 3, batch 10800, loss[loss=0.3223, simple_loss=0.3834, pruned_loss=0.1306, over 21568.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3477, pruned_loss=0.1076, over 4257420.14 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:18:05,320 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:18:09,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=430734.0, ans=0.125 2023-06-19 09:19:22,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430914.0, ans=0.125 2023-06-19 09:19:48,709 INFO [train.py:996] (3/4) Epoch 3, batch 10850, loss[loss=0.2391, simple_loss=0.2982, pruned_loss=0.09002, over 21919.00 frames. 
], tot_loss[loss=0.282, simple_loss=0.3487, pruned_loss=0.1076, over 4261928.01 frames. ], batch size: 373, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:20:20,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-19 09:20:33,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-19 09:20:58,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=431214.0, ans=0.0 2023-06-19 09:20:59,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.993e+02 3.612e+02 4.329e+02 8.050e+02, threshold=7.223e+02, percent-clipped=3.0 2023-06-19 09:20:59,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=431214.0, ans=0.0 2023-06-19 09:21:37,042 INFO [train.py:996] (3/4) Epoch 3, batch 10900, loss[loss=0.2279, simple_loss=0.2863, pruned_loss=0.08481, over 21293.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3405, pruned_loss=0.1047, over 4264429.16 frames. ], batch size: 177, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:22:22,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431454.0, ans=0.1 2023-06-19 09:23:14,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=431574.0, ans=0.0 2023-06-19 09:23:17,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-19 09:23:22,708 INFO [train.py:996] (3/4) Epoch 3, batch 10950, loss[loss=0.2426, simple_loss=0.3012, pruned_loss=0.09201, over 21551.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3336, pruned_loss=0.1018, over 4265890.47 frames. 
], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:23:25,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431634.0, ans=0.125 2023-06-19 09:23:29,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431634.0, ans=0.1 2023-06-19 09:24:06,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=431754.0, ans=0.125 2023-06-19 09:24:09,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=431754.0, ans=0.125 2023-06-19 09:24:18,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=431754.0, ans=0.125 2023-06-19 09:24:29,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=431814.0, ans=0.0 2023-06-19 09:24:30,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.913e+02 3.689e+02 4.516e+02 9.090e+02, threshold=7.379e+02, percent-clipped=2.0 2023-06-19 09:24:54,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=431874.0, ans=0.125 2023-06-19 09:24:57,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=431874.0, ans=0.125 2023-06-19 09:25:07,348 INFO [train.py:996] (3/4) Epoch 3, batch 11000, loss[loss=0.2637, simple_loss=0.3226, pruned_loss=0.1024, over 21808.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3347, pruned_loss=0.104, over 4270794.99 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:26:06,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432054.0, ans=0.125 2023-06-19 09:26:13,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=432114.0, ans=0.2 2023-06-19 09:26:17,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=432114.0, ans=0.125 2023-06-19 09:26:53,500 INFO [train.py:996] (3/4) Epoch 3, batch 11050, loss[loss=0.2753, simple_loss=0.3233, pruned_loss=0.1136, over 21864.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3329, pruned_loss=0.1061, over 4276492.26 frames. ], batch size: 98, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:27:46,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-19 09:27:55,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.002e+02 3.534e+02 4.546e+02 1.059e+03, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 09:28:04,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=432414.0, ans=0.0 2023-06-19 09:28:26,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-19 09:28:36,738 INFO [train.py:996] (3/4) Epoch 3, batch 11100, loss[loss=0.3197, simple_loss=0.366, pruned_loss=0.1367, over 21524.00 frames. 
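The Whitening records compare a per-activation decorrelation statistic against a schedule-controlled limit. A standard statistic of this kind is E[lam^2] / E[lam]^2 over the eigenvalues lam of the feature covariance: it equals 1.0 when the covariance is proportional to the identity (perfectly white) and grows as variance concentrates in fewer directions. The sketch below assumes that form; whether it is the exact formula behind the logged metric is not established here:

import torch

def whitening_metric(feats: torch.Tensor) -> float:
    # feats: (num_frames, num_channels). Returns E[lam^2] / E[lam]^2 over
    # the eigenvalues of the feature covariance; 1.0 means perfectly white.
    feats = feats - feats.mean(dim=0, keepdim=True)
    cov = feats.t() @ feats / feats.shape[0]
    lam = torch.linalg.eigvalsh(cov)   # real eigenvalues of the symmetric cov
    return (lam.pow(2).mean() / lam.mean().pow(2)).item()

print(whitening_metric(torch.randn(4000, 256)))              # ~1.06: nearly white
print(whitening_metric(torch.randn(4000, 1).expand(4000, 256)))  # ~256: rank-1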
], tot_loss[loss=0.2721, simple_loss=0.3316, pruned_loss=0.1063, over 4276012.91 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:29:23,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=432654.0, ans=0.125 2023-06-19 09:30:23,544 INFO [train.py:996] (3/4) Epoch 3, batch 11150, loss[loss=0.2375, simple_loss=0.287, pruned_loss=0.09396, over 21475.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3285, pruned_loss=0.1053, over 4275684.71 frames. ], batch size: 195, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:30:37,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=432834.0, ans=0.125 2023-06-19 09:30:58,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=432894.0, ans=0.0 2023-06-19 09:31:34,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.953e+02 5.206e+02 1.006e+03, threshold=7.907e+02, percent-clipped=9.0 2023-06-19 09:31:38,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=433014.0, ans=0.125 2023-06-19 09:32:10,322 INFO [train.py:996] (3/4) Epoch 3, batch 11200, loss[loss=0.2229, simple_loss=0.3006, pruned_loss=0.07257, over 21353.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3273, pruned_loss=0.1042, over 4266578.92 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:32:37,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-19 09:32:51,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433194.0, ans=0.1 2023-06-19 09:33:11,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=433314.0, ans=0.125 2023-06-19 09:33:11,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=433314.0, ans=0.0 2023-06-19 09:33:22,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433314.0, ans=0.1 2023-06-19 09:33:25,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=433314.0, ans=0.0 2023-06-19 09:33:54,838 INFO [train.py:996] (3/4) Epoch 3, batch 11250, loss[loss=0.2606, simple_loss=0.3314, pruned_loss=0.09489, over 21724.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3272, pruned_loss=0.1046, over 4254582.00 frames. ], batch size: 333, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:34:39,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433554.0, ans=0.1 2023-06-19 09:34:44,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=433554.0, ans=0.0 2023-06-19 09:34:57,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. 
limit=15.0 2023-06-19 09:34:59,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=433614.0, ans=0.125 2023-06-19 09:35:00,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.899e+02 3.491e+02 4.112e+02 6.627e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-19 09:35:38,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=433674.0, ans=0.125 2023-06-19 09:35:41,854 INFO [train.py:996] (3/4) Epoch 3, batch 11300, loss[loss=0.2386, simple_loss=0.2983, pruned_loss=0.08947, over 21519.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3285, pruned_loss=0.1048, over 4260239.55 frames. ], batch size: 211, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:35:43,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=433734.0, ans=0.125 2023-06-19 09:36:35,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=433854.0, ans=0.125 2023-06-19 09:37:28,360 INFO [train.py:996] (3/4) Epoch 3, batch 11350, loss[loss=0.3521, simple_loss=0.4035, pruned_loss=0.1503, over 21810.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3308, pruned_loss=0.1037, over 4255749.43 frames. ], batch size: 118, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:38:04,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-19 09:38:11,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434154.0, ans=0.1 2023-06-19 09:38:25,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-19 09:38:35,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-19 09:38:45,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.245e+02 4.001e+02 4.934e+02 9.082e+02, threshold=8.002e+02, percent-clipped=8.0 2023-06-19 09:39:13,759 INFO [train.py:996] (3/4) Epoch 3, batch 11400, loss[loss=0.3085, simple_loss=0.3744, pruned_loss=0.1212, over 21279.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3376, pruned_loss=0.1077, over 4262453.66 frames. ], batch size: 549, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:39:16,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=15.0 2023-06-19 09:39:41,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=434394.0, ans=0.0 2023-06-19 09:39:51,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=434394.0, ans=0.0 2023-06-19 09:41:06,634 INFO [train.py:996] (3/4) Epoch 3, batch 11450, loss[loss=0.2786, simple_loss=0.3318, pruned_loss=0.1127, over 21304.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3377, pruned_loss=0.1062, over 4265371.86 frames. 
], batch size: 176, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:41:43,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=434694.0, ans=0.125 2023-06-19 09:42:13,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=434814.0, ans=0.0 2023-06-19 09:42:19,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.142e+02 3.588e+02 4.835e+02 7.937e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-19 09:42:27,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=434814.0, ans=0.125 2023-06-19 09:42:39,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=434874.0, ans=0.0 2023-06-19 09:42:51,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=8.0 2023-06-19 09:42:53,679 INFO [train.py:996] (3/4) Epoch 3, batch 11500, loss[loss=0.3447, simple_loss=0.4069, pruned_loss=0.1412, over 21492.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3412, pruned_loss=0.1078, over 4270386.93 frames. ], batch size: 508, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:44:00,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-19 09:44:46,153 INFO [train.py:996] (3/4) Epoch 3, batch 11550, loss[loss=0.3488, simple_loss=0.4647, pruned_loss=0.1164, over 21204.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3486, pruned_loss=0.1083, over 4270545.13 frames. ], batch size: 548, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:44:50,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435234.0, ans=0.1 2023-06-19 09:45:02,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=435234.0, ans=0.125 2023-06-19 09:45:02,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-19 09:45:54,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.996e+02 3.693e+02 5.072e+02 8.592e+02, threshold=7.387e+02, percent-clipped=2.0 2023-06-19 09:46:38,534 INFO [train.py:996] (3/4) Epoch 3, batch 11600, loss[loss=0.3887, simple_loss=0.4738, pruned_loss=0.1517, over 21492.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3637, pruned_loss=0.1107, over 4263345.51 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:46:42,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=435534.0, ans=0.2 2023-06-19 09:47:01,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-19 09:47:05,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=15.0 2023-06-19 09:47:15,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=435594.0, ans=0.125 2023-06-19 09:47:16,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0 2023-06-19 09:47:44,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=435714.0, ans=0.125 2023-06-19 09:47:48,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=435714.0, ans=0.2 2023-06-19 09:47:53,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.74 vs. limit=22.5 2023-06-19 09:48:24,147 INFO [train.py:996] (3/4) Epoch 3, batch 11650, loss[loss=0.3224, simple_loss=0.3774, pruned_loss=0.1337, over 21723.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3679, pruned_loss=0.1107, over 4261476.62 frames. ], batch size: 351, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:48:47,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=435894.0, ans=0.2 2023-06-19 09:49:32,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.087e+02 3.546e+02 4.429e+02 8.703e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-19 09:49:52,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=436074.0, ans=0.2 2023-06-19 09:50:10,362 INFO [train.py:996] (3/4) Epoch 3, batch 11700, loss[loss=0.2264, simple_loss=0.2945, pruned_loss=0.07913, over 15200.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3591, pruned_loss=0.1103, over 4256353.85 frames. ], batch size: 61, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:50:12,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=436134.0, ans=0.0 2023-06-19 09:50:54,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=436254.0, ans=0.05 2023-06-19 09:51:56,341 INFO [train.py:996] (3/4) Epoch 3, batch 11750, loss[loss=0.2425, simple_loss=0.2929, pruned_loss=0.09606, over 21597.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.349, pruned_loss=0.1098, over 4265003.99 frames. 
], batch size: 247, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:52:13,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=436434.0, ans=0.125 2023-06-19 09:52:36,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=436554.0, ans=0.04949747468305833 2023-06-19 09:52:53,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=436554.0, ans=0.0 2023-06-19 09:52:58,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=436614.0, ans=0.0 2023-06-19 09:53:03,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.275e+02 3.279e+02 3.653e+02 4.613e+02 8.659e+02, threshold=7.305e+02, percent-clipped=4.0 2023-06-19 09:53:18,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=436674.0, ans=0.125 2023-06-19 09:53:27,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=436674.0, ans=0.015 2023-06-19 09:53:33,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=436674.0, ans=0.0 2023-06-19 09:53:41,259 INFO [train.py:996] (3/4) Epoch 3, batch 11800, loss[loss=0.252, simple_loss=0.3361, pruned_loss=0.084, over 21574.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3502, pruned_loss=0.1119, over 4263941.20 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:55:06,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=436974.0, ans=0.0 2023-06-19 09:55:18,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436974.0, ans=0.125 2023-06-19 09:55:28,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-19 09:55:34,573 INFO [train.py:996] (3/4) Epoch 3, batch 11850, loss[loss=0.2778, simple_loss=0.3769, pruned_loss=0.08934, over 20855.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3519, pruned_loss=0.111, over 4265199.95 frames. ], batch size: 609, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:55:37,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.48 vs. limit=22.5 2023-06-19 09:55:52,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-19 09:55:53,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437094.0, ans=0.1 2023-06-19 09:56:10,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=437094.0, ans=0.125 2023-06-19 09:56:44,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.974e+02 3.438e+02 3.994e+02 6.906e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 09:57:22,855 INFO [train.py:996] (3/4) Epoch 3, batch 11900, loss[loss=0.2257, simple_loss=0.3072, pruned_loss=0.07214, over 21582.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3521, pruned_loss=0.1078, over 4264332.14 frames. 
], batch size: 263, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:57:43,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=437394.0, ans=0.125 2023-06-19 09:58:03,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437454.0, ans=0.1 2023-06-19 09:59:11,120 INFO [train.py:996] (3/4) Epoch 3, batch 11950, loss[loss=0.233, simple_loss=0.3101, pruned_loss=0.07788, over 21408.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3515, pruned_loss=0.1035, over 4269562.53 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 09:59:41,785 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:00:29,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.768e+02 3.443e+02 4.738e+02 7.856e+02, threshold=6.886e+02, percent-clipped=6.0 2023-06-19 10:00:45,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=437874.0, ans=0.0 2023-06-19 10:00:45,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=437874.0, ans=0.125 2023-06-19 10:00:56,209 INFO [train.py:996] (3/4) Epoch 3, batch 12000, loss[loss=0.3006, simple_loss=0.3457, pruned_loss=0.1278, over 21545.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3478, pruned_loss=0.1026, over 4261831.37 frames. ], batch size: 414, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:00:56,209 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 10:01:15,356 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.279, simple_loss=0.3755, pruned_loss=0.09124, over 1796401.00 frames. 2023-06-19 10:01:15,357 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 10:01:27,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=437934.0, ans=0.125 2023-06-19 10:01:29,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=437934.0, ans=0.125 2023-06-19 10:01:43,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=437994.0, ans=0.125 2023-06-19 10:02:24,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=22.5 2023-06-19 10:02:25,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438114.0, ans=0.125 2023-06-19 10:02:35,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=438114.0, ans=0.125 2023-06-19 10:02:38,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=438174.0, ans=0.125 2023-06-19 10:03:02,152 INFO [train.py:996] (3/4) Epoch 3, batch 12050, loss[loss=0.2718, simple_loss=0.3305, pruned_loss=0.1066, over 21296.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3444, pruned_loss=0.1054, over 4264237.96 frames. 
], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:03:04,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=438234.0, ans=0.2 2023-06-19 10:03:41,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=438294.0, ans=15.0 2023-06-19 10:03:59,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438354.0, ans=0.125 2023-06-19 10:04:18,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.247e+02 3.603e+02 4.114e+02 7.694e+02, threshold=7.207e+02, percent-clipped=1.0 2023-06-19 10:04:49,646 INFO [train.py:996] (3/4) Epoch 3, batch 12100, loss[loss=0.278, simple_loss=0.3442, pruned_loss=0.1059, over 21628.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3503, pruned_loss=0.1109, over 4271823.91 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:04:50,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=438534.0, ans=0.125 2023-06-19 10:04:55,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=438534.0, ans=0.125 2023-06-19 10:05:01,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=438534.0, ans=0.125 2023-06-19 10:05:35,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438594.0, ans=0.125 2023-06-19 10:06:27,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438774.0, ans=0.1 2023-06-19 10:06:38,094 INFO [train.py:996] (3/4) Epoch 3, batch 12150, loss[loss=0.2541, simple_loss=0.3461, pruned_loss=0.08109, over 21853.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3514, pruned_loss=0.1094, over 4267164.62 frames. ], batch size: 316, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:06:45,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438834.0, ans=0.1 2023-06-19 10:07:11,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=438894.0, ans=0.0 2023-06-19 10:07:55,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.586e+02 4.536e+02 5.596e+02 8.610e+02, threshold=9.073e+02, percent-clipped=5.0 2023-06-19 10:08:22,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=439074.0, ans=0.0 2023-06-19 10:08:33,685 INFO [train.py:996] (3/4) Epoch 3, batch 12200, loss[loss=0.2485, simple_loss=0.2992, pruned_loss=0.09884, over 21496.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3478, pruned_loss=0.1089, over 4261687.28 frames. 
], batch size: 230, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:08:37,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439134.0, ans=0.1 2023-06-19 10:09:23,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=439254.0, ans=0.0 2023-06-19 10:09:25,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=439254.0, ans=0.125 2023-06-19 10:10:12,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-19 10:10:12,827 INFO [train.py:996] (3/4) Epoch 3, batch 12250, loss[loss=0.2565, simple_loss=0.3437, pruned_loss=0.08461, over 21216.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3403, pruned_loss=0.1051, over 4249139.35 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:10:13,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-19 10:10:37,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=439494.0, ans=0.125 2023-06-19 10:10:56,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-19 10:11:22,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 2.671e+02 3.353e+02 4.398e+02 1.093e+03, threshold=6.707e+02, percent-clipped=1.0 2023-06-19 10:11:54,852 INFO [train.py:996] (3/4) Epoch 3, batch 12300, loss[loss=0.2746, simple_loss=0.3547, pruned_loss=0.09728, over 21790.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3303, pruned_loss=0.09673, over 4259780.19 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:12:27,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439794.0, ans=0.1 2023-06-19 10:12:46,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=439854.0, ans=0.0 2023-06-19 10:12:49,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=439854.0, ans=0.09899494936611666 2023-06-19 10:12:59,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=439914.0, ans=0.2 2023-06-19 10:13:39,832 INFO [train.py:996] (3/4) Epoch 3, batch 12350, loss[loss=0.3371, simple_loss=0.3913, pruned_loss=0.1415, over 21774.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3357, pruned_loss=0.09823, over 4259543.47 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:13:49,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=15.0 2023-06-19 10:13:51,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440034.0, ans=0.125 2023-06-19 10:14:02,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=440034.0, ans=0.125 2023-06-19 10:14:22,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-19 10:14:26,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=12.0 2023-06-19 10:14:37,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=440154.0, ans=0.125 2023-06-19 10:14:53,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.837e+02 3.590e+02 4.995e+02 8.694e+02, threshold=7.180e+02, percent-clipped=5.0 2023-06-19 10:15:30,246 INFO [train.py:996] (3/4) Epoch 3, batch 12400, loss[loss=0.2938, simple_loss=0.3475, pruned_loss=0.12, over 21508.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3406, pruned_loss=0.1047, over 4275773.38 frames. ], batch size: 131, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:16:24,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440454.0, ans=0.0 2023-06-19 10:16:27,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=440454.0, ans=0.1 2023-06-19 10:17:23,787 INFO [train.py:996] (3/4) Epoch 3, batch 12450, loss[loss=0.2948, simple_loss=0.358, pruned_loss=0.1158, over 21792.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3463, pruned_loss=0.1096, over 4281981.04 frames. ], batch size: 247, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:17:50,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=440694.0, ans=0.0 2023-06-19 10:18:06,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=440754.0, ans=0.035 2023-06-19 10:18:13,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-19 10:18:35,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.264e+02 3.825e+02 4.731e+02 7.932e+02, threshold=7.651e+02, percent-clipped=1.0 2023-06-19 10:18:35,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=440814.0, ans=0.0 2023-06-19 10:18:44,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440814.0, ans=0.0 2023-06-19 10:19:12,375 INFO [train.py:996] (3/4) Epoch 3, batch 12500, loss[loss=0.404, simple_loss=0.4753, pruned_loss=0.1663, over 21509.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3563, pruned_loss=0.1134, over 4281290.04 frames. 
], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:19:36,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=440994.0, ans=0.0 2023-06-19 10:19:45,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440994.0, ans=0.125 2023-06-19 10:19:53,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441054.0, ans=0.125 2023-06-19 10:20:29,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-19 10:21:05,739 INFO [train.py:996] (3/4) Epoch 3, batch 12550, loss[loss=0.2973, simple_loss=0.3635, pruned_loss=0.1156, over 21716.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3607, pruned_loss=0.1157, over 4278201.79 frames. ], batch size: 298, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:21:08,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441234.0, ans=0.1 2023-06-19 10:21:21,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=441294.0, ans=0.2 2023-06-19 10:21:23,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-19 10:22:03,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=441354.0, ans=0.07 2023-06-19 10:22:22,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.217e+02 3.804e+02 4.730e+02 9.875e+02, threshold=7.608e+02, percent-clipped=0.0 2023-06-19 10:22:52,395 INFO [train.py:996] (3/4) Epoch 3, batch 12600, loss[loss=0.2274, simple_loss=0.3029, pruned_loss=0.0759, over 21771.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3568, pruned_loss=0.1123, over 4268769.44 frames. ], batch size: 282, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:23:06,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=441534.0, ans=0.125 2023-06-19 10:23:12,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=441594.0, ans=0.0 2023-06-19 10:23:45,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=441654.0, ans=0.125 2023-06-19 10:23:51,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-19 10:24:10,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=441714.0, ans=0.0 2023-06-19 10:24:29,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=441774.0, ans=0.125 2023-06-19 10:24:35,190 INFO [train.py:996] (3/4) Epoch 3, batch 12650, loss[loss=0.2841, simple_loss=0.3368, pruned_loss=0.1158, over 21527.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3463, pruned_loss=0.1057, over 4276539.62 frames. 
], batch size: 131, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:24:50,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=441894.0, ans=0.0 2023-06-19 10:25:11,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=441894.0, ans=0.2 2023-06-19 10:25:47,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.848e+02 3.430e+02 4.434e+02 6.952e+02, threshold=6.860e+02, percent-clipped=1.0 2023-06-19 10:26:17,195 INFO [train.py:996] (3/4) Epoch 3, batch 12700, loss[loss=0.3126, simple_loss=0.4217, pruned_loss=0.1017, over 20809.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3466, pruned_loss=0.1082, over 4281655.77 frames. ], batch size: 608, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:26:24,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=442134.0, ans=0.125 2023-06-19 10:26:47,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=442194.0, ans=0.125 2023-06-19 10:27:46,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=442374.0, ans=0.2 2023-06-19 10:28:03,204 INFO [train.py:996] (3/4) Epoch 3, batch 12750, loss[loss=0.2178, simple_loss=0.2903, pruned_loss=0.0727, over 16482.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3486, pruned_loss=0.1079, over 4274864.40 frames. ], batch size: 60, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:28:19,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442434.0, ans=0.125 2023-06-19 10:28:23,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-19 10:29:16,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 3.021e+02 3.668e+02 4.359e+02 6.708e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 10:29:26,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=442614.0, ans=0.95 2023-06-19 10:29:45,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-19 10:29:46,122 INFO [train.py:996] (3/4) Epoch 3, batch 12800, loss[loss=0.3509, simple_loss=0.4289, pruned_loss=0.1364, over 19877.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3488, pruned_loss=0.1089, over 4277634.67 frames. 
], batch size: 704, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:30:09,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=442734.0, ans=0.125 2023-06-19 10:30:12,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442794.0, ans=0.125 2023-06-19 10:30:49,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442914.0, ans=0.0 2023-06-19 10:31:05,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=442914.0, ans=0.125 2023-06-19 10:31:40,098 INFO [train.py:996] (3/4) Epoch 3, batch 12850, loss[loss=0.3093, simple_loss=0.3577, pruned_loss=0.1305, over 19979.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3524, pruned_loss=0.1114, over 4276091.88 frames. ], batch size: 703, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:32:04,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=443094.0, ans=0.125 2023-06-19 10:32:07,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=443094.0, ans=0.125 2023-06-19 10:32:14,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=443094.0, ans=0.2 2023-06-19 10:32:19,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443154.0, ans=0.125 2023-06-19 10:32:47,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=443214.0, ans=0.025 2023-06-19 10:32:48,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.077e+02 3.595e+02 4.462e+02 6.932e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-19 10:33:13,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=443274.0, ans=0.125 2023-06-19 10:33:23,079 INFO [train.py:996] (3/4) Epoch 3, batch 12900, loss[loss=0.2865, simple_loss=0.3626, pruned_loss=0.1052, over 21694.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3519, pruned_loss=0.1077, over 4273132.11 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:33:36,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-19 10:34:56,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-06-19 10:35:09,438 INFO [train.py:996] (3/4) Epoch 3, batch 12950, loss[loss=0.3397, simple_loss=0.3954, pruned_loss=0.142, over 21416.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3495, pruned_loss=0.1055, over 4272199.74 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:35:18,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=443634.0, ans=0.2 2023-06-19 10:35:37,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. 
limit=15.0 2023-06-19 10:35:38,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=443694.0, ans=0.0 2023-06-19 10:36:23,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.936e+02 3.406e+02 4.166e+02 8.200e+02, threshold=6.811e+02, percent-clipped=3.0 2023-06-19 10:36:51,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=443934.0, ans=0.0 2023-06-19 10:36:52,651 INFO [train.py:996] (3/4) Epoch 3, batch 13000, loss[loss=0.1819, simple_loss=0.2418, pruned_loss=0.06101, over 17078.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3491, pruned_loss=0.1044, over 4263852.94 frames. ], batch size: 63, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:37:25,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=443994.0, ans=0.125 2023-06-19 10:38:14,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=444174.0, ans=0.0 2023-06-19 10:38:35,606 INFO [train.py:996] (3/4) Epoch 3, batch 13050, loss[loss=0.2639, simple_loss=0.3222, pruned_loss=0.1028, over 21562.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3451, pruned_loss=0.1021, over 4267540.04 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:38:47,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=444234.0, ans=0.125 2023-06-19 10:39:01,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444294.0, ans=0.1 2023-06-19 10:39:03,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444294.0, ans=0.1 2023-06-19 10:39:03,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=444294.0, ans=0.125 2023-06-19 10:39:16,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=444354.0, ans=0.2 2023-06-19 10:39:19,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=444354.0, ans=0.125 2023-06-19 10:39:23,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=444354.0, ans=0.0 2023-06-19 10:39:33,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=22.5 2023-06-19 10:39:49,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.662e+02 3.344e+02 3.864e+02 6.973e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-19 10:40:12,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444474.0, ans=0.125 2023-06-19 10:40:20,621 INFO [train.py:996] (3/4) Epoch 3, batch 13100, loss[loss=0.2988, simple_loss=0.3523, pruned_loss=0.1226, over 21877.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3458, pruned_loss=0.103, over 4276579.63 frames. 
], batch size: 107, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:40:22,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=444534.0, ans=0.0 2023-06-19 10:40:22,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444534.0, ans=0.125 2023-06-19 10:40:27,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=444534.0, ans=0.07 2023-06-19 10:40:41,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=444594.0, ans=0.2 2023-06-19 10:40:56,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444594.0, ans=0.125 2023-06-19 10:41:19,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=444654.0, ans=0.125 2023-06-19 10:41:40,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=444714.0, ans=0.125 2023-06-19 10:41:45,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.19 vs. limit=15.0 2023-06-19 10:42:09,719 INFO [train.py:996] (3/4) Epoch 3, batch 13150, loss[loss=0.2681, simple_loss=0.3218, pruned_loss=0.1072, over 21773.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3484, pruned_loss=0.1075, over 4274664.54 frames. ], batch size: 124, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:42:36,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=444894.0, ans=0.0 2023-06-19 10:43:24,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 3.199e+02 3.918e+02 5.152e+02 8.520e+02, threshold=7.837e+02, percent-clipped=11.0 2023-06-19 10:43:45,780 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:43:50,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=445074.0, ans=0.125 2023-06-19 10:43:55,530 INFO [train.py:996] (3/4) Epoch 3, batch 13200, loss[loss=0.2709, simple_loss=0.3379, pruned_loss=0.102, over 21974.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3456, pruned_loss=0.1068, over 4280465.92 frames. ], batch size: 317, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:44:47,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=445254.0, ans=0.125 2023-06-19 10:45:01,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-19 10:45:35,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=445374.0, ans=0.0 2023-06-19 10:45:38,762 INFO [train.py:996] (3/4) Epoch 3, batch 13250, loss[loss=0.2657, simple_loss=0.3188, pruned_loss=0.1063, over 21450.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3464, pruned_loss=0.1089, over 4280473.65 frames. 
], batch size: 548, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:45:54,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-19 10:46:40,400 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:46:57,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=445614.0, ans=0.125 2023-06-19 10:46:58,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=445614.0, ans=0.04949747468305833 2023-06-19 10:46:59,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.966e+02 3.440e+02 4.161e+02 7.151e+02, threshold=6.880e+02, percent-clipped=0.0 2023-06-19 10:47:29,753 INFO [train.py:996] (3/4) Epoch 3, batch 13300, loss[loss=0.293, simple_loss=0.3595, pruned_loss=0.1132, over 21313.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3506, pruned_loss=0.1097, over 4286468.46 frames. ], batch size: 143, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:47:32,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-19 10:48:02,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445794.0, ans=0.1 2023-06-19 10:48:40,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=445914.0, ans=0.0 2023-06-19 10:49:13,538 INFO [train.py:996] (3/4) Epoch 3, batch 13350, loss[loss=0.3125, simple_loss=0.3747, pruned_loss=0.1252, over 21701.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3548, pruned_loss=0.1122, over 4281768.18 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:49:19,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.15 vs. limit=22.5 2023-06-19 10:50:06,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446154.0, ans=0.1 2023-06-19 10:50:09,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=15.0 2023-06-19 10:50:20,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 3.225e+02 3.868e+02 4.526e+02 7.710e+02, threshold=7.735e+02, percent-clipped=4.0 2023-06-19 10:50:48,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=446274.0, ans=0.125 2023-06-19 10:50:48,308 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:50:55,968 INFO [train.py:996] (3/4) Epoch 3, batch 13400, loss[loss=0.2943, simple_loss=0.3443, pruned_loss=0.1222, over 21451.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3557, pruned_loss=0.1136, over 4280041.94 frames. 
], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:51:04,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=446334.0, ans=0.125 2023-06-19 10:51:04,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=446334.0, ans=0.2 2023-06-19 10:51:39,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=446454.0, ans=0.125 2023-06-19 10:52:46,470 INFO [train.py:996] (3/4) Epoch 3, batch 13450, loss[loss=0.2652, simple_loss=0.3119, pruned_loss=0.1093, over 21682.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3576, pruned_loss=0.1169, over 4287542.43 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:53:56,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.091e+02 3.617e+02 4.599e+02 7.916e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-19 10:54:08,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=446874.0, ans=0.125 2023-06-19 10:54:31,171 INFO [train.py:996] (3/4) Epoch 3, batch 13500, loss[loss=0.2574, simple_loss=0.3147, pruned_loss=0.1001, over 21253.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3469, pruned_loss=0.1128, over 4284067.26 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:56:04,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-19 10:56:10,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=447174.0, ans=0.2 2023-06-19 10:56:14,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-19 10:56:14,911 INFO [train.py:996] (3/4) Epoch 3, batch 13550, loss[loss=0.2539, simple_loss=0.3265, pruned_loss=0.09065, over 21808.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3505, pruned_loss=0.1109, over 4284408.68 frames. ], batch size: 124, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:56:56,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=447354.0, ans=0.0 2023-06-19 10:57:31,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447414.0, ans=0.1 2023-06-19 10:57:36,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.584e+02 4.667e+02 1.065e+03, threshold=7.167e+02, percent-clipped=1.0 2023-06-19 10:57:49,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=447474.0, ans=0.125 2023-06-19 10:58:03,805 INFO [train.py:996] (3/4) Epoch 3, batch 13600, loss[loss=0.3052, simple_loss=0.3895, pruned_loss=0.1105, over 20727.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3537, pruned_loss=0.1127, over 4290416.24 frames. 
], batch size: 607, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:58:07,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=447534.0, ans=0.5 2023-06-19 10:58:43,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=22.5 2023-06-19 10:58:47,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=447654.0, ans=0.2 2023-06-19 10:59:02,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-19 10:59:43,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-19 10:59:45,698 INFO [train.py:996] (3/4) Epoch 3, batch 13650, loss[loss=0.2502, simple_loss=0.3034, pruned_loss=0.09847, over 15460.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3482, pruned_loss=0.1084, over 4270930.64 frames. ], batch size: 62, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:59:51,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=447834.0, ans=0.0 2023-06-19 10:59:59,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-19 11:00:41,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=447954.0, ans=0.125 2023-06-19 11:01:01,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 3.087e+02 4.228e+02 5.339e+02 1.090e+03, threshold=8.456e+02, percent-clipped=10.0 2023-06-19 11:01:28,409 INFO [train.py:996] (3/4) Epoch 3, batch 13700, loss[loss=0.2445, simple_loss=0.3188, pruned_loss=0.08509, over 21853.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3419, pruned_loss=0.1079, over 4267419.39 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:01:37,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448134.0, ans=0.1 2023-06-19 11:01:40,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=448134.0, ans=0.0 2023-06-19 11:02:31,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-19 11:02:34,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0 2023-06-19 11:02:55,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=448374.0, ans=0.0 2023-06-19 11:03:11,599 INFO [train.py:996] (3/4) Epoch 3, batch 13750, loss[loss=0.2538, simple_loss=0.3148, pruned_loss=0.09635, over 21436.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3365, pruned_loss=0.1052, over 4260821.46 frames. 
], batch size: 212, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:03:47,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=448494.0, ans=0.125 2023-06-19 11:04:05,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.85 vs. limit=15.0 2023-06-19 11:04:06,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448554.0, ans=0.1 2023-06-19 11:04:08,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=448554.0, ans=0.125 2023-06-19 11:04:34,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.276e+02 4.043e+02 5.146e+02 9.090e+02, threshold=8.085e+02, percent-clipped=5.0 2023-06-19 11:04:56,201 INFO [train.py:996] (3/4) Epoch 3, batch 13800, loss[loss=0.2951, simple_loss=0.3889, pruned_loss=0.1007, over 21767.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3447, pruned_loss=0.1054, over 4258237.96 frames. ], batch size: 332, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:05:37,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=448794.0, ans=0.0 2023-06-19 11:05:46,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2023-06-19 11:06:03,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-19 11:06:11,847 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:06:38,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=448974.0, ans=0.2 2023-06-19 11:06:50,091 INFO [train.py:996] (3/4) Epoch 3, batch 13850, loss[loss=0.4277, simple_loss=0.4608, pruned_loss=0.1973, over 21491.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3511, pruned_loss=0.1076, over 4262547.14 frames. ], batch size: 508, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:06:53,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449034.0, ans=0.1 2023-06-19 11:06:59,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=449034.0, ans=0.0 2023-06-19 11:07:32,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=449154.0, ans=0.125 2023-06-19 11:07:33,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=449154.0, ans=0.0 2023-06-19 11:07:45,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=449154.0, ans=0.0 2023-06-19 11:08:01,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.934e+02 3.600e+02 4.326e+02 8.652e+02, threshold=7.199e+02, percent-clipped=1.0 2023-06-19 11:08:32,510 INFO [train.py:996] (3/4) Epoch 3, batch 13900, loss[loss=0.3233, simple_loss=0.3816, pruned_loss=0.1325, over 21824.00 frames. 
], tot_loss[loss=0.2872, simple_loss=0.3531, pruned_loss=0.1107, over 4266475.17 frames. ], batch size: 112, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:08:59,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449394.0, ans=0.1 2023-06-19 11:09:01,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=449394.0, ans=0.125 2023-06-19 11:09:53,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=449574.0, ans=0.0 2023-06-19 11:10:15,203 INFO [train.py:996] (3/4) Epoch 3, batch 13950, loss[loss=0.3313, simple_loss=0.3821, pruned_loss=0.1403, over 21484.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3552, pruned_loss=0.1136, over 4270287.76 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:10:48,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=449694.0, ans=0.04949747468305833 2023-06-19 11:10:53,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=449694.0, ans=0.125 2023-06-19 11:11:26,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=449814.0, ans=0.0 2023-06-19 11:11:30,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 3.017e+02 3.528e+02 4.292e+02 7.550e+02, threshold=7.057e+02, percent-clipped=1.0 2023-06-19 11:11:33,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=449874.0, ans=0.2 2023-06-19 11:11:35,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449874.0, ans=0.1 2023-06-19 11:11:55,736 INFO [train.py:996] (3/4) Epoch 3, batch 14000, loss[loss=0.2151, simple_loss=0.3077, pruned_loss=0.0613, over 21381.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3509, pruned_loss=0.1098, over 4268069.15 frames. ], batch size: 211, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:12:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449994.0, ans=0.1 2023-06-19 11:12:40,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=450054.0, ans=0.0 2023-06-19 11:13:03,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=450114.0, ans=15.0 2023-06-19 11:13:36,919 INFO [train.py:996] (3/4) Epoch 3, batch 14050, loss[loss=0.25, simple_loss=0.3606, pruned_loss=0.06965, over 19735.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3435, pruned_loss=0.1035, over 4269841.07 frames. 
], batch size: 702, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:14:06,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=450294.0, ans=10.0 2023-06-19 11:14:18,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:14:21,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=450354.0, ans=0.0 2023-06-19 11:14:54,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.755e+02 3.797e+02 5.310e+02 9.461e+02, threshold=7.595e+02, percent-clipped=8.0 2023-06-19 11:15:24,576 INFO [train.py:996] (3/4) Epoch 3, batch 14100, loss[loss=0.2917, simple_loss=0.3368, pruned_loss=0.1233, over 21557.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3388, pruned_loss=0.1044, over 4272523.35 frames. ], batch size: 391, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:15:47,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-19 11:16:01,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-19 11:16:20,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=450714.0, ans=0.2 2023-06-19 11:16:28,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=450714.0, ans=0.125 2023-06-19 11:16:36,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=450774.0, ans=0.0 2023-06-19 11:16:39,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=450774.0, ans=0.2 2023-06-19 11:16:58,759 INFO [train.py:996] (3/4) Epoch 3, batch 14150, loss[loss=0.2806, simple_loss=0.3652, pruned_loss=0.09798, over 21261.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3428, pruned_loss=0.1056, over 4273349.16 frames. ], batch size: 549, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:17:05,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=450834.0, ans=0.125 2023-06-19 11:18:14,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.701e+02 3.295e+02 4.437e+02 7.217e+02, threshold=6.589e+02, percent-clipped=0.0 2023-06-19 11:18:19,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=451074.0, ans=10.0 2023-06-19 11:18:38,233 INFO [train.py:996] (3/4) Epoch 3, batch 14200, loss[loss=0.2629, simple_loss=0.3263, pruned_loss=0.09972, over 21860.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3406, pruned_loss=0.1039, over 4276900.76 frames. 
], batch size: 371, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:19:36,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=451314.0, ans=15.0 2023-06-19 11:19:57,483 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:20:20,550 INFO [train.py:996] (3/4) Epoch 3, batch 14250, loss[loss=0.2333, simple_loss=0.3099, pruned_loss=0.07835, over 21708.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3348, pruned_loss=0.1036, over 4272556.52 frames. ], batch size: 298, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:21:39,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.808e+02 3.716e+02 4.608e+02 1.130e+03, threshold=7.432e+02, percent-clipped=7.0 2023-06-19 11:21:59,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=451674.0, ans=0.04949747468305833 2023-06-19 11:22:04,556 INFO [train.py:996] (3/4) Epoch 3, batch 14300, loss[loss=0.3317, simple_loss=0.4154, pruned_loss=0.124, over 21827.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3366, pruned_loss=0.1037, over 4273144.30 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:23:04,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-19 11:23:30,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=451974.0, ans=0.0 2023-06-19 11:23:46,798 INFO [train.py:996] (3/4) Epoch 3, batch 14350, loss[loss=0.263, simple_loss=0.3206, pruned_loss=0.1028, over 21413.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3405, pruned_loss=0.1032, over 4266964.79 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:23:57,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.97 vs. limit=10.0 2023-06-19 11:25:04,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 3.074e+02 3.738e+02 4.858e+02 1.364e+03, threshold=7.476e+02, percent-clipped=9.0 2023-06-19 11:25:20,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=452274.0, ans=0.0 2023-06-19 11:25:25,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=452274.0, ans=0.0 2023-06-19 11:25:28,322 INFO [train.py:996] (3/4) Epoch 3, batch 14400, loss[loss=0.251, simple_loss=0.3083, pruned_loss=0.09685, over 21518.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3385, pruned_loss=0.1039, over 4272867.79 frames. ], batch size: 195, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:26:12,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452454.0, ans=0.1 2023-06-19 11:26:14,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. 
limit=12.0 2023-06-19 11:26:17,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=452454.0, ans=0.2 2023-06-19 11:26:23,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=452454.0, ans=0.5 2023-06-19 11:27:09,067 INFO [train.py:996] (3/4) Epoch 3, batch 14450, loss[loss=0.2461, simple_loss=0.3053, pruned_loss=0.09339, over 21755.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3331, pruned_loss=0.1044, over 4272804.20 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:27:14,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-19 11:27:19,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-19 11:27:26,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=452694.0, ans=0.2 2023-06-19 11:27:34,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-19 11:28:08,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=452814.0, ans=0.0 2023-06-19 11:28:23,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=452814.0, ans=0.125 2023-06-19 11:28:26,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.178e+02 3.835e+02 4.854e+02 8.477e+02, threshold=7.671e+02, percent-clipped=4.0 2023-06-19 11:28:46,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=452874.0, ans=0.0 2023-06-19 11:28:51,103 INFO [train.py:996] (3/4) Epoch 3, batch 14500, loss[loss=0.2694, simple_loss=0.342, pruned_loss=0.09844, over 21233.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3317, pruned_loss=0.1046, over 4270045.94 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:29:19,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=452994.0, ans=0.125 2023-06-19 11:29:59,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-19 11:30:19,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-19 11:30:33,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=453234.0, ans=0.035 2023-06-19 11:30:34,330 INFO [train.py:996] (3/4) Epoch 3, batch 14550, loss[loss=0.3108, simple_loss=0.369, pruned_loss=0.1263, over 21919.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3387, pruned_loss=0.1072, over 4262973.91 frames. 
], batch size: 316, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:31:04,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=453294.0, ans=0.2 2023-06-19 11:31:54,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=453414.0, ans=0.2 2023-06-19 11:31:57,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.259e+02 4.086e+02 5.797e+02 9.548e+02, threshold=8.171e+02, percent-clipped=6.0 2023-06-19 11:32:00,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453474.0, ans=0.1 2023-06-19 11:32:09,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=453474.0, ans=0.0 2023-06-19 11:32:16,940 INFO [train.py:996] (3/4) Epoch 3, batch 14600, loss[loss=0.3223, simple_loss=0.404, pruned_loss=0.1203, over 21666.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3475, pruned_loss=0.1116, over 4268951.19 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:32:22,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=453534.0, ans=0.125 2023-06-19 11:32:41,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=453594.0, ans=0.125 2023-06-19 11:32:56,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=453594.0, ans=0.0 2023-06-19 11:33:58,763 INFO [train.py:996] (3/4) Epoch 3, batch 14650, loss[loss=0.192, simple_loss=0.2579, pruned_loss=0.06303, over 16148.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3473, pruned_loss=0.11, over 4270323.02 frames. ], batch size: 61, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:34:41,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-19 11:34:54,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=453954.0, ans=0.07 2023-06-19 11:35:07,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=454014.0, ans=0.0 2023-06-19 11:35:24,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 2.524e+02 3.223e+02 4.029e+02 6.900e+02, threshold=6.446e+02, percent-clipped=0.0 2023-06-19 11:35:34,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=15.0 2023-06-19 11:35:50,384 INFO [train.py:996] (3/4) Epoch 3, batch 14700, loss[loss=0.3032, simple_loss=0.3897, pruned_loss=0.1084, over 21310.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3429, pruned_loss=0.1043, over 4262724.34 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:36:13,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=454194.0, ans=0.0 2023-06-19 11:36:23,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. 
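limit=15.0

Each optim.py:471 entry summarizes the recent distribution of total gradient norms (min, quartiles, max), the clipping threshold in force, and the percentage of recent batches that were clipped. The sketch below shows one way such statistics might be maintained, deriving the threshold as Clipping_scale times a window statistic; the window size and the exact threshold rule are assumptions, not the optimizer's actual code.

```python
# Hedged sketch of the rolling grad-norm statistics suggested by the
# optim.py lines: track recent total gradient norms, report quartiles,
# and clip against scale * median. Window size and the threshold rule
# are illustrative assumptions.
from collections import deque
import torch

class GradNormClipper:
    def __init__(self, window: int = 200, clipping_scale: float = 2.0):
        self.norms = deque(maxlen=window)
        self.clipping_scale = clipping_scale
        self.clipped = 0
        self.steps = 0

    def step(self, model: torch.nn.Module) -> None:
        grads = [p.grad.detach().norm() for p in model.parameters()
                 if p.grad is not None]
        total = torch.stack(grads).norm().item()  # overall grad norm
        self.norms.append(total)
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # scale * median
        self.steps += 1
        if total > threshold:
            self.clipped += 1
            torch.nn.utils.clip_grad_norm_(model.parameters(), threshold)
        print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}"
              + f", percent-clipped={100.0 * self.clipped / self.steps:.1f}")
```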
2023-06-19 11:36:31,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=454254.0, ans=0.0 2023-06-19 11:36:34,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=454254.0, ans=0.125 2023-06-19 11:37:18,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-19 11:37:38,431 INFO [train.py:996] (3/4) Epoch 3, batch 14750, loss[loss=0.3148, simple_loss=0.379, pruned_loss=0.1253, over 21763.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3472, pruned_loss=0.1069, over 4246377.07 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:38:27,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=454554.0, ans=0.125 2023-06-19 11:38:28,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-19 11:38:32,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-19 11:38:48,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=454614.0, ans=0.125 2023-06-19 11:38:51,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 3.078e+02 3.855e+02 4.833e+02 8.936e+02, threshold=7.710e+02, percent-clipped=7.0 2023-06-19 11:39:20,979 INFO [train.py:996] (3/4) Epoch 3, batch 14800, loss[loss=0.2768, simple_loss=0.3241, pruned_loss=0.1147, over 21852.00 frames. ], tot_loss[loss=0.291, simple_loss=0.357, pruned_loss=0.1125, over 4251043.67 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:39:50,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. limit=10.0 2023-06-19 11:40:05,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=454854.0, ans=0.0 2023-06-19 11:40:28,301 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:40:39,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=454914.0, ans=0.1 2023-06-19 11:41:04,768 INFO [train.py:996] (3/4) Epoch 3, batch 14850, loss[loss=0.3868, simple_loss=0.4416, pruned_loss=0.166, over 21608.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.353, pruned_loss=0.113, over 4257481.63 frames. ], batch size: 441, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:41:40,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-19 11:41:45,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.07 vs.
limit=22.5 2023-06-19 11:41:46,793 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:41:47,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-19 11:42:19,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-19 11:42:23,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.130e+02 3.900e+02 4.785e+02 9.691e+02, threshold=7.799e+02, percent-clipped=2.0 2023-06-19 11:42:53,915 INFO [train.py:996] (3/4) Epoch 3, batch 14900, loss[loss=0.3114, simple_loss=0.3635, pruned_loss=0.1296, over 21444.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3571, pruned_loss=0.1141, over 4256235.75 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:42:55,864 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:43:27,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=455454.0, ans=0.125 2023-06-19 11:43:57,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=455514.0, ans=0.2 2023-06-19 11:44:17,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=455574.0, ans=0.2 2023-06-19 11:44:36,794 INFO [train.py:996] (3/4) Epoch 3, batch 14950, loss[loss=0.2748, simple_loss=0.3495, pruned_loss=0.1001, over 21628.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3573, pruned_loss=0.1134, over 4261863.04 frames. ], batch size: 263, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:44:37,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-19 11:44:44,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-06-19 11:44:47,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=455634.0, ans=0.125 2023-06-19 11:44:50,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455634.0, ans=0.0 2023-06-19 11:44:52,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=455694.0, ans=0.2 2023-06-19 11:45:07,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455694.0, ans=0.1 2023-06-19 11:45:34,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=455754.0, ans=0.125 2023-06-19 11:45:55,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.136e+02 3.778e+02 4.659e+02 7.505e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-19 11:46:20,050 INFO [train.py:996] (3/4) Epoch 3, batch 15000, loss[loss=0.2687, simple_loss=0.3304, pruned_loss=0.1035, over 21452.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3595, pruned_loss=0.1153, over 4268033.23 frames. 
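], batch size: 211, lr: 1.09e-02, grad_scale: 32.0

Throughout these train.py entries the reported loss equals a fixed linear combination of the two logged components: loss = 0.5 * simple_loss + pruned_loss (for example, 0.5 * 0.3099 + 0.07835 = 0.2333 at batch 14250). The helper below just restates that relation; the 0.5 weight is inferred from the logged figures, and any scheduling of these weights in the real training code is not shown here.

```python
# The 0.5 weight below is inferred from the logged numbers themselves
# (0.5 * 0.3099 + 0.07835 = 0.2333 at batch 14250); how the real code
# schedules these weights is not verified here.
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5,
                  pruned_loss_scale: float = 1.0) -> float:
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

assert abs(combined_loss(0.3099, 0.07835) - 0.2333) < 5e-4
```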
2023-06-19 11:46:20,051 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 11:46:36,891 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2722, simple_loss=0.3734, pruned_loss=0.08553, over 1796401.00 frames. 2023-06-19 11:46:36,892 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 11:46:55,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=455934.0, ans=0.0 2023-06-19 11:47:35,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-19 11:47:38,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=456054.0, ans=0.125 2023-06-19 11:48:03,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=456174.0, ans=0.125 2023-06-19 11:48:12,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-19 11:48:17,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=456174.0, ans=0.0 2023-06-19 11:48:26,258 INFO [train.py:996] (3/4) Epoch 3, batch 15050, loss[loss=0.2767, simple_loss=0.3622, pruned_loss=0.09562, over 19790.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3591, pruned_loss=0.116, over 4264400.49 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:48:46,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=456294.0, ans=0.125 2023-06-19 11:49:29,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456414.0, ans=0.125 2023-06-19 11:49:36,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-19 11:49:43,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.179e+02 3.857e+02 4.850e+02 8.474e+02, threshold=7.714e+02, percent-clipped=3.0 2023-06-19 11:50:02,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=456474.0, ans=0.125 2023-06-19 11:50:03,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=456474.0, ans=0.125 2023-06-19 11:50:08,209 INFO [train.py:996] (3/4) Epoch 3, batch 15100, loss[loss=0.3441, simple_loss=0.4031, pruned_loss=0.1425, over 21590.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3606, pruned_loss=0.1149, over 4264539.39 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:50:37,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-19 11:50:44,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=456594.0, ans=0.5 2023-06-19 11:51:05,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0
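The "Computing validation loss" block above switches to a held-out dev set mid-epoch and also reports peak CUDA memory. A hedged sketch of such a validation pass follows; compute_loss and dev_loader are hypothetical placeholders, and torch.cuda.max_memory_allocated is the standard PyTorch call for the memory figure.

```python
# Sketch of the validation pass implied by the log: frame-weighted loss
# over the dev loader, then the peak CUDA memory figure. `compute_loss`
# and `dev_loader` are hypothetical placeholders, not icefall's API.
import torch

def validate(model, dev_loader, compute_loss, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}")
    print(f"Maximum memory allocated so far is {max_mb}MB")
    model.train()
    return tot_loss / tot_frames
```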
2023-06-19 11:51:56,400 INFO [train.py:996] (3/4) Epoch 3, batch 15150, loss[loss=0.3592, simple_loss=0.4744, pruned_loss=0.122, over 19754.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3572, pruned_loss=0.1154, over 4271039.88 frames. ], batch size: 702, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:52:18,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-19 11:52:33,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=456954.0, ans=0.125 2023-06-19 11:52:35,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=456954.0, ans=0.0 2023-06-19 11:52:50,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-19 11:52:56,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=457014.0, ans=0.125 2023-06-19 11:53:13,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.143e+02 3.638e+02 4.416e+02 6.792e+02, threshold=7.275e+02, percent-clipped=0.0 2023-06-19 11:53:38,811 INFO [train.py:996] (3/4) Epoch 3, batch 15200, loss[loss=0.2234, simple_loss=0.3168, pruned_loss=0.06496, over 21313.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3483, pruned_loss=0.1108, over 4264497.18 frames. ], batch size: 551, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:53:56,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=457134.0, ans=0.0 2023-06-19 11:54:34,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=457254.0, ans=0.125 2023-06-19 11:55:19,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=457434.0, ans=0.125 2023-06-19 11:55:20,832 INFO [train.py:996] (3/4) Epoch 3, batch 15250, loss[loss=0.3004, simple_loss=0.3475, pruned_loss=0.1267, over 21566.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3427, pruned_loss=0.1092, over 4265074.53 frames. ], batch size: 415, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:56:07,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs.
limit=15.0 2023-06-19 11:56:19,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=457614.0, ans=0.125 2023-06-19 11:56:19,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=457614.0, ans=0.125 2023-06-19 11:56:43,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.938e+02 3.532e+02 4.223e+02 6.837e+02, threshold=7.064e+02, percent-clipped=0.0 2023-06-19 11:56:54,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=457674.0, ans=0.0 2023-06-19 11:57:00,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=457674.0, ans=0.125 2023-06-19 11:57:08,074 INFO [train.py:996] (3/4) Epoch 3, batch 15300, loss[loss=0.4177, simple_loss=0.4315, pruned_loss=0.2019, over 21446.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.348, pruned_loss=0.1142, over 4271348.44 frames. ], batch size: 510, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:58:05,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=457914.0, ans=0.0 2023-06-19 11:58:25,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-19 11:58:39,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=457974.0, ans=0.125 2023-06-19 11:58:43,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=458034.0, ans=0.2 2023-06-19 11:58:44,323 INFO [train.py:996] (3/4) Epoch 3, batch 15350, loss[loss=0.3291, simple_loss=0.3923, pruned_loss=0.1329, over 21453.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3528, pruned_loss=0.1167, over 4270739.22 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:59:08,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=458094.0, ans=0.5 2023-06-19 11:59:23,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=458094.0, ans=0.0 2023-06-19 11:59:23,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=458094.0, ans=0.0 2023-06-19 11:59:38,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. 
limit=15.0 2023-06-19 11:59:42,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:59:48,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458214.0, ans=0.1 2023-06-19 11:59:54,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.004e+02 3.619e+02 4.702e+02 1.047e+03, threshold=7.238e+02, percent-clipped=6.0 2023-06-19 12:00:06,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=458274.0, ans=0.125 2023-06-19 12:00:24,063 INFO [train.py:996] (3/4) Epoch 3, batch 15400, loss[loss=0.2679, simple_loss=0.3338, pruned_loss=0.101, over 21500.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3515, pruned_loss=0.1137, over 4270916.78 frames. ], batch size: 211, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:00:30,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=458334.0, ans=0.125 2023-06-19 12:01:25,510 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:01:51,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458574.0, ans=0.125 2023-06-19 12:02:02,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=458574.0, ans=0.0 2023-06-19 12:02:06,909 INFO [train.py:996] (3/4) Epoch 3, batch 15450, loss[loss=0.2979, simple_loss=0.3959, pruned_loss=0.09995, over 19623.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3486, pruned_loss=0.112, over 4259886.70 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:02:08,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=458634.0, ans=0.2 2023-06-19 12:02:21,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458634.0, ans=0.1 2023-06-19 12:03:23,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.889e+02 3.450e+02 3.978e+02 6.262e+02, threshold=6.899e+02, percent-clipped=0.0 2023-06-19 12:03:54,353 INFO [train.py:996] (3/4) Epoch 3, batch 15500, loss[loss=0.4106, simple_loss=0.4388, pruned_loss=0.1912, over 21338.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3516, pruned_loss=0.1119, over 4257227.88 frames. 
], batch size: 507, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:04:34,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=459054.0, ans=0.125 2023-06-19 12:05:10,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459114.0, ans=0.1 2023-06-19 12:05:10,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=459114.0, ans=0.125 2023-06-19 12:05:22,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=459174.0, ans=0.125 2023-06-19 12:05:26,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=459174.0, ans=22.5 2023-06-19 12:05:37,221 INFO [train.py:996] (3/4) Epoch 3, batch 15550, loss[loss=0.3233, simple_loss=0.3715, pruned_loss=0.1375, over 21474.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3509, pruned_loss=0.1089, over 4259056.20 frames. ], batch size: 508, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:05:37,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=459234.0, ans=0.0 2023-06-19 12:05:37,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=459234.0, ans=0.0 2023-06-19 12:05:47,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-19 12:05:51,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=459234.0, ans=0.125 2023-06-19 12:05:52,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=459234.0, ans=0.125 2023-06-19 12:06:05,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=459294.0, ans=0.125 2023-06-19 12:06:11,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-19 12:06:12,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=459294.0, ans=0.125 2023-06-19 12:06:46,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=459414.0, ans=0.0 2023-06-19 12:06:50,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.72 vs. limit=22.5 2023-06-19 12:06:54,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.997e+02 3.470e+02 4.241e+02 8.422e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-19 12:07:09,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=459474.0, ans=0.2 2023-06-19 12:07:12,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=459474.0, ans=0.0 2023-06-19 12:07:18,536 INFO [train.py:996] (3/4) Epoch 3, batch 15600, loss[loss=0.2714, simple_loss=0.3224, pruned_loss=0.1103, over 21311.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3462, pruned_loss=0.1076, over 4237003.24 frames. 
], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:07:20,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-19 12:07:30,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=459534.0, ans=0.125 2023-06-19 12:08:44,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=459774.0, ans=0.125 2023-06-19 12:09:06,277 INFO [train.py:996] (3/4) Epoch 3, batch 15650, loss[loss=0.3053, simple_loss=0.3579, pruned_loss=0.1264, over 21439.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3422, pruned_loss=0.1065, over 4244536.82 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:09:09,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=459834.0, ans=0.0 2023-06-19 12:09:09,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=459834.0, ans=0.07 2023-06-19 12:10:22,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.856e+02 3.445e+02 4.569e+02 7.529e+02, threshold=6.891e+02, percent-clipped=2.0 2023-06-19 12:10:47,548 INFO [train.py:996] (3/4) Epoch 3, batch 15700, loss[loss=0.2225, simple_loss=0.2841, pruned_loss=0.08042, over 21808.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3385, pruned_loss=0.1063, over 4247231.64 frames. ], batch size: 112, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:11:47,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-06-19 12:11:49,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=460314.0, ans=0.125 2023-06-19 12:12:28,099 INFO [train.py:996] (3/4) Epoch 3, batch 15750, loss[loss=0.2695, simple_loss=0.3128, pruned_loss=0.1131, over 15957.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3338, pruned_loss=0.1058, over 4253462.43 frames. ], batch size: 64, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:12:50,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=460494.0, ans=0.125 2023-06-19 12:13:43,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460614.0, ans=0.1 2023-06-19 12:13:46,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.862e+02 3.455e+02 4.105e+02 6.683e+02, threshold=6.910e+02, percent-clipped=1.0 2023-06-19 12:14:09,442 INFO [train.py:996] (3/4) Epoch 3, batch 15800, loss[loss=0.2619, simple_loss=0.3087, pruned_loss=0.1075, over 21433.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3292, pruned_loss=0.1056, over 4257451.67 frames. 
], batch size: 194, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:14:17,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=460734.0, ans=0.125 2023-06-19 12:14:29,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=460794.0, ans=0.125 2023-06-19 12:14:50,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-19 12:14:59,689 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:15:24,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460914.0, ans=0.1 2023-06-19 12:15:27,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=460914.0, ans=0.025 2023-06-19 12:15:32,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 12:15:52,338 INFO [train.py:996] (3/4) Epoch 3, batch 15850, loss[loss=0.2802, simple_loss=0.3331, pruned_loss=0.1136, over 21796.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3308, pruned_loss=0.1077, over 4254861.75 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:16:08,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=461094.0, ans=0.125 2023-06-19 12:16:24,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-19 12:17:03,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=461214.0, ans=0.125 2023-06-19 12:17:08,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-19 12:17:10,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.908e+02 3.639e+02 4.203e+02 7.869e+02, threshold=7.277e+02, percent-clipped=1.0 2023-06-19 12:17:35,015 INFO [train.py:996] (3/4) Epoch 3, batch 15900, loss[loss=0.2722, simple_loss=0.3492, pruned_loss=0.09759, over 21539.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3281, pruned_loss=0.1069, over 4263555.66 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:17:37,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461334.0, ans=0.1 2023-06-19 12:18:18,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461454.0, ans=0.125 2023-06-19 12:18:55,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-19 12:19:16,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=461634.0, ans=0.2 2023-06-19 12:19:17,493 INFO [train.py:996] (3/4) Epoch 3, batch 15950, loss[loss=0.2411, simple_loss=0.3168, pruned_loss=0.08266, over 21738.00 frames. 
], tot_loss[loss=0.2682, simple_loss=0.3287, pruned_loss=0.1039, over 4261135.10 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:19:44,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461694.0, ans=0.125 2023-06-19 12:20:22,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 12:20:36,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.793e+02 3.209e+02 4.219e+02 1.070e+03, threshold=6.418e+02, percent-clipped=5.0 2023-06-19 12:20:59,850 INFO [train.py:996] (3/4) Epoch 3, batch 16000, loss[loss=0.2627, simple_loss=0.3507, pruned_loss=0.08735, over 21672.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3295, pruned_loss=0.1011, over 4255437.04 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:21:11,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=461934.0, ans=0.125 2023-06-19 12:21:12,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=461934.0, ans=0.0 2023-06-19 12:22:42,079 INFO [train.py:996] (3/4) Epoch 3, batch 16050, loss[loss=0.3549, simple_loss=0.4363, pruned_loss=0.1367, over 21513.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3332, pruned_loss=0.09864, over 4249749.07 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:23:00,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=462294.0, ans=0.125 2023-06-19 12:23:12,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462354.0, ans=0.1 2023-06-19 12:24:00,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 3.084e+02 3.532e+02 4.519e+02 7.240e+02, threshold=7.063e+02, percent-clipped=4.0 2023-06-19 12:24:00,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=462474.0, ans=0.125 2023-06-19 12:24:11,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=462474.0, ans=0.125 2023-06-19 12:24:17,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=462474.0, ans=0.125 2023-06-19 12:24:18,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=462474.0, ans=0.125 2023-06-19 12:24:23,177 INFO [train.py:996] (3/4) Epoch 3, batch 16100, loss[loss=0.3296, simple_loss=0.3738, pruned_loss=0.1427, over 21634.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3378, pruned_loss=0.1003, over 4255040.52 frames. ], batch size: 471, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:24:24,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. 
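limit=15.0

The scaling.py:962 Whitening entries compare a per-module whitening metric against a limit (metric=12.80 vs. limit=15.0 just above). One plausible definition, normalized so a perfectly whitened (identity-covariance) signal scores about 1.0, is sketched below; this is an illustrative reconstruction, not necessarily the exact formula behind the logged numbers.

```python
# Illustrative whitening metric: distance of the channel covariance from
# a scaled identity. White noise scores ~1; fully correlated channels
# score ~num_channels. An assumed definition, not necessarily the exact
# formula in scaling.py.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]              # channel covariance
    num_channels = x.shape[1]
    return (cov ** 2).sum() / (cov.diag().mean() ** 2 * num_channels)

print(whitening_metric(torch.randn(1000, 256)))            # close to 1.0
x = torch.randn(1000, 1).expand(1000, 256).contiguous()    # correlated
print(whitening_metric(x))                                 # close to 256
```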
2023-06-19 12:24:26,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=462534.0, ans=0.04949747468305833 2023-06-19 12:24:53,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462654.0, ans=0.1 2023-06-19 12:24:59,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462654.0, ans=0.1 2023-06-19 12:25:26,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-19 12:25:57,544 INFO [train.py:996] (3/4) Epoch 3, batch 16150, loss[loss=0.2891, simple_loss=0.3711, pruned_loss=0.1036, over 19904.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3396, pruned_loss=0.104, over 4272184.11 frames. ], batch size: 703, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:27:02,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=463014.0, ans=15.0 2023-06-19 12:27:15,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-19 12:27:16,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.948e+02 3.427e+02 4.312e+02 9.423e+02, threshold=6.854e+02, percent-clipped=2.0 2023-06-19 12:27:20,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=463074.0, ans=0.125 2023-06-19 12:27:25,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=463074.0, ans=0.0 2023-06-19 12:27:32,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463074.0, ans=0.125 2023-06-19 12:27:40,000 INFO [train.py:996] (3/4) Epoch 3, batch 16200, loss[loss=0.3333, simple_loss=0.3965, pruned_loss=0.1351, over 21435.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3448, pruned_loss=0.1063, over 4276111.89 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:27:44,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-19 12:28:06,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=463194.0, ans=0.125 2023-06-19 12:28:38,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-19 12:29:15,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=463374.0, ans=0.125 2023-06-19 12:29:21,908 INFO [train.py:996] (3/4) Epoch 3, batch 16250, loss[loss=0.1999, simple_loss=0.2718, pruned_loss=0.064, over 21286.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3424, pruned_loss=0.1055, over 4281091.37 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:29:46,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs.
limit=10.0 2023-06-19 12:29:50,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463494.0, ans=0.125 2023-06-19 12:30:45,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.795e+02 3.241e+02 4.405e+02 7.562e+02, threshold=6.482e+02, percent-clipped=2.0 2023-06-19 12:30:55,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=463674.0, ans=0.2 2023-06-19 12:30:57,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=463674.0, ans=0.2 2023-06-19 12:31:03,296 INFO [train.py:996] (3/4) Epoch 3, batch 16300, loss[loss=0.2195, simple_loss=0.3084, pruned_loss=0.06532, over 21716.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3347, pruned_loss=0.1008, over 4276655.60 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:31:18,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463794.0, ans=0.125 2023-06-19 12:31:59,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=463914.0, ans=0.0 2023-06-19 12:32:15,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=463914.0, ans=0.0 2023-06-19 12:32:27,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463974.0, ans=0.1 2023-06-19 12:32:37,019 INFO [train.py:996] (3/4) Epoch 3, batch 16350, loss[loss=0.2896, simple_loss=0.3539, pruned_loss=0.1126, over 21941.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3365, pruned_loss=0.1029, over 4277485.80 frames. ], batch size: 372, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:32:46,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 12:33:08,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=464094.0, ans=0.5 2023-06-19 12:33:37,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=464154.0, ans=0.125 2023-06-19 12:33:38,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464154.0, ans=0.1 2023-06-19 12:33:59,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=464214.0, ans=0.05 2023-06-19 12:34:02,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.064e+02 3.648e+02 5.135e+02 1.076e+03, threshold=7.296e+02, percent-clipped=9.0 2023-06-19 12:34:18,654 INFO [train.py:996] (3/4) Epoch 3, batch 16400, loss[loss=0.2423, simple_loss=0.3063, pruned_loss=0.08915, over 21803.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3424, pruned_loss=0.1064, over 4277503.89 frames. 
], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:35:08,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=464454.0, ans=0.125 2023-06-19 12:35:51,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=464574.0, ans=0.0 2023-06-19 12:36:00,605 INFO [train.py:996] (3/4) Epoch 3, batch 16450, loss[loss=0.2562, simple_loss=0.3205, pruned_loss=0.09597, over 21654.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3409, pruned_loss=0.1062, over 4281746.74 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:36:05,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=464634.0, ans=0.0 2023-06-19 12:36:10,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=464634.0, ans=0.04949747468305833 2023-06-19 12:36:22,545 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:36:23,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=464694.0, ans=0.125 2023-06-19 12:36:43,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=464754.0, ans=0.125 2023-06-19 12:37:09,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-19 12:37:21,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.011e+02 3.468e+02 3.986e+02 7.351e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-19 12:37:32,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=464874.0, ans=0.125 2023-06-19 12:37:38,556 INFO [train.py:996] (3/4) Epoch 3, batch 16500, loss[loss=0.2157, simple_loss=0.2686, pruned_loss=0.08142, over 21649.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3391, pruned_loss=0.1064, over 4287185.55 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:38:06,987 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:38:34,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=465054.0, ans=0.125 2023-06-19 12:38:39,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=465054.0, ans=0.125 2023-06-19 12:39:01,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465174.0, ans=0.1 2023-06-19 12:39:15,960 INFO [train.py:996] (3/4) Epoch 3, batch 16550, loss[loss=0.2955, simple_loss=0.3629, pruned_loss=0.1141, over 21843.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3376, pruned_loss=0.1035, over 4276669.94 frames. 
], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:40:42,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.124e+02 3.787e+02 4.304e+02 9.133e+02, threshold=7.574e+02, percent-clipped=3.0 2023-06-19 12:41:09,200 INFO [train.py:996] (3/4) Epoch 3, batch 16600, loss[loss=0.3596, simple_loss=0.4423, pruned_loss=0.1384, over 21729.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3478, pruned_loss=0.108, over 4267323.73 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:41:35,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-19 12:41:36,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=465594.0, ans=0.2 2023-06-19 12:41:54,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=465654.0, ans=0.125 2023-06-19 12:41:56,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-19 12:42:13,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 12:42:26,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465714.0, ans=0.1 2023-06-19 12:42:58,987 INFO [train.py:996] (3/4) Epoch 3, batch 16650, loss[loss=0.2849, simple_loss=0.3512, pruned_loss=0.1093, over 21791.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3586, pruned_loss=0.1122, over 4267183.51 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:42:59,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=465834.0, ans=0.0 2023-06-19 12:43:15,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-19 12:43:21,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-19 12:43:37,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-06-19 12:43:55,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=465954.0, ans=0.0 2023-06-19 12:44:16,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=466014.0, ans=0.2 2023-06-19 12:44:27,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.247e+02 3.781e+02 4.657e+02 6.369e+02, threshold=7.563e+02, percent-clipped=0.0 2023-06-19 12:44:29,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466074.0, ans=0.125 2023-06-19 12:44:49,106 INFO [train.py:996] (3/4) Epoch 3, batch 16700, loss[loss=0.2717, simple_loss=0.3535, pruned_loss=0.09497, over 21632.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3611, pruned_loss=0.1141, over 4270716.56 frames. 
], batch size: 389, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:45:40,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=466254.0, ans=0.0 2023-06-19 12:46:11,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=466314.0, ans=0.0 2023-06-19 12:46:16,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=466374.0, ans=0.5 2023-06-19 12:46:35,212 INFO [train.py:996] (3/4) Epoch 3, batch 16750, loss[loss=0.3445, simple_loss=0.4235, pruned_loss=0.1327, over 21694.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.363, pruned_loss=0.1157, over 4275685.63 frames. ], batch size: 441, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:46:39,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=466434.0, ans=0.2 2023-06-19 12:47:46,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-19 12:48:01,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.898e+02 3.370e+02 4.211e+02 9.702e+02, threshold=6.740e+02, percent-clipped=1.0 2023-06-19 12:48:04,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-19 12:48:07,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=466674.0, ans=0.2 2023-06-19 12:48:13,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.23 vs. limit=22.5 2023-06-19 12:48:22,820 INFO [train.py:996] (3/4) Epoch 3, batch 16800, loss[loss=0.2922, simple_loss=0.348, pruned_loss=0.1182, over 21891.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3671, pruned_loss=0.1164, over 4272739.34 frames. ], batch size: 316, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:50:04,939 INFO [train.py:996] (3/4) Epoch 3, batch 16850, loss[loss=0.3101, simple_loss=0.3617, pruned_loss=0.1292, over 21466.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3645, pruned_loss=0.1173, over 4282602.15 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:50:44,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=467094.0, ans=0.0 2023-06-19 12:51:18,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-19 12:51:25,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.898e+02 3.370e+02 4.330e+02 9.168e+02, threshold=6.739e+02, percent-clipped=5.0 2023-06-19 12:51:45,951 INFO [train.py:996] (3/4) Epoch 3, batch 16900, loss[loss=0.3375, simple_loss=0.3812, pruned_loss=0.1469, over 20068.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3577, pruned_loss=0.1148, over 4287911.71 frames. 
], batch size: 703, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:52:01,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=467334.0, ans=0.5 2023-06-19 12:53:02,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=467514.0, ans=0.125 2023-06-19 12:53:26,662 INFO [train.py:996] (3/4) Epoch 3, batch 16950, loss[loss=0.2717, simple_loss=0.3285, pruned_loss=0.1074, over 21814.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3493, pruned_loss=0.1125, over 4290827.53 frames. ], batch size: 298, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:53:27,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=467634.0, ans=0.125 2023-06-19 12:54:00,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=467694.0, ans=0.04949747468305833 2023-06-19 12:54:46,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.834e+02 3.306e+02 3.951e+02 5.809e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 12:54:55,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=467874.0, ans=0.2 2023-06-19 12:55:08,173 INFO [train.py:996] (3/4) Epoch 3, batch 17000, loss[loss=0.2677, simple_loss=0.3222, pruned_loss=0.1066, over 21373.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3452, pruned_loss=0.1127, over 4295053.95 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:55:20,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=467934.0, ans=0.125 2023-06-19 12:55:32,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=467934.0, ans=0.2 2023-06-19 12:55:34,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=467994.0, ans=0.0 2023-06-19 12:55:48,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-06-19 12:56:03,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468054.0, ans=0.1 2023-06-19 12:56:03,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-19 12:56:19,666 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:56:30,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=468174.0, ans=0.95 2023-06-19 12:56:49,424 INFO [train.py:996] (3/4) Epoch 3, batch 17050, loss[loss=0.2831, simple_loss=0.3621, pruned_loss=0.102, over 21847.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3535, pruned_loss=0.1157, over 4299961.81 frames. 
], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:57:18,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=468294.0, ans=0.125 2023-06-19 12:57:36,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=468354.0, ans=0.125 2023-06-19 12:57:41,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=468354.0, ans=0.125 2023-06-19 12:57:50,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468354.0, ans=0.1 2023-06-19 12:58:14,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.024e+02 3.438e+02 4.032e+02 7.555e+02, threshold=6.877e+02, percent-clipped=1.0 2023-06-19 12:58:30,299 INFO [train.py:996] (3/4) Epoch 3, batch 17100, loss[loss=0.3535, simple_loss=0.3791, pruned_loss=0.164, over 21720.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3522, pruned_loss=0.1161, over 4304517.47 frames. ], batch size: 508, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:59:13,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=468594.0, ans=0.0 2023-06-19 12:59:57,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=468774.0, ans=0.2 2023-06-19 13:00:11,699 INFO [train.py:996] (3/4) Epoch 3, batch 17150, loss[loss=0.2332, simple_loss=0.3108, pruned_loss=0.0778, over 21827.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3476, pruned_loss=0.115, over 4303518.76 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:00:26,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468834.0, ans=0.125 2023-06-19 13:00:27,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-19 13:00:41,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-19 13:01:03,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-19 13:01:14,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=468954.0, ans=0.125 2023-06-19 13:01:32,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=469014.0, ans=0.125 2023-06-19 13:01:38,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 2.874e+02 3.285e+02 3.849e+02 6.375e+02, threshold=6.570e+02, percent-clipped=0.0 2023-06-19 13:01:55,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=469074.0, ans=0.035 2023-06-19 13:02:09,759 INFO [train.py:996] (3/4) Epoch 3, batch 17200, loss[loss=0.2884, simple_loss=0.3596, pruned_loss=0.1086, over 21487.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3464, pruned_loss=0.1132, over 4297472.91 frames. 
], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:02:20,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-19 13:02:37,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.28 vs. limit=6.0 2023-06-19 13:02:40,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-19 13:02:53,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-19 13:03:04,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=469314.0, ans=0.125 2023-06-19 13:03:35,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=469374.0, ans=0.0 2023-06-19 13:03:42,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=469374.0, ans=0.5 2023-06-19 13:03:53,527 INFO [train.py:996] (3/4) Epoch 3, batch 17250, loss[loss=0.2816, simple_loss=0.3536, pruned_loss=0.1048, over 21670.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3493, pruned_loss=0.1145, over 4289678.12 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:03:57,191 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:05:20,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.292e+02 3.993e+02 5.117e+02 9.442e+02, threshold=7.987e+02, percent-clipped=7.0 2023-06-19 13:05:37,144 INFO [train.py:996] (3/4) Epoch 3, batch 17300, loss[loss=0.3056, simple_loss=0.3853, pruned_loss=0.1129, over 20716.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3592, pruned_loss=0.1183, over 4282550.98 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:05:59,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=469794.0, ans=0.2 2023-06-19 13:06:18,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-19 13:06:31,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469854.0, ans=0.1 2023-06-19 13:06:42,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=469914.0, ans=0.015 2023-06-19 13:06:59,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=469974.0, ans=0.09899494936611666 2023-06-19 13:07:08,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-19 13:07:14,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=470034.0, ans=15.0 2023-06-19 13:07:15,532 INFO [train.py:996] (3/4) Epoch 3, batch 17350, loss[loss=0.256, simple_loss=0.3298, pruned_loss=0.09105, over 21642.00 frames. 
], tot_loss[loss=0.2987, simple_loss=0.3614, pruned_loss=0.118, over 4276937.71 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:08:42,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.889e+02 3.414e+02 4.320e+02 8.908e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 13:08:47,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=470274.0, ans=0.2 2023-06-19 13:08:58,904 INFO [train.py:996] (3/4) Epoch 3, batch 17400, loss[loss=0.2376, simple_loss=0.2959, pruned_loss=0.08963, over 21340.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3566, pruned_loss=0.1142, over 4277012.07 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:09:58,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470454.0, ans=0.125 2023-06-19 13:10:47,896 INFO [train.py:996] (3/4) Epoch 3, batch 17450, loss[loss=0.2123, simple_loss=0.3022, pruned_loss=0.06116, over 21618.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3499, pruned_loss=0.1093, over 4270756.79 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:11:25,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=470694.0, ans=0.125 2023-06-19 13:11:26,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-19 13:12:06,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.874e+02 3.534e+02 4.725e+02 8.315e+02, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 13:12:27,738 INFO [train.py:996] (3/4) Epoch 3, batch 17500, loss[loss=0.2598, simple_loss=0.3225, pruned_loss=0.09854, over 21438.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3446, pruned_loss=0.106, over 4270108.75 frames. ], batch size: 194, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:13:49,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-19 13:14:07,316 INFO [train.py:996] (3/4) Epoch 3, batch 17550, loss[loss=0.2386, simple_loss=0.3154, pruned_loss=0.08089, over 21889.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3443, pruned_loss=0.1051, over 4261063.20 frames. ], batch size: 98, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:15:12,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=471414.0, ans=0.0 2023-06-19 13:15:26,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.757e+02 3.626e+02 4.370e+02 8.420e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-19 13:15:48,150 INFO [train.py:996] (3/4) Epoch 3, batch 17600, loss[loss=0.2571, simple_loss=0.3412, pruned_loss=0.08647, over 21839.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3454, pruned_loss=0.1044, over 4266562.45 frames. 
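The optim.py clipping entries report five quantiles (min / 25% / median / 75% / max) of recent gradient norms, and in every entry the threshold equals Clipping_scale times the logged median (2.0 x 3.414e+02 matches the 6.829e+02 threshold just above, up to rounding). A sketch of median-based adaptive clipping under that assumption; the class and its bookkeeping are illustrative, not icefall's optimizer:

```python
import torch

class MedianGradClipper:
    """Clip at clipping_scale * median of a window of recent grad norms."""
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.scale, self.window = clipping_scale, window
        self.norms, self.clipped, self.steps = [], 0, 0

    def __call__(self, params: list) -> None:
        # max_norm=inf: measure the total grad norm without clipping yet
        norm = float(torch.nn.utils.clip_grad_norm_(params, float("inf")))
        self.norms = (self.norms + [norm])[-self.window:]
        q = torch.quantile(torch.tensor(self.norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.scale * float(q[2])          # 2.0 x median
        self.steps += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        print(f"grad-norm quartiles "
              f"{' '.join(f'{v:.3e}' for v in q.tolist())}, "
              f"threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.clipped / self.steps:.1f}")
```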
], batch size: 124, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:16:06,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=471534.0, ans=0.125 2023-06-19 13:16:30,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=471594.0, ans=0.0 2023-06-19 13:16:44,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-19 13:17:35,466 INFO [train.py:996] (3/4) Epoch 3, batch 17650, loss[loss=0.2002, simple_loss=0.2606, pruned_loss=0.06991, over 21674.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3431, pruned_loss=0.1041, over 4273657.32 frames. ], batch size: 247, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:18:55,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.888e+02 3.326e+02 4.505e+02 7.697e+02, threshold=6.651e+02, percent-clipped=2.0 2023-06-19 13:19:17,596 INFO [train.py:996] (3/4) Epoch 3, batch 17700, loss[loss=0.2768, simple_loss=0.3568, pruned_loss=0.09842, over 21744.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3394, pruned_loss=0.1015, over 4278433.75 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:19:36,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472134.0, ans=0.1 2023-06-19 13:19:44,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472194.0, ans=0.1 2023-06-19 13:19:46,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=472194.0, ans=0.125 2023-06-19 13:20:38,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=472314.0, ans=0.0 2023-06-19 13:20:44,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=472374.0, ans=0.0 2023-06-19 13:20:45,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=472374.0, ans=0.0 2023-06-19 13:21:10,364 INFO [train.py:996] (3/4) Epoch 3, batch 17750, loss[loss=0.3301, simple_loss=0.3878, pruned_loss=0.1362, over 21246.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3475, pruned_loss=0.1058, over 4274227.95 frames. ], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:21:14,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-19 13:21:16,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-19 13:21:17,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=15.0 2023-06-19 13:21:19,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=472434.0, ans=0.125 2023-06-19 13:21:32,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=472494.0, ans=0.0 2023-06-19 13:21:39,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-19 13:21:42,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472554.0, ans=0.1 2023-06-19 13:21:44,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472554.0, ans=0.1 2023-06-19 13:22:30,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-19 13:22:32,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.975e+02 3.463e+02 4.383e+02 8.374e+02, threshold=6.927e+02, percent-clipped=5.0 2023-06-19 13:22:54,291 INFO [train.py:996] (3/4) Epoch 3, batch 17800, loss[loss=0.2295, simple_loss=0.3132, pruned_loss=0.07289, over 21739.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3494, pruned_loss=0.1067, over 4276501.73 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:23:09,229 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:23:19,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=472794.0, ans=0.125 2023-06-19 13:23:32,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=472854.0, ans=0.2 2023-06-19 13:23:48,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-06-19 13:24:23,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-19 13:24:37,168 INFO [train.py:996] (3/4) Epoch 3, batch 17850, loss[loss=0.3113, simple_loss=0.3719, pruned_loss=0.1254, over 21603.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3504, pruned_loss=0.1078, over 4274956.12 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:25:07,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=473094.0, ans=0.2 2023-06-19 13:25:07,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=473094.0, ans=0.0 2023-06-19 13:25:22,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=473154.0, ans=0.125 2023-06-19 13:26:02,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.042e+02 3.981e+02 5.013e+02 8.666e+02, threshold=7.962e+02, percent-clipped=5.0 2023-06-19 13:26:08,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. 
limit=6.0 2023-06-19 13:26:18,584 INFO [train.py:996] (3/4) Epoch 3, batch 17900, loss[loss=0.2676, simple_loss=0.324, pruned_loss=0.1056, over 20133.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3561, pruned_loss=0.1103, over 4277990.43 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:26:22,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473334.0, ans=0.1 2023-06-19 13:27:42,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=473514.0, ans=0.0 2023-06-19 13:28:06,415 INFO [train.py:996] (3/4) Epoch 3, batch 17950, loss[loss=0.2497, simple_loss=0.3328, pruned_loss=0.08335, over 21745.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3538, pruned_loss=0.1058, over 4282441.10 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:28:21,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-19 13:28:37,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=473694.0, ans=0.0 2023-06-19 13:28:59,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=473754.0, ans=0.0 2023-06-19 13:29:19,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=12.0 2023-06-19 13:29:25,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=473874.0, ans=0.125 2023-06-19 13:29:26,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.695e+02 3.486e+02 4.539e+02 1.017e+03, threshold=6.972e+02, percent-clipped=4.0 2023-06-19 13:29:47,379 INFO [train.py:996] (3/4) Epoch 3, batch 18000, loss[loss=0.2575, simple_loss=0.3037, pruned_loss=0.1056, over 21212.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3459, pruned_loss=0.1042, over 4278224.19 frames. ], batch size: 548, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:29:47,379 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 13:30:08,400 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2748, simple_loss=0.3795, pruned_loss=0.08502, over 1796401.00 frames. 2023-06-19 13:30:08,400 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 13:30:54,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=474054.0, ans=0.0 2023-06-19 13:31:49,932 INFO [train.py:996] (3/4) Epoch 3, batch 18050, loss[loss=0.2231, simple_loss=0.2863, pruned_loss=0.08001, over 21780.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3394, pruned_loss=0.103, over 4277936.32 frames. 
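The three loss fields in these entries are internally consistent with a weighted pruned-RNN-T objective of the form loss = 0.5 * simple_loss + pruned_loss: the epoch-3 validation entry just above gives 0.5 * 0.3795 + 0.08502 = 0.27477, matching the logged loss=0.2748, and the batch-18000 training entry gives 0.5 * 0.3037 + 0.1056 = 0.25745, matching loss=0.2575. A two-line check:

```python
# loss ~= 0.5 * simple_loss + pruned_loss, checked on the entries above
print(0.5 * 0.3795 + 0.08502)  # 0.27477 ~ the logged validation loss=0.2748
print(0.5 * 0.3037 + 0.1056)   # 0.25745 ~ the logged batch-18000 loss=0.2575
```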
], batch size: 371, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:32:09,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=474234.0, ans=0.125 2023-06-19 13:32:25,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=474294.0, ans=0.125 2023-06-19 13:33:04,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474414.0, ans=0.1 2023-06-19 13:33:07,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=474414.0, ans=0.0 2023-06-19 13:33:10,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.274e+02 3.707e+02 4.391e+02 9.006e+02, threshold=7.414e+02, percent-clipped=2.0 2023-06-19 13:33:32,285 INFO [train.py:996] (3/4) Epoch 3, batch 18100, loss[loss=0.2554, simple_loss=0.3468, pruned_loss=0.08204, over 21234.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3445, pruned_loss=0.1065, over 4268686.23 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:34:58,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=474774.0, ans=0.05 2023-06-19 13:35:07,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-19 13:35:18,752 INFO [train.py:996] (3/4) Epoch 3, batch 18150, loss[loss=0.253, simple_loss=0.322, pruned_loss=0.09201, over 21192.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.344, pruned_loss=0.1056, over 4267891.20 frames. ], batch size: 549, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:35:35,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=474834.0, ans=0.02 2023-06-19 13:35:55,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474954.0, ans=0.1 2023-06-19 13:36:11,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-19 13:36:31,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.099e+02 3.760e+02 4.824e+02 9.400e+02, threshold=7.520e+02, percent-clipped=8.0 2023-06-19 13:36:52,781 INFO [train.py:996] (3/4) Epoch 3, batch 18200, loss[loss=0.2516, simple_loss=0.3108, pruned_loss=0.09622, over 21354.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3375, pruned_loss=0.106, over 4254640.49 frames. 
], batch size: 144, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:37:25,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475194.0, ans=0.125 2023-06-19 13:37:42,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=475254.0, ans=0.2 2023-06-19 13:37:58,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475314.0, ans=0.1 2023-06-19 13:38:20,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475374.0, ans=0.1 2023-06-19 13:38:31,996 INFO [train.py:996] (3/4) Epoch 3, batch 18250, loss[loss=0.2772, simple_loss=0.3293, pruned_loss=0.1126, over 21720.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3295, pruned_loss=0.1025, over 4257534.30 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:38:32,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=475434.0, ans=15.0 2023-06-19 13:38:59,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=475494.0, ans=0.0 2023-06-19 13:39:21,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=475554.0, ans=10.0 2023-06-19 13:39:45,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.479e+02 3.020e+02 3.989e+02 8.042e+02, threshold=6.040e+02, percent-clipped=2.0 2023-06-19 13:40:06,821 INFO [train.py:996] (3/4) Epoch 3, batch 18300, loss[loss=0.2811, simple_loss=0.3429, pruned_loss=0.1097, over 21427.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3274, pruned_loss=0.1024, over 4253986.99 frames. ], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:41:46,932 INFO [train.py:996] (3/4) Epoch 3, batch 18350, loss[loss=0.2971, simple_loss=0.3858, pruned_loss=0.1042, over 21389.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3335, pruned_loss=0.1014, over 4246837.22 frames. ], batch size: 194, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:42:00,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=476034.0, ans=0.125 2023-06-19 13:42:02,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=476034.0, ans=0.125 2023-06-19 13:43:08,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.933e+02 3.430e+02 4.228e+02 7.523e+02, threshold=6.860e+02, percent-clipped=6.0 2023-06-19 13:43:09,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=476274.0, ans=0.125 2023-06-19 13:43:22,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-19 13:43:28,024 INFO [train.py:996] (3/4) Epoch 3, batch 18400, loss[loss=0.2337, simple_loss=0.3007, pruned_loss=0.08336, over 21625.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3317, pruned_loss=0.1019, over 4253484.85 frames. 
], batch size: 263, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:44:40,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=476514.0, ans=0.125 2023-06-19 13:44:53,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=476574.0, ans=0.2 2023-06-19 13:45:02,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476574.0, ans=0.1 2023-06-19 13:45:08,768 INFO [train.py:996] (3/4) Epoch 3, batch 18450, loss[loss=0.2436, simple_loss=0.3161, pruned_loss=0.08553, over 21676.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3274, pruned_loss=0.09657, over 4245961.84 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:45:46,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=476694.0, ans=0.2 2023-06-19 13:46:01,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=476754.0, ans=0.125 2023-06-19 13:46:20,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=476814.0, ans=0.0 2023-06-19 13:46:31,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.812e+02 3.346e+02 4.382e+02 1.092e+03, threshold=6.692e+02, percent-clipped=3.0 2023-06-19 13:46:49,831 INFO [train.py:996] (3/4) Epoch 3, batch 18500, loss[loss=0.2064, simple_loss=0.2764, pruned_loss=0.06824, over 21506.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3216, pruned_loss=0.09473, over 4233788.57 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:47:11,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=476994.0, ans=0.2 2023-06-19 13:47:38,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=477054.0, ans=0.2 2023-06-19 13:47:55,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=477114.0, ans=0.125 2023-06-19 13:48:15,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-19 13:48:20,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477174.0, ans=0.1 2023-06-19 13:48:23,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=477174.0, ans=0.125 2023-06-19 13:48:30,905 INFO [train.py:996] (3/4) Epoch 3, batch 18550, loss[loss=0.2196, simple_loss=0.2819, pruned_loss=0.07864, over 21208.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3207, pruned_loss=0.09344, over 4238042.74 frames. 
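grad_scale halves from 32.0 to 16.0 between batches 18400 and 18450 and stays there, the usual signature of fp16 dynamic loss scaling backing off after a non-finite gradient. A generic PyTorch AMP sketch of that mechanism (assumes a CUDA device; this is not the exact training loop that produced this log):

```python
import torch

model = torch.nn.Linear(80, 512).cuda()
optim = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(features, targets):
    optim.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optim)              # skipped if any grad is inf/nan
    scaler.update()                 # halves the scale on overflow,
                                    # slowly grows it otherwise
    return scaler.get_scale()       # the value logged as grad_scale
```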
], batch size: 176, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:49:24,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477354.0, ans=0.1 2023-06-19 13:49:59,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.099e+02 3.527e+02 4.215e+02 7.049e+02, threshold=7.053e+02, percent-clipped=1.0 2023-06-19 13:50:13,142 INFO [train.py:996] (3/4) Epoch 3, batch 18600, loss[loss=0.2465, simple_loss=0.3173, pruned_loss=0.08787, over 21656.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.321, pruned_loss=0.09491, over 4234519.21 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:50:48,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=477594.0, ans=0.0 2023-06-19 13:50:54,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=477594.0, ans=0.2 2023-06-19 13:51:59,728 INFO [train.py:996] (3/4) Epoch 3, batch 18650, loss[loss=0.2677, simple_loss=0.3228, pruned_loss=0.1064, over 21804.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3217, pruned_loss=0.09563, over 4236084.20 frames. ], batch size: 317, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:52:05,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-19 13:52:48,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-19 13:53:21,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.016e+02 3.610e+02 4.241e+02 7.263e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-19 13:53:33,747 INFO [train.py:996] (3/4) Epoch 3, batch 18700, loss[loss=0.2606, simple_loss=0.3177, pruned_loss=0.1018, over 21887.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3213, pruned_loss=0.09827, over 4242868.33 frames. ], batch size: 316, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:54:25,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=478254.0, ans=0.0 2023-06-19 13:55:02,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478374.0, ans=0.125 2023-06-19 13:55:07,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478374.0, ans=0.1 2023-06-19 13:55:10,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=478374.0, ans=0.2 2023-06-19 13:55:15,225 INFO [train.py:996] (3/4) Epoch 3, batch 18750, loss[loss=0.2475, simple_loss=0.3, pruned_loss=0.09753, over 21542.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3221, pruned_loss=0.1, over 4243882.42 frames. 
], batch size: 212, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:55:32,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=478434.0, ans=0.2 2023-06-19 13:56:06,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=478554.0, ans=0.0 2023-06-19 13:56:14,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478554.0, ans=0.1 2023-06-19 13:56:43,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.016e+02 3.473e+02 4.351e+02 6.634e+02, threshold=6.946e+02, percent-clipped=0.0 2023-06-19 13:56:56,573 INFO [train.py:996] (3/4) Epoch 3, batch 18800, loss[loss=0.1977, simple_loss=0.2708, pruned_loss=0.06226, over 21767.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3276, pruned_loss=0.1006, over 4254867.81 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:57:41,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478794.0, ans=0.1 2023-06-19 13:57:41,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478794.0, ans=0.1 2023-06-19 13:57:59,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=478914.0, ans=0.125 2023-06-19 13:58:03,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=478914.0, ans=0.125 2023-06-19 13:58:04,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478914.0, ans=0.1 2023-06-19 13:58:15,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=478914.0, ans=0.0 2023-06-19 13:58:31,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-19 13:58:41,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=478974.0, ans=0.125 2023-06-19 13:58:44,186 INFO [train.py:996] (3/4) Epoch 3, batch 18850, loss[loss=0.199, simple_loss=0.2724, pruned_loss=0.06275, over 21611.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3234, pruned_loss=0.095, over 4253841.07 frames. ], batch size: 247, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:59:31,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=479154.0, ans=0.125 2023-06-19 14:00:08,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.738e+02 3.222e+02 4.135e+02 8.390e+02, threshold=6.445e+02, percent-clipped=2.0 2023-06-19 14:00:24,830 INFO [train.py:996] (3/4) Epoch 3, batch 18900, loss[loss=0.306, simple_loss=0.3492, pruned_loss=0.1314, over 21617.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3208, pruned_loss=0.0959, over 4263483.76 frames. 
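The Whitening entries compare a statistic of a module's activations against a limit (metric=4.01 vs. limit=15.0 above); the limits themselves are scheduled, and a corrective penalty is only applied once the metric exceeds the limit. The metric sketched below is one plausible anisotropy measure with the right behavior, equal to 1.0 when the per-group activation covariance is a multiple of the identity (perfectly "white") and growing with the eigenvalue spread; treat it as an illustration, not a verbatim copy of scaling.Whiten:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Anisotropy of the per-group covariance of x: (frames, channels)."""
    n, c = x.shape
    cg = c // num_groups
    x = x.reshape(n, num_groups, cg).transpose(0, 1)   # (groups, n, cg)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n       # (groups, cg, cg)
    trace = cov.diagonal(dim1=1, dim2=2).sum(dim=1)    # sum of eigenvalues
    trace_c2 = (cov * cov).sum(dim=(1, 2))             # sum of squared eigenvalues
    # equals 1.0 iff cov is a multiple of the identity, grows with spread
    return float((cg * trace_c2 / trace.clamp(min=1e-20) ** 2).mean())

x = torch.randn(1000, 256)            # nearly-white activations
metric, limit = whitening_metric(x), 15.0
if metric > limit:                    # only then apply a whitening penalty
    print(f"metric={metric:.2f} vs. limit={limit}")
```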
], batch size: 473, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:01:16,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479454.0, ans=0.1 2023-06-19 14:01:39,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=479514.0, ans=0.0 2023-06-19 14:01:56,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=479574.0, ans=0.0 2023-06-19 14:02:07,433 INFO [train.py:996] (3/4) Epoch 3, batch 18950, loss[loss=0.309, simple_loss=0.3911, pruned_loss=0.1135, over 21872.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3201, pruned_loss=0.09789, over 4270832.24 frames. ], batch size: 333, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:02:34,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479694.0, ans=0.125 2023-06-19 14:02:51,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479754.0, ans=0.125 2023-06-19 14:03:04,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=479754.0, ans=0.0 2023-06-19 14:03:40,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.897e+02 3.488e+02 4.402e+02 6.601e+02, threshold=6.976e+02, percent-clipped=2.0 2023-06-19 14:03:40,490 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:03:42,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=479874.0, ans=0.125 2023-06-19 14:03:42,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479874.0, ans=0.125 2023-06-19 14:03:56,916 INFO [train.py:996] (3/4) Epoch 3, batch 19000, loss[loss=0.282, simple_loss=0.3513, pruned_loss=0.1063, over 21919.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3304, pruned_loss=0.1016, over 4267165.20 frames. ], batch size: 316, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:04:25,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-19 14:05:08,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480114.0, ans=0.1 2023-06-19 14:05:20,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=480174.0, ans=0.125 2023-06-19 14:05:39,657 INFO [train.py:996] (3/4) Epoch 3, batch 19050, loss[loss=0.3171, simple_loss=0.3612, pruned_loss=0.1365, over 21817.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3366, pruned_loss=0.1063, over 4273846.43 frames. 
], batch size: 112, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:05:40,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=480234.0, ans=0.04949747468305833 2023-06-19 14:05:53,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=480234.0, ans=0.125 2023-06-19 14:05:56,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=480234.0, ans=0.0 2023-06-19 14:05:56,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=480234.0, ans=0.07 2023-06-19 14:06:31,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480354.0, ans=0.125 2023-06-19 14:06:41,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=480414.0, ans=0.125 2023-06-19 14:07:04,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.242e+02 3.668e+02 4.263e+02 6.635e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 14:07:06,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=480474.0, ans=0.125 2023-06-19 14:07:21,778 INFO [train.py:996] (3/4) Epoch 3, batch 19100, loss[loss=0.3002, simple_loss=0.3452, pruned_loss=0.1276, over 21248.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3353, pruned_loss=0.1075, over 4279587.78 frames. ], batch size: 548, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:07:22,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=480534.0, ans=0.2 2023-06-19 14:08:05,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-19 14:08:37,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=480714.0, ans=0.125 2023-06-19 14:08:58,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=480774.0, ans=0.0 2023-06-19 14:09:11,220 INFO [train.py:996] (3/4) Epoch 3, batch 19150, loss[loss=0.249, simple_loss=0.3188, pruned_loss=0.08964, over 21219.00 frames. ], tot_loss[loss=0.276, simple_loss=0.337, pruned_loss=0.1075, over 4278873.82 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:10:31,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-19 14:10:43,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.009e+02 3.597e+02 4.510e+02 7.028e+02, threshold=7.194e+02, percent-clipped=0.0 2023-06-19 14:10:55,125 INFO [train.py:996] (3/4) Epoch 3, batch 19200, loss[loss=0.3046, simple_loss=0.4017, pruned_loss=0.1037, over 21728.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3496, pruned_loss=0.109, over 4271822.49 frames. 
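Across this stretch the learning rate creeps down (1.08e-02 -> 1.07e-02 -> 1.06e-02 over a few thousand batches), consistent with icefall's Eden schedule, in which the rate decays as a -0.25 power in both the optimizer-step count and the epoch. Sketched from Eden's usual definition; the step count plugged in below is an estimate, not a value read from this log:

```python
def eden_lr(base_lr: float, step: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# with base_lr=0.045, roughly 60k optimizer steps into epoch 3 this lands
# near the logged value:
print(f"{eden_lr(0.045, 60000, 3):.2e}")   # -> 1.06e-02
```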
], batch size: 332, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:11:10,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=481134.0, ans=0.0 2023-06-19 14:11:46,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=481254.0, ans=0.125 2023-06-19 14:11:54,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=481314.0, ans=0.0 2023-06-19 14:12:35,844 INFO [train.py:996] (3/4) Epoch 3, batch 19250, loss[loss=0.2355, simple_loss=0.3109, pruned_loss=0.08002, over 21366.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.346, pruned_loss=0.1018, over 4279958.32 frames. ], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:12:44,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=481434.0, ans=0.125 2023-06-19 14:13:08,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=481494.0, ans=0.0 2023-06-19 14:13:29,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=481614.0, ans=0.125 2023-06-19 14:13:45,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-19 14:13:54,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.436e+02 3.016e+02 3.592e+02 9.679e+02, threshold=6.032e+02, percent-clipped=2.0 2023-06-19 14:14:10,925 INFO [train.py:996] (3/4) Epoch 3, batch 19300, loss[loss=0.2825, simple_loss=0.3385, pruned_loss=0.1132, over 21904.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3416, pruned_loss=0.1003, over 4283897.09 frames. ], batch size: 107, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:14:54,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=481854.0, ans=0.125 2023-06-19 14:15:20,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=481914.0, ans=0.125 2023-06-19 14:15:38,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=481974.0, ans=0.125 2023-06-19 14:15:54,288 INFO [train.py:996] (3/4) Epoch 3, batch 19350, loss[loss=0.2157, simple_loss=0.2912, pruned_loss=0.07008, over 21473.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3361, pruned_loss=0.09633, over 4286071.32 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:17:13,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.752e+02 3.460e+02 4.444e+02 7.574e+02, threshold=6.920e+02, percent-clipped=6.0 2023-06-19 14:17:24,652 INFO [train.py:996] (3/4) Epoch 3, batch 19400, loss[loss=0.3352, simple_loss=0.3763, pruned_loss=0.147, over 21733.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3345, pruned_loss=0.09596, over 4286735.11 frames. 
], batch size: 508, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:17:26,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482334.0, ans=0.125 2023-06-19 14:17:59,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482394.0, ans=0.1 2023-06-19 14:18:17,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=482454.0, ans=0.0 2023-06-19 14:18:44,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=482574.0, ans=0.09899494936611666 2023-06-19 14:19:05,666 INFO [train.py:996] (3/4) Epoch 3, batch 19450, loss[loss=0.2645, simple_loss=0.3061, pruned_loss=0.1114, over 21487.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3339, pruned_loss=0.09877, over 4283894.87 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:19:50,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482754.0, ans=0.1 2023-06-19 14:20:23,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=482814.0, ans=0.125 2023-06-19 14:20:37,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.021e+02 3.528e+02 4.324e+02 6.786e+02, threshold=7.055e+02, percent-clipped=0.0 2023-06-19 14:20:52,466 INFO [train.py:996] (3/4) Epoch 3, batch 19500, loss[loss=0.2169, simple_loss=0.2669, pruned_loss=0.08341, over 21159.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3286, pruned_loss=0.09983, over 4279357.82 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:20:59,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-19 14:21:17,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482994.0, ans=0.1 2023-06-19 14:21:51,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483114.0, ans=0.1 2023-06-19 14:22:22,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=483174.0, ans=0.07 2023-06-19 14:22:32,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483174.0, ans=0.1 2023-06-19 14:22:34,956 INFO [train.py:996] (3/4) Epoch 3, batch 19550, loss[loss=0.2675, simple_loss=0.3506, pruned_loss=0.09222, over 21625.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3242, pruned_loss=0.09778, over 4273008.35 frames. 
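The logged batch size swings from a few dozen cuts to more than 700 across these entries because batches are assembled up to a total-duration budget rather than a fixed count, so many short utterances or few long ones fit in one batch; the DynamicBucketingSampler announced at startup additionally groups cuts of similar duration before batching. A toy duration-capped batcher showing why the count varies inversely with utterance length (the 900-second budget here is illustrative):

```python
def batches_by_duration(durations, max_duration=900.0):
    """Group items so each batch's total duration stays under the budget."""
    batch, total = [], 0.0
    for i, d in enumerate(durations):
        if batch and total + d > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(i)
        total += d
    if batch:                       # flush the final partial batch
        yield batch

short, long_ = [1.3] * 1000, [9.0] * 200
print(len(next(batches_by_duration(short))))   # 692 short cuts per batch
print(len(next(batches_by_duration(long_))))   # 100 long cuts per batch
```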
], batch size: 263, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:22:57,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=483294.0, ans=0.04949747468305833 2023-06-19 14:23:36,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483414.0, ans=0.125 2023-06-19 14:23:47,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=483414.0, ans=0.0 2023-06-19 14:23:47,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483414.0, ans=0.1 2023-06-19 14:23:59,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 14:24:06,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.887e+02 3.715e+02 4.750e+02 9.269e+02, threshold=7.430e+02, percent-clipped=4.0 2023-06-19 14:24:16,380 INFO [train.py:996] (3/4) Epoch 3, batch 19600, loss[loss=0.2817, simple_loss=0.3418, pruned_loss=0.1108, over 21748.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3265, pruned_loss=0.0986, over 4274877.27 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:24:35,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-19 14:25:05,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483654.0, ans=0.1 2023-06-19 14:25:14,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=483714.0, ans=0.5 2023-06-19 14:25:52,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=483774.0, ans=0.125 2023-06-19 14:25:58,811 INFO [train.py:996] (3/4) Epoch 3, batch 19650, loss[loss=0.2969, simple_loss=0.3633, pruned_loss=0.1152, over 21417.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.334, pruned_loss=0.1044, over 4278711.99 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:26:17,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=483894.0, ans=0.125 2023-06-19 14:27:34,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.992e+02 3.430e+02 3.953e+02 7.302e+02, threshold=6.859e+02, percent-clipped=0.0 2023-06-19 14:27:34,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=484074.0, ans=0.0 2023-06-19 14:27:44,511 INFO [train.py:996] (3/4) Epoch 3, batch 19700, loss[loss=0.2689, simple_loss=0.3366, pruned_loss=0.1006, over 21608.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3385, pruned_loss=0.1059, over 4279820.93 frames. ], batch size: 263, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:28:26,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=484194.0, ans=0.125 2023-06-19 14:28:42,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-19 14:29:05,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-19 14:29:14,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-19 14:29:33,029 INFO [train.py:996] (3/4) Epoch 3, batch 19750, loss[loss=0.2661, simple_loss=0.3604, pruned_loss=0.08591, over 21570.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3481, pruned_loss=0.1068, over 4285074.32 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:30:15,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=484554.0, ans=0.125 2023-06-19 14:31:05,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.206e+02 3.832e+02 4.660e+02 9.927e+02, threshold=7.664e+02, percent-clipped=2.0 2023-06-19 14:31:15,126 INFO [train.py:996] (3/4) Epoch 3, batch 19800, loss[loss=0.2793, simple_loss=0.3516, pruned_loss=0.1035, over 21521.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3468, pruned_loss=0.1077, over 4285273.95 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:31:25,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=484734.0, ans=0.2 2023-06-19 14:31:30,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484734.0, ans=0.1 2023-06-19 14:31:37,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=484734.0, ans=0.0 2023-06-19 14:31:47,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-19 14:32:12,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-19 14:33:01,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=485034.0, ans=0.125 2023-06-19 14:33:03,125 INFO [train.py:996] (3/4) Epoch 3, batch 19850, loss[loss=0.2242, simple_loss=0.3032, pruned_loss=0.07257, over 21599.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3388, pruned_loss=0.1017, over 4281389.24 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:33:28,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=485094.0, ans=0.125 2023-06-19 14:33:40,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485094.0, ans=0.1 2023-06-19 14:34:21,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=485274.0, ans=0.125 2023-06-19 14:34:29,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.663e+02 3.192e+02 3.932e+02 5.931e+02, threshold=6.384e+02, percent-clipped=0.0 2023-06-19 14:34:40,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2023-06-19 14:34:43,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=485334.0, ans=0.0 2023-06-19 14:34:45,233 INFO [train.py:996] (3/4) Epoch 3, batch 19900, loss[loss=0.2438, simple_loss=0.3153, pruned_loss=0.08611, over 21655.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.337, pruned_loss=0.09825, over 4279559.49 frames. ], batch size: 298, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:34:48,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=485334.0, ans=0.125 2023-06-19 14:35:01,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=485334.0, ans=0.125 2023-06-19 14:35:05,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-19 14:35:46,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485514.0, ans=0.125 2023-06-19 14:36:01,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=485514.0, ans=0.125 2023-06-19 14:36:33,272 INFO [train.py:996] (3/4) Epoch 3, batch 19950, loss[loss=0.2235, simple_loss=0.289, pruned_loss=0.07901, over 19953.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3316, pruned_loss=0.0986, over 4274199.16 frames. ], batch size: 702, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:36:57,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=485694.0, ans=0.2 2023-06-19 14:37:16,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=485754.0, ans=0.125 2023-06-19 14:37:17,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=485754.0, ans=0.125 2023-06-19 14:37:25,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=485814.0, ans=0.125 2023-06-19 14:37:59,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.893e+02 3.575e+02 4.384e+02 6.859e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-19 14:38:13,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485934.0, ans=0.1 2023-06-19 14:38:14,239 INFO [train.py:996] (3/4) Epoch 3, batch 20000, loss[loss=0.2943, simple_loss=0.3496, pruned_loss=0.1195, over 21849.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3332, pruned_loss=0.1002, over 4281198.02 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:39:54,948 INFO [train.py:996] (3/4) Epoch 3, batch 20050, loss[loss=0.2519, simple_loss=0.3165, pruned_loss=0.0936, over 21657.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3353, pruned_loss=0.1032, over 4286972.19 frames. 
], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:40:12,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=486294.0, ans=0.125 2023-06-19 14:41:27,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486474.0, ans=0.1 2023-06-19 14:41:28,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.834e+02 3.316e+02 3.890e+02 7.458e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-19 14:41:38,305 INFO [train.py:996] (3/4) Epoch 3, batch 20100, loss[loss=0.3307, simple_loss=0.419, pruned_loss=0.1213, over 20948.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3377, pruned_loss=0.1054, over 4290185.24 frames. ], batch size: 607, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:42:00,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=486594.0, ans=12.0 2023-06-19 14:42:08,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-06-19 14:43:14,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486774.0, ans=0.1 2023-06-19 14:43:27,775 INFO [train.py:996] (3/4) Epoch 3, batch 20150, loss[loss=0.2708, simple_loss=0.3359, pruned_loss=0.1028, over 21574.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3473, pruned_loss=0.1083, over 4289817.19 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:44:01,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486894.0, ans=0.125 2023-06-19 14:44:09,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=486954.0, ans=0.0 2023-06-19 14:44:13,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-19 14:45:05,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.281e+02 3.907e+02 5.074e+02 8.084e+02, threshold=7.814e+02, percent-clipped=7.0 2023-06-19 14:45:13,866 INFO [train.py:996] (3/4) Epoch 3, batch 20200, loss[loss=0.2482, simple_loss=0.3036, pruned_loss=0.09638, over 21856.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3527, pruned_loss=0.1117, over 4282945.99 frames. ], batch size: 107, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:45:27,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=487134.0, ans=0.125 2023-06-19 14:46:17,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=487314.0, ans=0.125 2023-06-19 14:46:34,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-19 14:47:00,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-19 14:47:01,479 INFO [train.py:996] (3/4) Epoch 3, batch 20250, loss[loss=0.2762, simple_loss=0.3549, pruned_loss=0.09873, over 21407.00 frames. 
], tot_loss[loss=0.2876, simple_loss=0.3547, pruned_loss=0.1102, over 4281208.23 frames. ], batch size: 548, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:47:05,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-19 14:47:13,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487434.0, ans=0.1 2023-06-19 14:47:51,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=487554.0, ans=0.125 2023-06-19 14:48:04,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=487614.0, ans=0.125 2023-06-19 14:48:06,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=487614.0, ans=0.125 2023-06-19 14:48:17,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=487614.0, ans=0.0 2023-06-19 14:48:29,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.865e+02 3.489e+02 4.461e+02 6.612e+02, threshold=6.978e+02, percent-clipped=0.0 2023-06-19 14:48:43,465 INFO [train.py:996] (3/4) Epoch 3, batch 20300, loss[loss=0.278, simple_loss=0.3203, pruned_loss=0.1178, over 20012.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3507, pruned_loss=0.1061, over 4278423.73 frames. ], batch size: 704, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:49:01,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=487794.0, ans=0.0 2023-06-19 14:49:16,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=487794.0, ans=0.0 2023-06-19 14:49:59,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-19 14:50:24,315 INFO [train.py:996] (3/4) Epoch 3, batch 20350, loss[loss=0.2799, simple_loss=0.349, pruned_loss=0.1053, over 21431.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3505, pruned_loss=0.1066, over 4269937.92 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:50:27,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=488034.0, ans=0.125 2023-06-19 14:50:45,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=488094.0, ans=0.0 2023-06-19 14:51:16,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=488154.0, ans=0.2 2023-06-19 14:51:25,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-19 14:51:41,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=488214.0, ans=0.0 2023-06-19 14:51:51,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.942e+02 3.646e+02 4.954e+02 9.108e+02, threshold=7.293e+02, percent-clipped=8.0 2023-06-19 14:52:05,465 INFO [train.py:996] (3/4) Epoch 3, batch 20400, loss[loss=0.2179, simple_loss=0.2854, pruned_loss=0.07522, over 16765.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3523, pruned_loss=0.1095, over 4246269.59 frames. ], batch size: 62, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:52:40,571 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:53:21,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=488574.0, ans=0.125 2023-06-19 14:53:22,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=488574.0, ans=0.125 2023-06-19 14:53:42,576 INFO [train.py:996] (3/4) Epoch 3, batch 20450, loss[loss=0.3017, simple_loss=0.353, pruned_loss=0.1252, over 21679.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3534, pruned_loss=0.1116, over 4224540.89 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:54:33,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=488754.0, ans=0.125 2023-06-19 14:55:09,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=488874.0, ans=0.0 2023-06-19 14:55:15,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.996e+02 3.435e+02 4.162e+02 7.102e+02, threshold=6.869e+02, percent-clipped=0.0 2023-06-19 14:55:22,276 INFO [train.py:996] (3/4) Epoch 3, batch 20500, loss[loss=0.2733, simple_loss=0.3291, pruned_loss=0.1087, over 21623.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3494, pruned_loss=0.1118, over 4233116.31 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:57:02,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=489174.0, ans=0.125 2023-06-19 14:57:05,156 INFO [train.py:996] (3/4) Epoch 3, batch 20550, loss[loss=0.3118, simple_loss=0.3839, pruned_loss=0.1199, over 21478.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3441, pruned_loss=0.1105, over 4237635.90 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:58:39,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 2.727e+02 3.175e+02 3.881e+02 7.747e+02, threshold=6.350e+02, percent-clipped=1.0 2023-06-19 14:58:45,942 INFO [train.py:996] (3/4) Epoch 3, batch 20600, loss[loss=0.3682, simple_loss=0.4204, pruned_loss=0.158, over 21503.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3457, pruned_loss=0.108, over 4226462.52 frames. ], batch size: 507, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:58:52,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=489534.0, ans=0.1 2023-06-19 14:58:58,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.02 vs. 
limit=15.0 2023-06-19 14:59:02,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=489594.0, ans=0.125 2023-06-19 14:59:29,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489654.0, ans=0.125 2023-06-19 14:59:55,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=489714.0, ans=10.0 2023-06-19 15:00:06,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489714.0, ans=0.1 2023-06-19 15:00:24,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=489774.0, ans=0.0 2023-06-19 15:00:27,222 INFO [train.py:996] (3/4) Epoch 3, batch 20650, loss[loss=0.2578, simple_loss=0.3066, pruned_loss=0.1044, over 21873.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3407, pruned_loss=0.1081, over 4231551.73 frames. ], batch size: 98, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:00:34,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=489834.0, ans=0.2 2023-06-19 15:00:37,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-19 15:00:47,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=489894.0, ans=0.035 2023-06-19 15:01:09,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=489954.0, ans=0.2 2023-06-19 15:01:38,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. limit=6.0 2023-06-19 15:02:03,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.806e+02 3.239e+02 3.703e+02 6.671e+02, threshold=6.478e+02, percent-clipped=1.0 2023-06-19 15:02:07,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=490074.0, ans=0.125 2023-06-19 15:02:10,367 INFO [train.py:996] (3/4) Epoch 3, batch 20700, loss[loss=0.2638, simple_loss=0.327, pruned_loss=0.1003, over 21640.00 frames. ], tot_loss[loss=0.272, simple_loss=0.334, pruned_loss=0.105, over 4243319.78 frames. 
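The optim.py:471 entries report a five-number summary of recent gradient norms next to a clipping threshold. In this section the threshold consistently equals Clipping_scale times the middle quartile (e.g. 2.0 x 3.239e+02 = 6.478e+02 just above), and percent-clipped reads as the share of recent batches whose norm exceeded it. A hedged sketch of that bookkeeping, not the exact icefall optimizer:

```python
import torch

# Sketch: summarize recent gradient norms by quartiles and clip any gradient
# whose norm exceeds clipping_scale * median.  The threshold = 2 * median
# relationship matches the log entries above; the rest is an assumption.
def clip_like_log(parameters, recent_norms, clipping_scale=2.0):
    norms = torch.tensor(recent_norms)
    q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()   # 2.0 * median, as in the log
    total = torch.nn.utils.clip_grad_norm_(parameters, max_norm=threshold)
    pct = (100.0 * (norms > threshold).float().mean()).item()
    print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
          + " ".join(f"{v:.3e}" for v in q.tolist())
          + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")
    return total

p = torch.nn.Parameter(torch.randn(10))
p.grad = torch.randn(10)
clip_like_log([p], recent_norms=[212.0, 281.0, 324.0, 370.0, 667.0])
```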
], batch size: 263, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:02:18,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=490134.0, ans=0.125 2023-06-19 15:02:31,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=490194.0, ans=0.0 2023-06-19 15:02:52,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=490254.0, ans=0.0 2023-06-19 15:02:54,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=490254.0, ans=0.125 2023-06-19 15:03:21,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490314.0, ans=0.0 2023-06-19 15:03:42,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-19 15:03:48,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=490374.0, ans=0.04949747468305833 2023-06-19 15:03:50,827 INFO [train.py:996] (3/4) Epoch 3, batch 20750, loss[loss=0.278, simple_loss=0.3652, pruned_loss=0.09544, over 21656.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3349, pruned_loss=0.1035, over 4246289.95 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:04:16,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490494.0, ans=0.0 2023-06-19 15:04:33,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-19 15:05:16,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490674.0, ans=0.125 2023-06-19 15:05:17,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=490674.0, ans=0.125 2023-06-19 15:05:27,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.202e+02 3.808e+02 5.093e+02 1.097e+03, threshold=7.616e+02, percent-clipped=4.0 2023-06-19 15:05:33,779 INFO [train.py:996] (3/4) Epoch 3, batch 20800, loss[loss=0.2666, simple_loss=0.3195, pruned_loss=0.1069, over 21560.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3406, pruned_loss=0.1053, over 4256761.44 frames. 
], batch size: 414, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:05:44,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=490734.0, ans=0.125 2023-06-19 15:05:54,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=490794.0, ans=0.0 2023-06-19 15:06:15,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=490854.0, ans=0.0 2023-06-19 15:06:41,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=490914.0, ans=0.2 2023-06-19 15:06:54,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=490974.0, ans=0.04949747468305833 2023-06-19 15:07:10,047 INFO [train.py:996] (3/4) Epoch 3, batch 20850, loss[loss=0.1962, simple_loss=0.27, pruned_loss=0.06121, over 21680.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3351, pruned_loss=0.1046, over 4258049.62 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:07:12,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=491034.0, ans=0.0 2023-06-19 15:08:07,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491154.0, ans=0.125 2023-06-19 15:08:39,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=491274.0, ans=0.07 2023-06-19 15:08:45,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.010e+02 3.845e+02 4.738e+02 1.149e+03, threshold=7.690e+02, percent-clipped=6.0 2023-06-19 15:08:56,662 INFO [train.py:996] (3/4) Epoch 3, batch 20900, loss[loss=0.2978, simple_loss=0.3708, pruned_loss=0.1124, over 21351.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3351, pruned_loss=0.1058, over 4268713.18 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:10:12,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=491514.0, ans=0.0 2023-06-19 15:10:20,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=491574.0, ans=0.125 2023-06-19 15:10:31,159 INFO [train.py:996] (3/4) Epoch 3, batch 20950, loss[loss=0.2125, simple_loss=0.2876, pruned_loss=0.06871, over 21690.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3312, pruned_loss=0.1018, over 4268738.90 frames. 
], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:10:31,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=491634.0, ans=0.125 2023-06-19 15:10:42,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=491634.0, ans=0.2 2023-06-19 15:11:05,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=491694.0, ans=0.0 2023-06-19 15:11:14,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=491754.0, ans=0.5 2023-06-19 15:11:28,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=491754.0, ans=0.125 2023-06-19 15:11:41,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=491814.0, ans=0.0 2023-06-19 15:12:04,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.584e+02 3.033e+02 4.070e+02 6.900e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:12:11,005 INFO [train.py:996] (3/4) Epoch 3, batch 21000, loss[loss=0.2301, simple_loss=0.2979, pruned_loss=0.08117, over 21263.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.33, pruned_loss=0.1019, over 4274032.03 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:12:11,005 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 15:12:29,323 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3805, pruned_loss=0.08847, over 1796401.00 frames. 2023-06-19 15:12:29,324 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB 2023-06-19 15:13:34,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492114.0, ans=0.125 2023-06-19 15:14:05,447 INFO [train.py:996] (3/4) Epoch 3, batch 21050, loss[loss=0.2866, simple_loss=0.3326, pruned_loss=0.1203, over 21242.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3273, pruned_loss=0.1024, over 4281408.46 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:15:06,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=492354.0, ans=0.125 2023-06-19 15:15:24,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-19 15:15:36,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492474.0, ans=0.1 2023-06-19 15:15:39,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.805e+02 3.251e+02 3.947e+02 6.448e+02, threshold=6.502e+02, percent-clipped=2.0 2023-06-19 15:15:41,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.23 vs. limit=5.0 2023-06-19 15:15:45,679 INFO [train.py:996] (3/4) Epoch 3, batch 21100, loss[loss=0.2413, simple_loss=0.2943, pruned_loss=0.09415, over 21179.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3232, pruned_loss=0.1015, over 4281278.59 frames. 
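The train.py:1019/1028/1029 entries above show the periodic validation pass: training pauses, the dev set is scored with gradients disabled, and the peak CUDA memory high-water mark is reported (24414MB at this point in the run). A minimal sketch of that block; compute_loss is a hypothetical stand-in for the recipe's actual loss function:

```python
import torch

def validate(model, dev_loader, device, compute_loss):
    # compute_loss is hypothetical: returns (summed loss tensor, num_frames).
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()  # resume training mode after scoring the dev set
    mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4g}, "
          f"over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {mem_mb}MB")
```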
], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:16:05,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492534.0, ans=0.1 2023-06-19 15:16:43,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=492654.0, ans=0.025 2023-06-19 15:16:58,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=492714.0, ans=0.0 2023-06-19 15:17:03,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=492714.0, ans=0.0 2023-06-19 15:17:14,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492774.0, ans=0.125 2023-06-19 15:17:27,191 INFO [train.py:996] (3/4) Epoch 3, batch 21150, loss[loss=0.2661, simple_loss=0.3113, pruned_loss=0.1104, over 21686.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3205, pruned_loss=0.1015, over 4271889.46 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:18:06,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=492894.0, ans=0.07 2023-06-19 15:19:03,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.674e+02 3.023e+02 3.632e+02 5.729e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 15:19:13,122 INFO [train.py:996] (3/4) Epoch 3, batch 21200, loss[loss=0.2313, simple_loss=0.2852, pruned_loss=0.08874, over 21198.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3158, pruned_loss=0.1, over 4273119.41 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:19:27,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-19 15:19:41,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493194.0, ans=0.125 2023-06-19 15:19:52,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493194.0, ans=0.125 2023-06-19 15:20:09,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-06-19 15:20:10,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2023-06-19 15:20:49,038 INFO [train.py:996] (3/4) Epoch 3, batch 21250, loss[loss=0.3706, simple_loss=0.4219, pruned_loss=0.1597, over 21730.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.316, pruned_loss=0.1015, over 4257801.20 frames. ], batch size: 415, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:20:49,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-19 15:21:01,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.51 vs. 
limit=15.0 2023-06-19 15:21:08,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=493434.0, ans=0.0 2023-06-19 15:21:39,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-19 15:21:54,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493554.0, ans=0.1 2023-06-19 15:22:24,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.377e+02 3.991e+02 5.475e+02 9.358e+02, threshold=7.981e+02, percent-clipped=20.0 2023-06-19 15:22:29,360 INFO [train.py:996] (3/4) Epoch 3, batch 21300, loss[loss=0.2957, simple_loss=0.3783, pruned_loss=0.1066, over 19772.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3231, pruned_loss=0.1039, over 4262017.59 frames. ], batch size: 704, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:22:58,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-19 15:23:13,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-19 15:23:15,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-19 15:23:59,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-19 15:24:03,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=493974.0, ans=22.5 2023-06-19 15:24:17,167 INFO [train.py:996] (3/4) Epoch 3, batch 21350, loss[loss=0.2361, simple_loss=0.3295, pruned_loss=0.07136, over 21264.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3273, pruned_loss=0.1041, over 4254987.79 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:24:48,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=494094.0, ans=0.0 2023-06-19 15:25:18,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=494154.0, ans=0.125 2023-06-19 15:25:35,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=494214.0, ans=0.04949747468305833 2023-06-19 15:25:50,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=494274.0, ans=0.125 2023-06-19 15:25:54,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.815e+02 3.178e+02 3.883e+02 6.278e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-19 15:26:03,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=494334.0, ans=0.125 2023-06-19 15:26:10,344 INFO [train.py:996] (3/4) Epoch 3, batch 21400, loss[loss=0.2945, simple_loss=0.3594, pruned_loss=0.1148, over 21380.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3285, pruned_loss=0.1017, over 4251610.38 frames. 
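The Whitening entries above (scaling.py:962) compare a per-module statistic against a limit (e.g. metric=10.05 vs. limit=22.5). A natural reading, and the assumption behind this sketch, is a measure that equals 1.0 when a group's feature covariance is a multiple of the identity (perfectly "white") and grows with anisotropy, with a corrective penalty applied only when the limit is exceeded; this mirrors icefall's Whiten module in spirit only:

```python
import torch

# Hedged sketch of a whitening metric: 1.0 for covariance proportional to
# the identity, larger as the per-group feature covariance becomes skewed.
def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    n, c = x.shape                       # (num_frames, num_channels)
    cg = c // num_groups
    x = x.reshape(n, num_groups, cg).transpose(0, 1)   # (groups, frames, cg)
    x = x - x.mean(dim=1, keepdim=True)
    covar = torch.matmul(x.transpose(1, 2), x) / n     # per-group covariance
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean(dim=1)
    metric = (covar ** 2).mean(dim=(1, 2)) * cg / (mean_diag ** 2)
    return metric.mean()

x = torch.randn(1000, 256)   # roughly white input
print(whitening_metric(x))   # ~1.0, comfortably under a limit of e.g. 15.0
```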
], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:26:20,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=494334.0, ans=0.025 2023-06-19 15:26:22,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-19 15:26:41,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=494394.0, ans=0.125 2023-06-19 15:27:27,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=494574.0, ans=0.0 2023-06-19 15:27:28,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=494574.0, ans=0.125 2023-06-19 15:27:45,364 INFO [train.py:996] (3/4) Epoch 3, batch 21450, loss[loss=0.2638, simple_loss=0.3233, pruned_loss=0.1021, over 21299.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3312, pruned_loss=0.1026, over 4258738.99 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:28:50,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=494814.0, ans=0.0 2023-06-19 15:28:53,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=494814.0, ans=0.0 2023-06-19 15:29:00,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=22.5 2023-06-19 15:29:21,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.895e+02 3.297e+02 3.892e+02 6.030e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-19 15:29:21,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=494874.0, ans=0.125 2023-06-19 15:29:31,457 INFO [train.py:996] (3/4) Epoch 3, batch 21500, loss[loss=0.2742, simple_loss=0.3239, pruned_loss=0.1123, over 21681.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3296, pruned_loss=0.1044, over 4265972.88 frames. ], batch size: 393, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:30:16,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=495054.0, ans=0.0 2023-06-19 15:30:22,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495114.0, ans=0.1 2023-06-19 15:30:53,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=495174.0, ans=0.125 2023-06-19 15:31:04,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=495174.0, ans=10.0 2023-06-19 15:31:06,822 INFO [train.py:996] (3/4) Epoch 3, batch 21550, loss[loss=0.2437, simple_loss=0.2989, pruned_loss=0.0942, over 21150.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3213, pruned_loss=0.1011, over 4268032.54 frames. 
], batch size: 176, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:31:19,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=495234.0, ans=0.04949747468305833 2023-06-19 15:32:48,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.022e+02 3.623e+02 4.651e+02 8.178e+02, threshold=7.247e+02, percent-clipped=4.0 2023-06-19 15:32:57,594 INFO [train.py:996] (3/4) Epoch 3, batch 21600, loss[loss=0.2592, simple_loss=0.3715, pruned_loss=0.07341, over 19660.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3172, pruned_loss=0.09884, over 4269657.07 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:34:04,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-19 15:34:35,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495774.0, ans=0.1 2023-06-19 15:34:39,689 INFO [train.py:996] (3/4) Epoch 3, batch 21650, loss[loss=0.227, simple_loss=0.3101, pruned_loss=0.072, over 21119.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3214, pruned_loss=0.09567, over 4274785.68 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:34:43,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-06-19 15:35:41,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=496014.0, ans=0.035 2023-06-19 15:35:46,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=496014.0, ans=0.125 2023-06-19 15:35:55,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=496074.0, ans=0.0 2023-06-19 15:35:55,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=496074.0, ans=0.125 2023-06-19 15:36:08,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496074.0, ans=0.1 2023-06-19 15:36:18,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.153e+02 4.282e+02 5.472e+02 1.270e+03, threshold=8.565e+02, percent-clipped=5.0 2023-06-19 15:36:20,158 INFO [train.py:996] (3/4) Epoch 3, batch 21700, loss[loss=0.2103, simple_loss=0.2919, pruned_loss=0.06436, over 21286.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3228, pruned_loss=0.09455, over 4274123.24 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:37:07,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=496254.0, ans=0.04949747468305833 2023-06-19 15:37:53,980 INFO [train.py:996] (3/4) Epoch 3, batch 21750, loss[loss=0.246, simple_loss=0.2945, pruned_loss=0.09875, over 21953.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3196, pruned_loss=0.09585, over 4275825.60 frames. 
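Each train.py:996 entry above pairs a single-batch loss[...] with a tot_loss[...] aggregated over roughly 4.27M frames, which is why tot_loss moves smoothly while the per-batch numbers jump around. A sketch of the frame-weighted aggregation that would produce that pattern; icefall's actual tracker may differ in detail:

```python
# Hedged sketch: per-batch losses are frame-weighted into a running tracker,
# so tot_loss is an average over the recent window, not a single batch.
class RunningLoss:
    def __init__(self):
        self.frames = 0.0
        self.weighted = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, losses: dict, num_frames: float) -> None:
        self.frames += num_frames
        for k, v in losses.items():
            self.weighted[k] += v * num_frames

    def summary(self) -> str:
        avg = {k: v / self.frames for k, v in self.weighted.items()}
        return (f"tot_loss[loss={avg['loss']:.4g}, "
                f"simple_loss={avg['simple_loss']:.4g}, "
                f"pruned_loss={avg['pruned_loss']:.4g}, "
                f"over {self.frames:.2f} frames.]")

tracker = RunningLoss()
tracker.update({"loss": 0.246, "simple_loss": 0.2945, "pruned_loss": 0.09875},
               21953.0)
print(tracker.summary())
```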
], batch size: 103, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:38:29,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=496494.0, ans=0.2 2023-06-19 15:38:37,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=496554.0, ans=0.5 2023-06-19 15:38:48,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 15:39:04,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=496614.0, ans=0.5 2023-06-19 15:39:16,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=496674.0, ans=0.125 2023-06-19 15:39:34,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.772e+02 3.220e+02 4.013e+02 6.187e+02, threshold=6.439e+02, percent-clipped=0.0 2023-06-19 15:39:34,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=496734.0, ans=0.0 2023-06-19 15:39:41,168 INFO [train.py:996] (3/4) Epoch 3, batch 21800, loss[loss=0.2845, simple_loss=0.3217, pruned_loss=0.1237, over 21232.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3188, pruned_loss=0.09818, over 4279418.20 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:40:04,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496794.0, ans=0.125 2023-06-19 15:40:11,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=496794.0, ans=0.0 2023-06-19 15:41:07,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=496974.0, ans=0.0 2023-06-19 15:41:23,658 INFO [train.py:996] (3/4) Epoch 3, batch 21850, loss[loss=0.2432, simple_loss=0.3155, pruned_loss=0.08547, over 21811.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.324, pruned_loss=0.09933, over 4270293.51 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:41:37,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. 
limit=10.0 2023-06-19 15:41:46,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=497094.0, ans=0.0 2023-06-19 15:42:02,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=497154.0, ans=0.0 2023-06-19 15:42:05,421 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:42:07,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=497154.0, ans=0.05 2023-06-19 15:42:29,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497214.0, ans=0.125 2023-06-19 15:43:07,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.253e+02 3.918e+02 5.054e+02 8.247e+02, threshold=7.836e+02, percent-clipped=6.0 2023-06-19 15:43:08,871 INFO [train.py:996] (3/4) Epoch 3, batch 21900, loss[loss=0.2476, simple_loss=0.3011, pruned_loss=0.09703, over 21756.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3251, pruned_loss=0.1001, over 4260851.99 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:43:53,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=497454.0, ans=0.95 2023-06-19 15:44:49,932 INFO [train.py:996] (3/4) Epoch 3, batch 21950, loss[loss=0.2161, simple_loss=0.2981, pruned_loss=0.06709, over 20841.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3192, pruned_loss=0.09775, over 4263172.52 frames. ], batch size: 608, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:45:41,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-19 15:45:46,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=497814.0, ans=0.0 2023-06-19 15:46:25,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497874.0, ans=0.1 2023-06-19 15:46:29,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.53 vs. limit=6.0 2023-06-19 15:46:31,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.810e+02 3.300e+02 3.802e+02 6.470e+02, threshold=6.601e+02, percent-clipped=0.0 2023-06-19 15:46:31,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=497934.0, ans=0.1 2023-06-19 15:46:32,692 INFO [train.py:996] (3/4) Epoch 3, batch 22000, loss[loss=0.2596, simple_loss=0.3119, pruned_loss=0.1037, over 21389.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3139, pruned_loss=0.09538, over 4256048.45 frames. ], batch size: 131, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:46:48,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=15.0 2023-06-19 15:47:22,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=498054.0, ans=0.0 2023-06-19 15:47:39,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=498114.0, ans=0.125 2023-06-19 15:47:46,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=498114.0, ans=0.2 2023-06-19 15:47:49,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-19 15:48:16,229 INFO [train.py:996] (3/4) Epoch 3, batch 22050, loss[loss=0.3693, simple_loss=0.4831, pruned_loss=0.1277, over 19848.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3227, pruned_loss=0.09836, over 4253712.49 frames. ], batch size: 702, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:49:08,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=498354.0, ans=0.125 2023-06-19 15:49:18,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=8.0 2023-06-19 15:49:25,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=498414.0, ans=0.125 2023-06-19 15:49:53,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=498474.0, ans=0.2 2023-06-19 15:49:57,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:49:58,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.520e+02 4.276e+02 5.874e+02 8.679e+02, threshold=8.552e+02, percent-clipped=13.0 2023-06-19 15:49:58,802 INFO [train.py:996] (3/4) Epoch 3, batch 22100, loss[loss=0.2849, simple_loss=0.3405, pruned_loss=0.1146, over 21950.00 frames. ], tot_loss[loss=0.271, simple_loss=0.333, pruned_loss=0.1045, over 4257437.92 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:50:07,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=498534.0, ans=0.0 2023-06-19 15:50:15,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=498594.0, ans=0.0 2023-06-19 15:50:17,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=498594.0, ans=0.125 2023-06-19 15:50:39,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-19 15:51:13,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=498714.0, ans=0.0 2023-06-19 15:51:38,265 INFO [train.py:996] (3/4) Epoch 3, batch 22150, loss[loss=0.2705, simple_loss=0.3394, pruned_loss=0.1008, over 21901.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3352, pruned_loss=0.1063, over 4268478.21 frames. 
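The learning rate in these entries creeps from 1.06e-02 down to 1.04e-02 over thousands of batches. That pace is consistent with an Eden-style schedule, in which the lr shrinks with both the global batch index and the fractional epoch; the base_lr, lr_batches, lr_epochs, and batch index below are assumptions chosen only to land in the same ballpark, not values read off this run:

```python
# Hedged sketch of an Eden-style learning-rate schedule (assumed, not
# confirmed from this log): lr decays smoothly in both batches and epochs.
def eden_lr(base_lr, batch_idx, epoch, lr_batches=7500.0, lr_epochs=1.5):
    batch_factor = ((batch_idx / lr_batches) ** 2 + 1.0) ** -0.25
    epoch_factor = ((epoch / lr_epochs) ** 2 + 1.0) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"lr: {eden_lr(0.045, 64000, 3.0):.2e}")  # ~1.0e-02, same ballpark
```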
], batch size: 124, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:52:55,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=499014.0, ans=0.04949747468305833 2023-06-19 15:53:14,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=499074.0, ans=0.125 2023-06-19 15:53:18,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.814e+02 3.451e+02 4.432e+02 8.221e+02, threshold=6.902e+02, percent-clipped=0.0 2023-06-19 15:53:18,963 INFO [train.py:996] (3/4) Epoch 3, batch 22200, loss[loss=0.2464, simple_loss=0.3111, pruned_loss=0.09087, over 21666.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3359, pruned_loss=0.1067, over 4277048.54 frames. ], batch size: 263, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:53:19,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=499134.0, ans=0.125 2023-06-19 15:54:18,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=499314.0, ans=0.0 2023-06-19 15:55:01,247 INFO [train.py:996] (3/4) Epoch 3, batch 22250, loss[loss=0.311, simple_loss=0.3766, pruned_loss=0.1227, over 21909.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3442, pruned_loss=0.1086, over 4278055.55 frames. ], batch size: 316, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:55:19,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=499494.0, ans=0.0 2023-06-19 15:55:31,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-19 15:56:31,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=499674.0, ans=0.0 2023-06-19 15:56:41,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.175e+02 3.826e+02 4.848e+02 6.426e+02, threshold=7.653e+02, percent-clipped=0.0 2023-06-19 15:56:41,214 INFO [train.py:996] (3/4) Epoch 3, batch 22300, loss[loss=0.2752, simple_loss=0.3336, pruned_loss=0.1084, over 21901.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3464, pruned_loss=0.1116, over 4282260.27 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:56:46,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=499734.0, ans=0.0 2023-06-19 15:57:57,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=499914.0, ans=0.5 2023-06-19 15:58:21,538 INFO [train.py:996] (3/4) Epoch 3, batch 22350, loss[loss=0.2305, simple_loss=0.2977, pruned_loss=0.08169, over 21471.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3436, pruned_loss=0.1118, over 4287962.09 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:58:26,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=500034.0, ans=0.05 2023-06-19 15:58:28,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=12.0 2023-06-19 15:58:34,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-19 15:59:24,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=500214.0, ans=0.2 2023-06-19 15:59:29,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=500214.0, ans=0.125 2023-06-19 15:59:48,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500274.0, ans=0.1 2023-06-19 15:59:54,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=500274.0, ans=0.125 2023-06-19 16:00:02,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.767e+02 3.274e+02 4.023e+02 7.731e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-19 16:00:02,216 INFO [train.py:996] (3/4) Epoch 3, batch 22400, loss[loss=0.236, simple_loss=0.3035, pruned_loss=0.08424, over 21511.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3386, pruned_loss=0.1068, over 4275080.38 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:00:19,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2023-06-19 16:00:33,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=500394.0, ans=10.0 2023-06-19 16:00:38,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=500454.0, ans=0.125 2023-06-19 16:01:02,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500454.0, ans=0.125 2023-06-19 16:01:22,491 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:01:31,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=500574.0, ans=0.02 2023-06-19 16:01:40,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-19 16:01:41,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=500634.0, ans=0.0 2023-06-19 16:01:42,898 INFO [train.py:996] (3/4) Epoch 3, batch 22450, loss[loss=0.2571, simple_loss=0.3099, pruned_loss=0.1022, over 21519.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3338, pruned_loss=0.1066, over 4261606.22 frames. ], batch size: 391, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:01:51,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=500634.0, ans=0.0 2023-06-19 16:02:34,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=500754.0, ans=0.125 2023-06-19 16:03:19,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2023-06-19 16:03:27,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 3.119e+02 3.860e+02 5.027e+02 1.347e+03, threshold=7.719e+02, percent-clipped=7.0 2023-06-19 16:03:27,152 INFO [train.py:996] (3/4) Epoch 3, batch 22500, loss[loss=0.2475, simple_loss=0.2965, pruned_loss=0.0992, over 21231.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3283, pruned_loss=0.1059, over 4261708.33 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:04:05,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=500994.0, ans=0.125 2023-06-19 16:04:33,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501054.0, ans=0.1 2023-06-19 16:04:58,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501174.0, ans=0.1 2023-06-19 16:05:03,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=501174.0, ans=0.125 2023-06-19 16:05:10,035 INFO [train.py:996] (3/4) Epoch 3, batch 22550, loss[loss=0.2944, simple_loss=0.3493, pruned_loss=0.1197, over 21865.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3349, pruned_loss=0.1071, over 4274100.90 frames. ], batch size: 371, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:05:37,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=501294.0, ans=0.04949747468305833 2023-06-19 16:05:48,050 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:05:48,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=501294.0, ans=0.125 2023-06-19 16:06:17,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=501354.0, ans=0.125 2023-06-19 16:06:30,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=501414.0, ans=0.125 2023-06-19 16:07:06,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.038e+02 3.713e+02 4.829e+02 9.473e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-19 16:07:06,045 INFO [train.py:996] (3/4) Epoch 3, batch 22600, loss[loss=0.3663, simple_loss=0.4276, pruned_loss=0.1524, over 21514.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3388, pruned_loss=0.1085, over 4284193.55 frames. ], batch size: 471, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:07:07,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-19 16:08:08,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-19 16:08:26,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=501774.0, ans=0.125 2023-06-19 16:08:40,605 INFO [train.py:996] (3/4) Epoch 3, batch 22650, loss[loss=0.2542, simple_loss=0.3007, pruned_loss=0.1039, over 21826.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.336, pruned_loss=0.1072, over 4267897.46 frames. 
], batch size: 107, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:09:10,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-19 16:09:55,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=502014.0, ans=0.125 2023-06-19 16:10:23,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.918e+02 3.422e+02 4.341e+02 8.662e+02, threshold=6.843e+02, percent-clipped=1.0 2023-06-19 16:10:23,390 INFO [train.py:996] (3/4) Epoch 3, batch 22700, loss[loss=0.2866, simple_loss=0.3268, pruned_loss=0.1232, over 21653.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3271, pruned_loss=0.1058, over 4271279.17 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:10:23,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=502134.0, ans=0.0 2023-06-19 16:10:34,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-06-19 16:10:59,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=502194.0, ans=0.0 2023-06-19 16:11:33,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=502314.0, ans=0.125 2023-06-19 16:12:04,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=502434.0, ans=0.125 2023-06-19 16:12:04,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=502434.0, ans=0.2 2023-06-19 16:12:10,387 INFO [train.py:996] (3/4) Epoch 3, batch 22750, loss[loss=0.3258, simple_loss=0.3689, pruned_loss=0.1414, over 21480.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3295, pruned_loss=0.1088, over 4269440.98 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:12:16,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.08 vs. limit=15.0 2023-06-19 16:12:27,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=502434.0, ans=0.0 2023-06-19 16:12:48,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=502494.0, ans=0.04949747468305833 2023-06-19 16:13:17,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=502614.0, ans=0.125 2023-06-19 16:13:24,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=502674.0, ans=0.125 2023-06-19 16:13:38,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=502674.0, ans=0.125 2023-06-19 16:13:51,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.384e+02 3.991e+02 5.040e+02 7.219e+02, threshold=7.983e+02, percent-clipped=3.0 2023-06-19 16:13:51,354 INFO [train.py:996] (3/4) Epoch 3, batch 22800, loss[loss=0.2454, simple_loss=0.3093, pruned_loss=0.09078, over 21831.00 frames. 
], tot_loss[loss=0.2782, simple_loss=0.334, pruned_loss=0.1112, over 4275933.28 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:14:23,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502794.0, ans=0.1 2023-06-19 16:14:30,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502794.0, ans=0.1 2023-06-19 16:14:50,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=502914.0, ans=0.0 2023-06-19 16:15:16,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=502974.0, ans=0.0 2023-06-19 16:15:32,970 INFO [train.py:996] (3/4) Epoch 3, batch 22850, loss[loss=0.2864, simple_loss=0.3362, pruned_loss=0.1183, over 21759.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.33, pruned_loss=0.1097, over 4276096.91 frames. ], batch size: 371, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:15:40,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-19 16:17:16,733 INFO [train.py:996] (3/4) Epoch 3, batch 22900, loss[loss=0.2712, simple_loss=0.3753, pruned_loss=0.08352, over 21663.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3336, pruned_loss=0.1086, over 4273232.80 frames. ], batch size: 389, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:17:18,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.187e+02 3.862e+02 4.458e+02 8.142e+02, threshold=7.724e+02, percent-clipped=1.0 2023-06-19 16:17:42,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=503394.0, ans=10.0 2023-06-19 16:17:56,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-19 16:18:03,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-06-19 16:18:50,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-19 16:19:04,704 INFO [train.py:996] (3/4) Epoch 3, batch 22950, loss[loss=0.3395, simple_loss=0.4533, pruned_loss=0.1128, over 21630.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3483, pruned_loss=0.1066, over 4272999.94 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:19:45,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=503754.0, ans=0.0 2023-06-19 16:20:08,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=503814.0, ans=0.0 2023-06-19 16:20:08,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=503814.0, ans=0.125 2023-06-19 16:20:45,475 INFO [train.py:996] (3/4) Epoch 3, batch 23000, loss[loss=0.2624, simple_loss=0.3268, pruned_loss=0.09901, over 21832.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3467, pruned_loss=0.104, over 4280219.71 frames. 
2023-06-19 16:20:45,475 INFO [train.py:996] (3/4) Epoch 3, batch 23000, loss[loss=0.2624, simple_loss=0.3268, pruned_loss=0.09901, over 21832.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3467, pruned_loss=0.104, over 4280219.71 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 16.0
2023-06-19 16:20:51,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.906e+02 3.294e+02 4.043e+02 6.729e+02, threshold=6.588e+02, percent-clipped=0.0
2023-06-19 16:21:02,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=503934.0, ans=0.125
2023-06-19 16:21:10,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5
2023-06-19 16:21:18,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=503994.0, ans=0.0
2023-06-19 16:21:39,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=504054.0, ans=0.04949747468305833
2023-06-19 16:21:52,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=504114.0, ans=0.0
2023-06-19 16:22:07,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=504174.0, ans=0.125
2023-06-19 16:22:33,404 INFO [train.py:996] (3/4) Epoch 3, batch 23050, loss[loss=0.3274, simple_loss=0.3818, pruned_loss=0.1365, over 21943.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3478, pruned_loss=0.1064, over 4282818.08 frames. ], batch size: 372, lr: 1.04e-02, grad_scale: 16.0
2023-06-19 16:22:58,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504294.0, ans=0.125
2023-06-19 16:23:28,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5
2023-06-19 16:24:13,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0
2023-06-19 16:24:14,570 INFO [train.py:996] (3/4) Epoch 3, batch 23100, loss[loss=0.2607, simple_loss=0.3088, pruned_loss=0.1063, over 21412.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3423, pruned_loss=0.1069, over 4286229.96 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 16.0
2023-06-19 16:24:16,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.949e+02 3.465e+02 4.322e+02 6.088e+02, threshold=6.930e+02, percent-clipped=0.0
2023-06-19 16:24:34,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=504594.0, ans=0.125
2023-06-19 16:24:37,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=504594.0, ans=0.2
2023-06-19 16:25:49,779 INFO [train.py:996] (3/4) Epoch 3, batch 23150, loss[loss=0.3115, simple_loss=0.349, pruned_loss=0.137, over 21539.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3348, pruned_loss=0.1055, over 4286808.58 frames. ], batch size: 508, lr: 1.04e-02, grad_scale: 16.0
2023-06-19 16:26:17,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504894.0, ans=0.1
2023-06-19 16:27:23,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505074.0, ans=0.125
2023-06-19 16:27:29,886 INFO [train.py:996] (3/4) Epoch 3, batch 23200, loss[loss=0.2466, simple_loss=0.3044, pruned_loss=0.09444, over 19914.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3339, pruned_loss=0.1067, over 4293915.09 frames. ], batch size: 703, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:27:31,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.188e+02 3.765e+02 4.583e+02 7.279e+02, threshold=7.530e+02, percent-clipped=1.0
2023-06-19 16:27:33,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=505134.0, ans=0.0
2023-06-19 16:27:34,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=505134.0, ans=0.125
2023-06-19 16:27:40,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=505134.0, ans=0.0
2023-06-19 16:29:05,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.37 vs. limit=6.0
2023-06-19 16:29:07,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=505374.0, ans=0.125
2023-06-19 16:29:10,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=505434.0, ans=0.125
2023-06-19 16:29:11,832 INFO [train.py:996] (3/4) Epoch 3, batch 23250, loss[loss=0.2448, simple_loss=0.3022, pruned_loss=0.09365, over 21045.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3351, pruned_loss=0.1085, over 4290893.71 frames. ], batch size: 607, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:29:19,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0
2023-06-19 16:30:15,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505614.0, ans=0.0
2023-06-19 16:30:38,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505674.0, ans=0.125
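The grad_scale field in the entries above drops from 32.0 to 16.0 around batch 22900 and is back at 32.0 by batch 23200, the signature of dynamic loss scaling in fp16 training: halve the scale when a step produces inf/NaN gradients, then double it again after a long enough run of clean steps. A generic sketch of that policy (the growth interval of 2000 clean steps is an assumption, not something this log states):

class DynamicGradScale:
    # Halve on overflow, double after `growth_interval` clean steps.

    def __init__(self, scale: float = 32.0, growth_interval: int = 2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_inf: bool) -> float:
        if found_inf:  # the optimizer step is skipped and the scale backs off
            self.scale /= 2.0
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps % self.growth_interval == 0:
                self.scale *= 2.0
        return self.scale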
2023-06-19 16:30:55,039 INFO [train.py:996] (3/4) Epoch 3, batch 23300, loss[loss=0.3007, simple_loss=0.3592, pruned_loss=0.1211, over 21782.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3441, pruned_loss=0.1111, over 4286715.54 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:30:56,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.151e+02 3.585e+02 4.227e+02 7.319e+02, threshold=7.169e+02, percent-clipped=0.0
2023-06-19 16:31:26,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=505794.0, ans=0.2
2023-06-19 16:31:40,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=505854.0, ans=0.125
2023-06-19 16:31:42,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=505854.0, ans=0.2
2023-06-19 16:32:38,664 INFO [train.py:996] (3/4) Epoch 3, batch 23350, loss[loss=0.2345, simple_loss=0.3065, pruned_loss=0.08123, over 21635.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3497, pruned_loss=0.1102, over 4290252.71 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:32:53,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=506034.0, ans=0.2
2023-06-19 16:34:21,300 INFO [train.py:996] (3/4) Epoch 3, batch 23400, loss[loss=0.2916, simple_loss=0.35, pruned_loss=0.1166, over 21864.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3411, pruned_loss=0.1056, over 4286091.15 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:34:22,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.587e+02 3.042e+02 3.768e+02 6.854e+02, threshold=6.085e+02, percent-clipped=0.0
2023-06-19 16:34:35,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=506334.0, ans=0.125
2023-06-19 16:36:07,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506634.0, ans=0.1
2023-06-19 16:36:07,901 INFO [train.py:996] (3/4) Epoch 3, batch 23450, loss[loss=0.2773, simple_loss=0.341, pruned_loss=0.1067, over 21701.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3424, pruned_loss=0.1084, over 4285454.88 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:36:25,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5
2023-06-19 16:36:28,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506694.0, ans=0.1
2023-06-19 16:36:35,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0
2023-06-19 16:36:36,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=506694.0, ans=0.025
2023-06-19 16:36:36,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506694.0, ans=0.1
2023-06-19 16:37:20,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=506814.0, ans=0.0
2023-06-19 16:37:49,342 INFO [train.py:996] (3/4) Epoch 3, batch 23500, loss[loss=0.2596, simple_loss=0.324, pruned_loss=0.09764, over 21876.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3422, pruned_loss=0.1101, over 4291747.85 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:37:50,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.271e+02 4.126e+02 5.318e+02 8.868e+02, threshold=8.252e+02, percent-clipped=14.0
2023-06-19 16:38:15,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0
2023-06-19 16:39:30,984 INFO [train.py:996] (3/4) Epoch 3, batch 23550, loss[loss=0.2343, simple_loss=0.2866, pruned_loss=0.09098, over 21657.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3376, pruned_loss=0.1099, over 4290634.58 frames. ], batch size: 264, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:40:07,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2023-06-19 16:40:39,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=507414.0, ans=0.0
2023-06-19 16:40:42,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507414.0, ans=0.1
2023-06-19 16:40:44,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507414.0, ans=0.1
2023-06-19 16:40:51,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=507474.0, ans=0.125
2023-06-19 16:41:17,608 INFO [train.py:996] (3/4) Epoch 3, batch 23600, loss[loss=0.2699, simple_loss=0.3311, pruned_loss=0.1043, over 21799.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3372, pruned_loss=0.1095, over 4282181.08 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:41:19,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.135e+02 3.693e+02 4.651e+02 9.053e+02, threshold=7.385e+02, percent-clipped=1.0
2023-06-19 16:42:17,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=507714.0, ans=0.0
2023-06-19 16:42:56,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0
2023-06-19 16:42:58,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=12.0
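The scaling.py:182 entries trace ScheduledFloat values: hyperparameters such as skip rates, balancer probabilities and dropout probabilities whose current value (ans) is a function of batch_count, so regularisation can be strong early in training and relaxed later. One simple realisation is piecewise-linear interpolation over (batch_count, value) breakpoints; the sketch below assumes that form, and the example breakpoints are made up, not read from scaling.py:

from bisect import bisect_right

class PiecewiseLinearSchedule:
    # Value interpolated linearly between (batch_count, value) points.

    def __init__(self, *points):
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def __call__(self, batch_count: float) -> float:
        i = bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 4000 batches:
conv_skip_rate = PiecewiseLinearSchedule((0, 0.5), (4000, 0.0))
print(conv_skip_rate(502434.0))  # 0.0 -- long past the end of the decay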
2023-06-19 16:43:00,371 INFO [train.py:996] (3/4) Epoch 3, batch 23650, loss[loss=0.2751, simple_loss=0.3474, pruned_loss=0.1014, over 21302.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3375, pruned_loss=0.1079, over 4278681.69 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:43:16,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=507834.0, ans=0.0
2023-06-19 16:43:35,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=507894.0, ans=0.125
2023-06-19 16:44:29,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=508074.0, ans=0.125
2023-06-19 16:44:48,397 INFO [train.py:996] (3/4) Epoch 3, batch 23700, loss[loss=0.2402, simple_loss=0.3038, pruned_loss=0.08825, over 21388.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3395, pruned_loss=0.1069, over 4280425.50 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0
2023-06-19 16:44:49,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.801e+02 3.226e+02 4.051e+02 6.982e+02, threshold=6.453e+02, percent-clipped=0.0
2023-06-19 16:45:21,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=508194.0, ans=0.0
2023-06-19 16:45:36,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=508254.0, ans=0.05
2023-06-19 16:46:30,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=508374.0, ans=0.0
2023-06-19 16:46:36,720 INFO [train.py:996] (3/4) Epoch 3, batch 23750, loss[loss=0.2587, simple_loss=0.3474, pruned_loss=0.08507, over 21658.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3434, pruned_loss=0.1076, over 4279263.08 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:47:40,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=508614.0, ans=0.125
2023-06-19 16:47:46,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=508614.0, ans=0.2
2023-06-19 16:47:56,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=508614.0, ans=0.125
2023-06-19 16:48:21,115 INFO [train.py:996] (3/4) Epoch 3, batch 23800, loss[loss=0.2835, simple_loss=0.3655, pruned_loss=0.1007, over 21769.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3393, pruned_loss=0.1038, over 4276063.54 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:48:22,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.690e+02 3.256e+02 4.075e+02 6.648e+02, threshold=6.511e+02, percent-clipped=1.0
2023-06-19 16:48:30,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508734.0, ans=0.125
2023-06-19 16:50:01,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=508974.0, ans=0.0
2023-06-19 16:50:05,856 INFO [train.py:996] (3/4) Epoch 3, batch 23850, loss[loss=0.2845, simple_loss=0.3541, pruned_loss=0.1075, over 21617.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3499, pruned_loss=0.1067, over 4274715.68 frames. ], batch size: 230, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:50:20,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=509034.0, ans=0.0
2023-06-19 16:50:30,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=509094.0, ans=0.07
2023-06-19 16:51:02,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0
2023-06-19 16:51:48,147 INFO [train.py:996] (3/4) Epoch 3, batch 23900, loss[loss=0.2948, simple_loss=0.359, pruned_loss=0.1153, over 21475.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3586, pruned_loss=0.1111, over 4277750.18 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 16.0
2023-06-19 16:51:51,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.185e+02 4.046e+02 5.288e+02 1.128e+03, threshold=8.092e+02, percent-clipped=13.0
2023-06-19 16:52:06,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=509334.0, ans=0.0
2023-06-19 16:52:17,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509394.0, ans=0.1
2023-06-19 16:52:18,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0
2023-06-19 16:52:22,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=509394.0, ans=0.125
2023-06-19 16:52:29,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=509454.0, ans=0.125
2023-06-19 16:53:10,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5
2023-06-19 16:53:11,455 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 16:53:28,828 INFO [train.py:996] (3/4) Epoch 3, batch 23950, loss[loss=0.2928, simple_loss=0.3469, pruned_loss=0.1194, over 21556.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3513, pruned_loss=0.1103, over 4271663.22 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 16.0
2023-06-19 16:53:39,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5
2023-06-19 16:53:45,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=509634.0, ans=0.125
2023-06-19 16:54:33,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=509754.0, ans=0.125
2023-06-19 16:55:15,645 INFO [train.py:996] (3/4) Epoch 3, batch 24000, loss[loss=0.3135, simple_loss=0.3669, pruned_loss=0.13, over 21653.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3529, pruned_loss=0.1135, over 4266440.05 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:55:15,645 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 16:55:31,887 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2855, simple_loss=0.3833, pruned_loss=0.09389, over 1796401.00 frames.
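At batch 24000 the run pauses training to compute a validation loss ([train.py:1019] and [train.py:1028] just above); validation is interleaved at a fixed batch interval rather than only at epoch ends. A schematic of that control flow, with the interval of 3000 batches and all names being illustrative assumptions:

import torch

def maybe_validate(model, valid_loader, batch_idx: int,
                   valid_interval: int = 3000) -> None:
    # Run a full pass over the dev set every `valid_interval` batches.
    if batch_idx == 0 or batch_idx % valid_interval != 0:
        return
    model.eval()
    with torch.no_grad():
        for batch in valid_loader:
            _ = model(batch)  # accumulate the validation loss here
    model.train()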
2023-06-19 16:55:31,887 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24414MB
2023-06-19 16:55:35,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.049e+02 3.553e+02 4.728e+02 8.625e+02, threshold=7.107e+02, percent-clipped=2.0
2023-06-19 16:55:41,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0
2023-06-19 16:55:43,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.30 vs. limit=10.0
2023-06-19 16:56:01,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=509994.0, ans=0.125
2023-06-19 16:56:08,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.00 vs. limit=22.5
2023-06-19 16:56:09,543 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.232e-03
2023-06-19 16:57:10,303 INFO [train.py:996] (3/4) Epoch 3, batch 24050, loss[loss=0.2641, simple_loss=0.3387, pruned_loss=0.09477, over 21642.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3542, pruned_loss=0.1135, over 4274911.42 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:57:43,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=510294.0, ans=0.125
2023-06-19 16:58:23,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0
2023-06-19 16:58:28,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=510414.0, ans=0.0
2023-06-19 16:58:53,200 INFO [train.py:996] (3/4) Epoch 3, batch 24100, loss[loss=0.428, simple_loss=0.4546, pruned_loss=0.2007, over 21384.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3549, pruned_loss=0.1119, over 4265424.69 frames. ], batch size: 507, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 16:58:56,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.988e+02 3.709e+02 5.089e+02 1.009e+03, threshold=7.417e+02, percent-clipped=9.0
2023-06-19 16:59:09,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=510534.0, ans=0.0
2023-06-19 16:59:14,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=510594.0, ans=0.125
2023-06-19 16:59:26,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=510594.0, ans=0.125
2023-06-19 17:00:13,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=510774.0, ans=0.2
2023-06-19 17:00:30,536 INFO [train.py:996] (3/4) Epoch 3, batch 24150, loss[loss=0.3229, simple_loss=0.3726, pruned_loss=0.1367, over 21695.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3536, pruned_loss=0.1134, over 4271659.76 frames. ], batch size: 473, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:00:49,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=510834.0, ans=0.0
2023-06-19 17:00:59,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=510894.0, ans=0.025
2023-06-19 17:01:37,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=511014.0, ans=0.125
2023-06-19 17:01:42,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=511014.0, ans=0.0
2023-06-19 17:02:14,031 INFO [train.py:996] (3/4) Epoch 3, batch 24200, loss[loss=0.3175, simple_loss=0.3979, pruned_loss=0.1186, over 21587.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3561, pruned_loss=0.1146, over 4276590.10 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:02:14,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=511134.0, ans=0.0
2023-06-19 17:02:14,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=511134.0, ans=0.125
2023-06-19 17:02:17,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.206e+02 3.739e+02 4.662e+02 8.285e+02, threshold=7.479e+02, percent-clipped=1.0
2023-06-19 17:02:24,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=511134.0, ans=0.125
2023-06-19 17:03:38,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=511374.0, ans=0.025
2023-06-19 17:03:48,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511374.0, ans=0.1
2023-06-19 17:03:51,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=511434.0, ans=0.0
2023-06-19 17:03:52,810 INFO [train.py:996] (3/4) Epoch 3, batch 24250, loss[loss=0.1893, simple_loss=0.2854, pruned_loss=0.04661, over 21665.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3513, pruned_loss=0.1063, over 4280736.39 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:03:53,246 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:04:35,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0
2023-06-19 17:05:18,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2023-06-19 17:05:33,892 INFO [train.py:996] (3/4) Epoch 3, batch 24300, loss[loss=0.2387, simple_loss=0.3184, pruned_loss=0.07954, over 21363.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3427, pruned_loss=0.09902, over 4282555.61 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:05:37,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.288e+02 2.786e+02 3.535e+02 7.213e+02, threshold=5.572e+02, percent-clipped=0.0
2023-06-19 17:05:47,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=511734.0, ans=0.0
2023-06-19 17:05:48,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=511794.0, ans=0.09899494936611666
2023-06-19 17:06:20,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5
2023-06-19 17:06:21,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511854.0, ans=0.1
2023-06-19 17:06:21,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=511854.0, ans=0.0
2023-06-19 17:06:39,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=511914.0, ans=0.2
2023-06-19 17:06:57,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511914.0, ans=0.1
2023-06-19 17:07:16,686 INFO [train.py:996] (3/4) Epoch 3, batch 24350, loss[loss=0.2993, simple_loss=0.3639, pruned_loss=0.1174, over 21857.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3394, pruned_loss=0.09992, over 4292198.78 frames. ], batch size: 371, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:07:22,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=512034.0, ans=0.2
2023-06-19 17:07:59,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=512154.0, ans=0.125
2023-06-19 17:08:41,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=512214.0, ans=0.125
2023-06-19 17:09:06,332 INFO [train.py:996] (3/4) Epoch 3, batch 24400, loss[loss=0.2116, simple_loss=0.2755, pruned_loss=0.07381, over 21789.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3451, pruned_loss=0.1055, over 4292894.29 frames. ], batch size: 102, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:09:09,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 3.270e+02 4.131e+02 5.260e+02 7.879e+02, threshold=8.262e+02, percent-clipped=18.0
2023-06-19 17:09:12,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=512334.0, ans=0.0
2023-06-19 17:09:18,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=512334.0, ans=0.125
2023-06-19 17:09:25,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512394.0, ans=0.1
2023-06-19 17:09:37,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0
2023-06-19 17:09:52,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=512454.0, ans=0.125
2023-06-19 17:10:48,968 INFO [train.py:996] (3/4) Epoch 3, batch 24450, loss[loss=0.3777, simple_loss=0.4373, pruned_loss=0.159, over 21471.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3484, pruned_loss=0.1079, over 4295716.34 frames. ], batch size: 508, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:10:57,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=512634.0, ans=0.125
2023-06-19 17:11:46,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5
2023-06-19 17:12:26,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0
2023-06-19 17:12:30,295 INFO [train.py:996] (3/4) Epoch 3, batch 24500, loss[loss=0.2676, simple_loss=0.3495, pruned_loss=0.09287, over 21322.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3483, pruned_loss=0.1078, over 4297206.79 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:12:33,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.872e+02 3.383e+02 4.151e+02 6.413e+02, threshold=6.766e+02, percent-clipped=0.0
2023-06-19 17:12:43,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=512934.0, ans=0.2
2023-06-19 17:12:45,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=512994.0, ans=0.1
2023-06-19 17:13:27,472 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:13:40,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=513114.0, ans=0.125
2023-06-19 17:14:07,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=513174.0, ans=0.2
2023-06-19 17:14:12,252 INFO [train.py:996] (3/4) Epoch 3, batch 24550, loss[loss=0.3589, simple_loss=0.4015, pruned_loss=0.1581, over 21721.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3515, pruned_loss=0.1105, over 4297584.53 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:15:54,302 INFO [train.py:996] (3/4) Epoch 3, batch 24600, loss[loss=0.2367, simple_loss=0.288, pruned_loss=0.09265, over 21186.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3465, pruned_loss=0.1102, over 4297715.89 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:15:57,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.888e+02 3.572e+02 4.375e+02 7.058e+02, threshold=7.144e+02, percent-clipped=1.0
2023-06-19 17:16:21,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0
2023-06-19 17:16:22,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=513594.0, ans=0.125
2023-06-19 17:16:48,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5
2023-06-19 17:16:49,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=513654.0, ans=0.125
2023-06-19 17:17:11,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=513714.0, ans=0.125
2023-06-19 17:17:21,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513774.0, ans=0.1
2023-06-19 17:17:34,802 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:17:35,819 INFO [train.py:996] (3/4) Epoch 3, batch 24650, loss[loss=0.2507, simple_loss=0.2943, pruned_loss=0.1035, over 21656.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3369, pruned_loss=0.1085, over 4285818.08 frames. ], batch size: 248, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:17:44,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513834.0, ans=0.1
2023-06-19 17:18:08,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513894.0, ans=0.125
2023-06-19 17:18:09,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=513894.0, ans=0.2
2023-06-19 17:18:18,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=513954.0, ans=0.0
2023-06-19 17:18:36,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=514014.0, ans=0.0
2023-06-19 17:18:57,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514074.0, ans=0.1
2023-06-19 17:19:13,012 INFO [train.py:996] (3/4) Epoch 3, batch 24700, loss[loss=0.3258, simple_loss=0.3681, pruned_loss=0.1417, over 21539.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3349, pruned_loss=0.106, over 4280712.66 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:19:16,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.157e+02 3.618e+02 4.336e+02 6.867e+02, threshold=7.236e+02, percent-clipped=0.0
2023-06-19 17:19:54,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=514194.0, ans=0.125
2023-06-19 17:20:03,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0
2023-06-19 17:20:16,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=514314.0, ans=0.05
2023-06-19 17:20:32,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=514314.0, ans=0.0
2023-06-19 17:20:35,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=514314.0, ans=0.0
2023-06-19 17:20:48,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=514374.0, ans=0.125
2023-06-19 17:20:55,359 INFO [train.py:996] (3/4) Epoch 3, batch 24750, loss[loss=0.264, simple_loss=0.3074, pruned_loss=0.1103, over 21595.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3291, pruned_loss=0.1029, over 4267429.68 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:21:19,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5
2023-06-19 17:22:15,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=514614.0, ans=0.125
2023-06-19 17:22:17,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=514614.0, ans=0.125
2023-06-19 17:22:36,149 INFO [train.py:996] (3/4) Epoch 3, batch 24800, loss[loss=0.2701, simple_loss=0.3164, pruned_loss=0.1119, over 21560.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3237, pruned_loss=0.1018, over 4267916.48 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:22:39,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.753e+02 3.144e+02 3.669e+02 5.851e+02, threshold=6.289e+02, percent-clipped=0.0
2023-06-19 17:22:56,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-06-19 17:23:21,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514794.0, ans=0.1
2023-06-19 17:23:23,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=514854.0, ans=0.1
2023-06-19 17:23:30,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=514854.0, ans=0.0
2023-06-19 17:23:31,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=514854.0, ans=0.125
2023-06-19 17:23:38,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=514854.0, ans=0.125
2023-06-19 17:24:19,440 INFO [train.py:996] (3/4) Epoch 3, batch 24850, loss[loss=0.2356, simple_loss=0.3111, pruned_loss=0.08003, over 21068.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3239, pruned_loss=0.1032, over 4277544.62 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:25:18,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=515154.0, ans=0.07
2023-06-19 17:26:02,428 INFO [train.py:996] (3/4) Epoch 3, batch 24900, loss[loss=0.314, simple_loss=0.3644, pruned_loss=0.1318, over 21197.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3246, pruned_loss=0.103, over 4271424.57 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:26:11,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.873e+02 3.627e+02 4.468e+02 7.935e+02, threshold=7.253e+02, percent-clipped=5.0
2023-06-19 17:27:10,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=515454.0, ans=0.125
2023-06-19 17:27:23,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=515514.0, ans=0.2
2023-06-19 17:27:26,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=515514.0, ans=0.125
2023-06-19 17:27:56,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=515634.0, ans=0.125
2023-06-19 17:27:57,497 INFO [train.py:996] (3/4) Epoch 3, batch 24950, loss[loss=0.288, simple_loss=0.3483, pruned_loss=0.1139, over 21820.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3347, pruned_loss=0.1093, over 4274904.29 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:28:20,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=515694.0, ans=0.125
2023-06-19 17:28:20,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=515694.0, ans=0.0
2023-06-19 17:28:50,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515754.0, ans=0.125
2023-06-19 17:28:58,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0
2023-06-19 17:29:46,103 INFO [train.py:996] (3/4) Epoch 3, batch 25000, loss[loss=0.2603, simple_loss=0.318, pruned_loss=0.1013, over 21232.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3423, pruned_loss=0.1114, over 4268877.11 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:29:49,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.951e+02 3.694e+02 4.326e+02 9.045e+02, threshold=7.388e+02, percent-clipped=1.0
2023-06-19 17:30:01,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=515994.0, ans=0.125
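In the Whitening entries the logged metric measures how far a module's output covariance is from white (isotropic), and the whitening constraint only engages while metric exceeds limit, which is why most entries sit below their limits. One common statistic with this behaviour is the eigenvalue-dispersion ratio mean(lambda^2) / mean(lambda)^2 of the feature covariance, which is 1.0 for perfectly white features and grows as energy concentrates in a few directions; the sketch below uses that formulation (and ignores num_groups) as an assumption, not the exact statistic in scaling.py:

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels); returns eigenvalue dispersion of cov(x).
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)  # real eigenvalues, ascending
    return (eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)

metric = whitening_metric(torch.randn(1024, 256))
print(f"metric={metric:.2f} vs. limit=15.0")  # near-white input stays close to 1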
2023-06-19 17:30:26,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0
2023-06-19 17:30:46,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516114.0, ans=0.1
2023-06-19 17:30:57,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=516114.0, ans=0.125
2023-06-19 17:31:28,963 INFO [train.py:996] (3/4) Epoch 3, batch 25050, loss[loss=0.2472, simple_loss=0.303, pruned_loss=0.09573, over 21588.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3341, pruned_loss=0.1086, over 4266201.37 frames. ], batch size: 298, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:32:16,320 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:33:07,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=516474.0, ans=0.0
2023-06-19 17:33:10,992 INFO [train.py:996] (3/4) Epoch 3, batch 25100, loss[loss=0.2423, simple_loss=0.2896, pruned_loss=0.09755, over 21469.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3268, pruned_loss=0.1074, over 4268515.45 frames. ], batch size: 195, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:33:13,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.071e+02 3.524e+02 4.196e+02 8.233e+02, threshold=7.049e+02, percent-clipped=3.0
2023-06-19 17:33:37,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516594.0, ans=0.1
2023-06-19 17:33:49,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=516654.0, ans=0.125
2023-06-19 17:33:49,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=516654.0, ans=0.2
2023-06-19 17:34:12,931 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:34:14,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=516714.0, ans=0.125
2023-06-19 17:34:14,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0
2023-06-19 17:34:17,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=516714.0, ans=0.0
2023-06-19 17:34:47,466 INFO [train.py:996] (3/4) Epoch 3, batch 25150, loss[loss=0.2392, simple_loss=0.3226, pruned_loss=0.07795, over 21660.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3302, pruned_loss=0.1044, over 4243496.55 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:35:11,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0
2023-06-19 17:36:03,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0
2023-06-19 17:36:10,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=517074.0, ans=0.125
2023-06-19 17:36:29,174 INFO [train.py:996] (3/4) Epoch 3, batch 25200, loss[loss=0.257, simple_loss=0.3442, pruned_loss=0.08494, over 21770.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3295, pruned_loss=0.1021, over 4249978.18 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:36:32,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.662e+02 3.153e+02 4.538e+02 8.599e+02, threshold=6.306e+02, percent-clipped=6.0
2023-06-19 17:37:38,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0
2023-06-19 17:37:40,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=517314.0, ans=0.125
2023-06-19 17:38:03,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0
2023-06-19 17:38:10,283 INFO [train.py:996] (3/4) Epoch 3, batch 25250, loss[loss=0.2223, simple_loss=0.2897, pruned_loss=0.07748, over 21603.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.329, pruned_loss=0.1006, over 4254372.39 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:38:25,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=517434.0, ans=0.07
2023-06-19 17:38:33,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517494.0, ans=0.1
2023-06-19 17:38:36,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517494.0, ans=0.1
2023-06-19 17:38:41,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=517494.0, ans=0.125
2023-06-19 17:38:48,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=517554.0, ans=0.07
2023-06-19 17:39:03,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517554.0, ans=0.125
2023-06-19 17:39:03,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=517554.0, ans=0.025
2023-06-19 17:39:56,950 INFO [train.py:996] (3/4) Epoch 3, batch 25300, loss[loss=0.2863, simple_loss=0.3538, pruned_loss=0.1094, over 21737.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3275, pruned_loss=0.1006, over 4254480.28 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:40:00,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.138e+02 3.706e+02 4.437e+02 8.805e+02, threshold=7.413e+02, percent-clipped=6.0
2023-06-19 17:40:39,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=517854.0, ans=0.0
2023-06-19 17:41:22,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517974.0, ans=0.1
2023-06-19 17:41:32,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=517974.0, ans=0.0
2023-06-19 17:41:40,338 INFO [train.py:996] (3/4) Epoch 3, batch 25350, loss[loss=0.2707, simple_loss=0.3645, pruned_loss=0.08847, over 21219.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3313, pruned_loss=0.1004, over 4260150.47 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:41:52,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518034.0, ans=0.125
2023-06-19 17:42:11,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.62 vs. limit=10.0
2023-06-19 17:42:20,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518154.0, ans=0.1
2023-06-19 17:42:23,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=518154.0, ans=0.2
2023-06-19 17:42:37,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=518214.0, ans=0.125
2023-06-19 17:42:55,993 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:43:17,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=518274.0, ans=0.2
2023-06-19 17:43:20,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518334.0, ans=0.1
2023-06-19 17:43:21,810 INFO [train.py:996] (3/4) Epoch 3, batch 25400, loss[loss=0.2489, simple_loss=0.3002, pruned_loss=0.09876, over 21323.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3249, pruned_loss=0.09875, over 4258258.22 frames. ], batch size: 144, lr: 1.03e-02, grad_scale: 32.0
2023-06-19 17:43:24,801 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.818e+02 3.409e+02 4.580e+02 8.063e+02, threshold=6.817e+02, percent-clipped=2.0
2023-06-19 17:43:41,584 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:44:54,767 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 17:45:02,513 INFO [train.py:996] (3/4) Epoch 3, batch 25450, loss[loss=0.2594, simple_loss=0.3361, pruned_loss=0.09129, over 21673.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.326, pruned_loss=0.1007, over 4239848.37 frames. ], batch size: 230, lr: 1.02e-02, grad_scale: 32.0
2023-06-19 17:45:16,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0
2023-06-19 17:45:19,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0
2023-06-19 17:45:31,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=518694.0, ans=0.0
2023-06-19 17:45:33,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518694.0, ans=0.1
2023-06-19 17:45:37,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=518694.0, ans=0.0
2023-06-19 17:45:43,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=518754.0, ans=0.07
2023-06-19 17:45:46,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=518754.0, ans=0.0
2023-06-19 17:46:03,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=518814.0, ans=0.125
2023-06-19 17:46:10,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=518814.0, ans=0.125
2023-06-19 17:46:17,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.83 vs. limit=10.0
2023-06-19 17:46:43,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0
2023-06-19 17:46:46,168 INFO [train.py:996] (3/4) Epoch 3, batch 25500, loss[loss=0.2995, simple_loss=0.3739, pruned_loss=0.1126, over 21669.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3263, pruned_loss=0.09656, over 4240410.15 frames. ], batch size: 441, lr: 1.02e-02, grad_scale: 32.0
2023-06-19 17:46:49,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.642e+02 3.063e+02 3.580e+02 7.751e+02, threshold=6.127e+02, percent-clipped=1.0
2023-06-19 17:47:09,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=518994.0, ans=0.2
2023-06-19 17:47:24,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=519054.0, ans=0.0
2023-06-19 17:47:38,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0
2023-06-19 17:47:45,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0
2023-06-19 17:48:31,146 INFO [train.py:996] (3/4) Epoch 3, batch 25550, loss[loss=0.2583, simple_loss=0.3582, pruned_loss=0.07921, over 21864.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.334, pruned_loss=0.09723, over 4246485.20 frames. ], batch size: 371, lr: 1.02e-02, grad_scale: 32.0
2023-06-19 17:48:33,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0
2023-06-19 17:49:25,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0
2023-06-19 17:49:39,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=519414.0, ans=0.0
2023-06-19 17:49:49,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=519414.0, ans=0.0
2023-06-19 17:49:57,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=519474.0, ans=0.125
2023-06-19 17:49:58,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0
2023-06-19 17:50:20,641 INFO [train.py:996] (3/4) Epoch 3, batch 25600, loss[loss=0.3257, simple_loss=0.3916, pruned_loss=0.1299, over 21845.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3383, pruned_loss=0.09825, over 4257199.64 frames. ], batch size: 124, lr: 1.02e-02, grad_scale: 32.0
2023-06-19 17:50:23,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.720e+02 3.211e+02 3.853e+02 6.629e+02, threshold=6.421e+02, percent-clipped=1.0
2023-06-19 17:50:34,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=519534.0, ans=0.125
2023-06-19 17:50:59,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519654.0, ans=0.1
2023-06-19 17:51:06,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0
2023-06-19 17:51:15,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0
2023-06-19 17:51:29,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=519714.0, ans=0.0
2023-06-19 17:51:42,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=12.0
2023-06-19 17:51:44,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519774.0, ans=0.1
2023-06-19 17:51:57,679 INFO [train.py:996] (3/4) Epoch 3, batch 25650, loss[loss=0.2647, simple_loss=0.3259, pruned_loss=0.1017, over 21390.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3398, pruned_loss=0.1025, over 4260920.74 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0
2023-06-19 17:51:59,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=519834.0, ans=0.0
2023-06-19 17:52:14,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519834.0, ans=0.1
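In each train.py:996 entry the first loss[...] block is the current batch, weighted by its frame count, while tot_loss[...] is a running frame-weighted aggregate; its fractional frame totals (e.g. over 4260920.74 frames just above) suggest older batches are exponentially down-weighted rather than simply summed. A sketch of one such scheme; the decay constant is illustrative:

class RunningLoss:
    # Frame-weighted running average with exponential forgetting.

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frame_sum = 0.0

    def update(self, batch_loss: float, num_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * num_frames
        self.frame_sum = self.decay * self.frame_sum + num_frames
        return self.loss_sum / self.frame_sum  # the reported tot_loss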
], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:53:46,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.093e+02 3.779e+02 4.609e+02 9.934e+02, threshold=7.559e+02, percent-clipped=6.0 2023-06-19 17:53:53,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=520134.0, ans=0.0 2023-06-19 17:54:32,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=520254.0, ans=0.1 2023-06-19 17:54:36,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520254.0, ans=0.1 2023-06-19 17:54:49,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=520314.0, ans=0.125 2023-06-19 17:55:28,423 INFO [train.py:996] (3/4) Epoch 3, batch 25750, loss[loss=0.3262, simple_loss=0.3684, pruned_loss=0.142, over 21247.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3442, pruned_loss=0.1086, over 4255478.92 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:55:30,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=520434.0, ans=0.125 2023-06-19 17:56:15,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=520494.0, ans=0.0 2023-06-19 17:56:49,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=520614.0, ans=0.0 2023-06-19 17:57:15,532 INFO [train.py:996] (3/4) Epoch 3, batch 25800, loss[loss=0.3091, simple_loss=0.3723, pruned_loss=0.123, over 21517.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3568, pruned_loss=0.1133, over 4257156.80 frames. ], batch size: 194, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:57:25,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.639e+02 4.483e+02 6.036e+02 1.254e+03, threshold=8.967e+02, percent-clipped=11.0 2023-06-19 17:57:27,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-19 17:57:42,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=520794.0, ans=0.0 2023-06-19 17:57:44,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=520794.0, ans=0.125 2023-06-19 17:58:01,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=520854.0, ans=0.125 2023-06-19 17:58:03,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=520854.0, ans=0.0 2023-06-19 17:58:33,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=520914.0, ans=0.125 2023-06-19 17:59:06,165 INFO [train.py:996] (3/4) Epoch 3, batch 25850, loss[loss=0.2973, simple_loss=0.3489, pruned_loss=0.1228, over 21849.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3566, pruned_loss=0.1131, over 4259538.24 frames. 
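[Note on the per-batch loss components: in pruned-transducer training the "simple" loss comes from a cheap linear joiner evaluated over the full lattice and the "pruned" loss from the full joiner evaluated only inside a pruned band of it. The numbers in this log are consistent with the reported loss being 0.5 * simple_loss + 1.0 * pruned_loss; a quick check of that implied weighting against three entries above:

    # Weighting implied by the logged values (assumed, reconstructed from the log):
    # loss ~ 0.5 * simple_loss + 1.0 * pruned_loss
    examples = [
        (0.2995, 0.3739, 0.1126),   # Epoch 3, batch 25500
        (0.2583, 0.3582, 0.07921),  # Epoch 3, batch 25550
        (0.3257, 0.3916, 0.1299),   # Epoch 3, batch 25600
    ]
    for loss, simple_loss, pruned_loss in examples:
        reconstructed = 0.5 * simple_loss + 1.0 * pruned_loss
        print(f"logged={loss:.4f} reconstructed={reconstructed:.4f}")

All three reconstruct to within rounding (0.2996, 0.2583, 0.3257), so by this point in training the pruned term carries full weight alongside the down-weighted simple term.]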
], batch size: 282, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:00:03,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=521154.0, ans=0.125 2023-06-19 18:00:56,488 INFO [train.py:996] (3/4) Epoch 3, batch 25900, loss[loss=0.303, simple_loss=0.3852, pruned_loss=0.1104, over 21747.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3591, pruned_loss=0.1136, over 4267438.54 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:01:01,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 3.101e+02 3.467e+02 4.368e+02 8.294e+02, threshold=6.933e+02, percent-clipped=0.0 2023-06-19 18:01:23,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=521394.0, ans=0.0 2023-06-19 18:02:23,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=521574.0, ans=0.2 2023-06-19 18:02:39,533 INFO [train.py:996] (3/4) Epoch 3, batch 25950, loss[loss=0.3305, simple_loss=0.3842, pruned_loss=0.1384, over 21751.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3665, pruned_loss=0.1184, over 4273333.30 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:03:14,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=521694.0, ans=0.125 2023-06-19 18:03:27,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=521754.0, ans=0.125 2023-06-19 18:04:01,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=521814.0, ans=0.125 2023-06-19 18:04:16,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=521874.0, ans=0.0 2023-06-19 18:04:24,049 INFO [train.py:996] (3/4) Epoch 3, batch 26000, loss[loss=0.2831, simple_loss=0.3646, pruned_loss=0.1007, over 21934.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3671, pruned_loss=0.1167, over 4276261.69 frames. ], batch size: 317, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:04:32,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521934.0, ans=0.1 2023-06-19 18:04:35,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.072e+02 3.699e+02 4.692e+02 7.013e+02, threshold=7.398e+02, percent-clipped=1.0 2023-06-19 18:04:36,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521934.0, ans=0.125 2023-06-19 18:05:00,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=522054.0, ans=0.0 2023-06-19 18:05:04,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=522054.0, ans=0.5 2023-06-19 18:05:46,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=522114.0, ans=0.125 2023-06-19 18:06:02,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=522174.0, ans=0.125 2023-06-19 18:06:06,908 INFO [train.py:996] (3/4) Epoch 3, batch 26050, loss[loss=0.2615, simple_loss=0.3339, pruned_loss=0.09461, over 20659.00 frames. 
], tot_loss[loss=0.3001, simple_loss=0.3657, pruned_loss=0.1172, over 4283671.09 frames. ], batch size: 609, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:06:21,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=522234.0, ans=0.0 2023-06-19 18:06:31,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=522294.0, ans=0.2 2023-06-19 18:06:35,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=522294.0, ans=0.125 2023-06-19 18:06:41,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522294.0, ans=0.1 2023-06-19 18:07:22,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=522414.0, ans=0.125 2023-06-19 18:07:24,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=522414.0, ans=0.125 2023-06-19 18:07:30,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=522414.0, ans=0.2 2023-06-19 18:07:38,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=522474.0, ans=0.0 2023-06-19 18:07:44,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=522474.0, ans=0.0 2023-06-19 18:07:49,549 INFO [train.py:996] (3/4) Epoch 3, batch 26100, loss[loss=0.3106, simple_loss=0.3565, pruned_loss=0.1324, over 21915.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3591, pruned_loss=0.1163, over 4285471.53 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:08:01,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.379e+02 4.537e+02 7.018e+02, threshold=6.758e+02, percent-clipped=0.0 2023-06-19 18:08:12,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-19 18:08:13,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522594.0, ans=0.1 2023-06-19 18:08:46,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=522654.0, ans=0.125 2023-06-19 18:08:56,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=522654.0, ans=0.125 2023-06-19 18:09:01,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522714.0, ans=0.125 2023-06-19 18:09:39,636 INFO [train.py:996] (3/4) Epoch 3, batch 26150, loss[loss=0.2647, simple_loss=0.3266, pruned_loss=0.1014, over 20947.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.354, pruned_loss=0.1155, over 4290251.69 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:09:46,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=522834.0, ans=0.2 2023-06-19 18:10:21,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. 
limit=22.5 2023-06-19 18:11:10,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-19 18:11:23,926 INFO [train.py:996] (3/4) Epoch 3, batch 26200, loss[loss=0.2882, simple_loss=0.3791, pruned_loss=0.09863, over 21807.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3543, pruned_loss=0.1128, over 4292768.61 frames. ], batch size: 282, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:11:30,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.095e+02 3.569e+02 4.232e+02 6.752e+02, threshold=7.138e+02, percent-clipped=0.0 2023-06-19 18:11:36,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=523134.0, ans=0.125 2023-06-19 18:12:03,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523194.0, ans=0.125 2023-06-19 18:12:19,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=523254.0, ans=0.125 2023-06-19 18:12:28,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=523314.0, ans=0.0 2023-06-19 18:12:30,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523314.0, ans=0.1 2023-06-19 18:13:06,878 INFO [train.py:996] (3/4) Epoch 3, batch 26250, loss[loss=0.3049, simple_loss=0.3723, pruned_loss=0.1188, over 21698.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3586, pruned_loss=0.111, over 4292992.12 frames. ], batch size: 389, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:13:46,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523554.0, ans=0.1 2023-06-19 18:13:58,785 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:13:59,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 18:14:32,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.72 vs. limit=15.0 2023-06-19 18:14:41,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=523674.0, ans=0.0 2023-06-19 18:14:44,539 INFO [train.py:996] (3/4) Epoch 3, batch 26300, loss[loss=0.2544, simple_loss=0.3129, pruned_loss=0.09796, over 21960.00 frames. ], tot_loss[loss=0.289, simple_loss=0.355, pruned_loss=0.1115, over 4300072.41 frames. 
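[Note on the Whitening entries: scaling.py compares a whiteness metric of a module's activations against a scheduled limit, and when the metric exceeds the limit (as in the occasional metric=16.52 vs. limit=15.0 style entries in this stretch) the Whiten module nudges gradients so the channel covariance becomes closer to isotropic. A sketch of one such metric, assuming the ratio-of-eigenvalue-moments form, which is 1.0 for perfectly white features and grows as the covariance concentrates in few directions; icefall's exact formula may differ in detail:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """x: (num_frames, num_channels). Returns >= 1.0; == 1.0 iff covariance is isotropic."""
        n, c = x.shape
        assert c % num_groups == 0
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)   # (groups, frames, chans)
        covar = torch.matmul(x.transpose(1, 2), x) / n                  # per-group covariance
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()               # mean eigenvalue
        mean_sq = torch.matmul(covar, covar).diagonal(dim1=1, dim2=2).mean()  # mean squared eigenvalue
        return float(mean_sq / mean_diag.clamp(min=1e-20) ** 2)

Most entries in this section sit safely below their limits (metric=4.32 vs. 15.0 and similar), which means the penalty is inactive and the lines are purely diagnostic.]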
], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:14:51,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.134e+02 3.781e+02 4.659e+02 7.680e+02, threshold=7.563e+02, percent-clipped=3.0 2023-06-19 18:15:37,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=523854.0, ans=0.125 2023-06-19 18:16:33,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=524034.0, ans=0.0 2023-06-19 18:16:39,669 INFO [train.py:996] (3/4) Epoch 3, batch 26350, loss[loss=0.2908, simple_loss=0.3532, pruned_loss=0.1142, over 21784.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.353, pruned_loss=0.1122, over 4298777.32 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:17:00,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-19 18:17:50,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-19 18:18:05,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-19 18:18:16,286 INFO [train.py:996] (3/4) Epoch 3, batch 26400, loss[loss=0.2265, simple_loss=0.2864, pruned_loss=0.08327, over 21839.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3477, pruned_loss=0.1118, over 4294735.09 frames. ], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:18:28,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.857e+02 3.384e+02 4.347e+02 8.285e+02, threshold=6.769e+02, percent-clipped=0.0 2023-06-19 18:18:32,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-19 18:18:48,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=524394.0, ans=0.125 2023-06-19 18:20:01,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.52 vs. limit=15.0 2023-06-19 18:20:12,385 INFO [train.py:996] (3/4) Epoch 3, batch 26450, loss[loss=0.292, simple_loss=0.3602, pruned_loss=0.1119, over 21377.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3461, pruned_loss=0.1103, over 4289334.08 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:20:46,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. 
limit=22.5 2023-06-19 18:21:47,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=524874.0, ans=0.125 2023-06-19 18:21:47,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524874.0, ans=0.1 2023-06-19 18:21:50,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=524874.0, ans=0.125 2023-06-19 18:21:55,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=524934.0, ans=0.95 2023-06-19 18:21:56,760 INFO [train.py:996] (3/4) Epoch 3, batch 26500, loss[loss=0.2283, simple_loss=0.2883, pruned_loss=0.08417, over 21361.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3466, pruned_loss=0.1081, over 4279790.73 frames. ], batch size: 194, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:22:04,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.354e+02 4.139e+02 5.566e+02 7.518e+02, threshold=8.277e+02, percent-clipped=7.0 2023-06-19 18:22:36,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=525054.0, ans=0.125 2023-06-19 18:22:42,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-19 18:22:52,484 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:23:21,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=525114.0, ans=0.125 2023-06-19 18:23:24,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-19 18:23:42,728 INFO [train.py:996] (3/4) Epoch 3, batch 26550, loss[loss=0.2199, simple_loss=0.2785, pruned_loss=0.08065, over 21188.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3414, pruned_loss=0.1036, over 4270658.00 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:24:16,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=525294.0, ans=0.05 2023-06-19 18:25:30,071 INFO [train.py:996] (3/4) Epoch 3, batch 26600, loss[loss=0.3063, simple_loss=0.3547, pruned_loss=0.1289, over 21339.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3417, pruned_loss=0.1015, over 4264274.48 frames. 
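[Note on the ScheduledFloat entries: each one prints the current value (ans=...) of a float hyperparameter that is interpolated piecewise-linearly in batch_count — skip rates, dropout probabilities, balancer probabilities and bypass scales all decay on such schedules, which is why by batch_count ~5.2e5 nearly all of them sit at their final values (0.0, 0.1, 0.125, 0.2, ...). A minimal sketch of the idea with hypothetical breakpoints; icefall's scaling.ScheduledFloat has more machinery:

    class ScheduledFloat:
        """A float-valued hyperparameter, piecewise-linear in batch_count."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
            self.points = sorted(points)
            self.batch_count = 0.0

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            if self.batch_count >= pts[-1][0]:
                return float(pts[-1][1])
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))

    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))  # hypothetical breakpoints
    dropout_p.batch_count = 519654.0
    print(float(dropout_p))  # 0.1 -- far past the last breakpoint, as in the log

The logged values are therefore a cheap way to confirm that every regularizer has reached its late-training setting.]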
], batch size: 471, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:25:38,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.044e+02 3.646e+02 4.264e+02 8.431e+02, threshold=7.292e+02, percent-clipped=1.0 2023-06-19 18:26:22,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=525654.0, ans=0.125 2023-06-19 18:26:25,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=525654.0, ans=0.05 2023-06-19 18:26:40,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=525714.0, ans=0.0 2023-06-19 18:27:10,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=525774.0, ans=0.125 2023-06-19 18:27:13,237 INFO [train.py:996] (3/4) Epoch 3, batch 26650, loss[loss=0.201, simple_loss=0.2847, pruned_loss=0.05866, over 21676.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3331, pruned_loss=0.09946, over 4258866.81 frames. ], batch size: 415, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:27:28,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=525894.0, ans=0.125 2023-06-19 18:27:38,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525894.0, ans=0.1 2023-06-19 18:27:43,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=525894.0, ans=0.125 2023-06-19 18:28:25,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=526014.0, ans=0.0 2023-06-19 18:28:35,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=526074.0, ans=0.125 2023-06-19 18:28:53,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=526134.0, ans=0.125 2023-06-19 18:28:55,341 INFO [train.py:996] (3/4) Epoch 3, batch 26700, loss[loss=0.2625, simple_loss=0.3255, pruned_loss=0.09973, over 21911.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3265, pruned_loss=0.0963, over 4262864.34 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:29:03,452 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 2.681e+02 3.249e+02 4.280e+02 9.861e+02, threshold=6.499e+02, percent-clipped=1.0 2023-06-19 18:29:47,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=526254.0, ans=0.125 2023-06-19 18:29:50,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=526254.0, ans=0.0 2023-06-19 18:30:38,038 INFO [train.py:996] (3/4) Epoch 3, batch 26750, loss[loss=0.2249, simple_loss=0.3087, pruned_loss=0.07051, over 21427.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3275, pruned_loss=0.09575, over 4268178.15 frames. 
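[Note on the balancer entries: the min_positive, max_positive, min_abs and prob fields belong to Balancer modules that keep per-channel activation statistics inside a target range — for instance the fraction of positive values between min_positive and max_positive — and the check only runs with probability prob (mostly 0.125 at this point in the schedules above). A rough sketch of the statistic being policed, under those assumptions; the real module applies corrections to gradients in backward rather than altering the activations:

    import torch

    def positive_fraction(x: torch.Tensor) -> torch.Tensor:
        """Per-channel fraction of positive activations; x: (frames, channels)."""
        return (x > 0).float().mean(dim=0)

    x = torch.randn(1000, 256)
    frac = positive_fraction(x)
    violations = (frac < 0.05) | (frac > 0.95)  # outside [min_positive, max_positive]
    print(int(violations.sum()), "of", x.shape[1], "channels would be nudged")

Keeping these fractions bounded prevents channels from dying (always negative) or saturating (always positive) deep in the Zipformer stack.]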
], batch size: 211, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:30:41,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=526434.0, ans=0.125 2023-06-19 18:31:16,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=526494.0, ans=0.125 2023-06-19 18:32:24,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=526674.0, ans=0.125 2023-06-19 18:32:35,829 INFO [train.py:996] (3/4) Epoch 3, batch 26800, loss[loss=0.3354, simple_loss=0.3877, pruned_loss=0.1416, over 21390.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3361, pruned_loss=0.1015, over 4270398.53 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:32:49,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.066e+02 3.643e+02 4.361e+02 8.068e+02, threshold=7.286e+02, percent-clipped=5.0 2023-06-19 18:33:24,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526854.0, ans=0.1 2023-06-19 18:34:23,767 INFO [train.py:996] (3/4) Epoch 3, batch 26850, loss[loss=0.2543, simple_loss=0.3109, pruned_loss=0.09881, over 21693.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3394, pruned_loss=0.1053, over 4275488.75 frames. ], batch size: 112, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:34:37,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527034.0, ans=0.1 2023-06-19 18:34:38,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=527094.0, ans=0.025 2023-06-19 18:35:06,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=527154.0, ans=0.2 2023-06-19 18:35:29,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=527214.0, ans=0.125 2023-06-19 18:36:05,753 INFO [train.py:996] (3/4) Epoch 3, batch 26900, loss[loss=0.223, simple_loss=0.2807, pruned_loss=0.08263, over 21600.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3296, pruned_loss=0.1033, over 4272752.73 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:36:14,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.946e+02 3.321e+02 4.106e+02 6.345e+02, threshold=6.642e+02, percent-clipped=0.0 2023-06-19 18:36:31,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=527394.0, ans=0.125 2023-06-19 18:36:47,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-19 18:37:11,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-19 18:37:27,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-06-19 18:37:35,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=15.0 2023-06-19 18:37:49,033 INFO [train.py:996] (3/4) Epoch 3, batch 26950, loss[loss=0.2986, simple_loss=0.382, pruned_loss=0.1076, over 21751.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3302, pruned_loss=0.104, over 4272785.62 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:37:57,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=527634.0, ans=0.125 2023-06-19 18:38:01,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-19 18:38:09,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=527694.0, ans=0.125 2023-06-19 18:38:24,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-19 18:38:27,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527754.0, ans=0.1 2023-06-19 18:38:32,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=527754.0, ans=0.0 2023-06-19 18:38:46,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.68 vs. limit=6.0 2023-06-19 18:38:48,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=527814.0, ans=0.125 2023-06-19 18:39:02,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=527814.0, ans=0.0 2023-06-19 18:39:16,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=527874.0, ans=0.125 2023-06-19 18:39:32,573 INFO [train.py:996] (3/4) Epoch 3, batch 27000, loss[loss=0.3354, simple_loss=0.4075, pruned_loss=0.1317, over 21452.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3302, pruned_loss=0.1014, over 4278019.17 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:39:32,574 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 18:39:49,117 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.2602, simple_loss=0.3579, pruned_loss=0.0813, over 1796401.00 frames. 2023-06-19 18:39:49,117 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-19 18:39:59,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.939e+02 3.560e+02 4.603e+02 8.017e+02, threshold=7.120e+02, percent-clipped=5.0 2023-06-19 18:40:34,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.02 vs. 
limit=22.5 2023-06-19 18:40:58,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=528114.0, ans=0.125 2023-06-19 18:41:08,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528114.0, ans=0.1 2023-06-19 18:41:24,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=528174.0, ans=0.0 2023-06-19 18:41:33,640 INFO [train.py:996] (3/4) Epoch 3, batch 27050, loss[loss=0.2928, simple_loss=0.3504, pruned_loss=0.1176, over 21330.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3328, pruned_loss=0.09895, over 4276223.09 frames. ], batch size: 144, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:42:10,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=528354.0, ans=0.125 2023-06-19 18:43:11,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=528474.0, ans=0.125 2023-06-19 18:43:16,441 INFO [train.py:996] (3/4) Epoch 3, batch 27100, loss[loss=0.2482, simple_loss=0.3406, pruned_loss=0.07786, over 21891.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3353, pruned_loss=0.09911, over 4280149.14 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:43:30,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.773e+02 3.198e+02 4.013e+02 8.418e+02, threshold=6.395e+02, percent-clipped=2.0 2023-06-19 18:44:18,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=528654.0, ans=0.0 2023-06-19 18:44:42,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528774.0, ans=0.1 2023-06-19 18:45:00,947 INFO [train.py:996] (3/4) Epoch 3, batch 27150, loss[loss=0.294, simple_loss=0.3642, pruned_loss=0.1119, over 21279.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3465, pruned_loss=0.1026, over 4280742.60 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:45:02,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-19 18:45:22,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-19 18:45:37,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=528894.0, ans=0.125 2023-06-19 18:45:58,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=528954.0, ans=0.2 2023-06-19 18:46:26,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=529074.0, ans=0.2 2023-06-19 18:46:49,405 INFO [train.py:996] (3/4) Epoch 3, batch 27200, loss[loss=0.325, simple_loss=0.3968, pruned_loss=0.1266, over 21711.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.355, pruned_loss=0.1057, over 4281200.78 frames. 
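[Note on the validation block at batch 27000 above: training pauses periodically for a full pass over a fixed dev set — the same 1796401.00 frames every time, so successive validation losses (here 0.2602) are directly comparable — and the peak CUDA memory is reported afterwards. A compact sketch of such a loop, assuming a hypothetical compute_loss(model, batch) helper that returns a summed loss and a frame count:

    import torch

    def validate(model, valid_dl, device):
        """Accumulate frame-weighted losses over a fixed dev set."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)  # hypothetical helper
                tot_loss += float(loss)
                tot_frames += num_frames
        model.train()
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4g}; max memory {mem_mb}MB")

The "Maximum memory allocated so far" figure (24463MB here, up from earlier in the run) is the high-water mark from torch.cuda.max_memory_allocated, useful for spotting when longer buckets push memory toward the device limit.]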
], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:46:49,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=529134.0, ans=0.0 2023-06-19 18:46:59,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.363e+02 3.936e+02 4.684e+02 8.685e+02, threshold=7.872e+02, percent-clipped=10.0 2023-06-19 18:47:11,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=529194.0, ans=0.0 2023-06-19 18:47:42,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=529254.0, ans=0.02 2023-06-19 18:47:42,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=529254.0, ans=0.125 2023-06-19 18:48:17,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-19 18:48:33,207 INFO [train.py:996] (3/4) Epoch 3, batch 27250, loss[loss=0.3387, simple_loss=0.3832, pruned_loss=0.1471, over 21832.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3585, pruned_loss=0.1111, over 4279441.36 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:48:47,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-19 18:49:48,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529614.0, ans=0.125 2023-06-19 18:49:52,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=529614.0, ans=0.0 2023-06-19 18:50:09,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-19 18:50:28,578 INFO [train.py:996] (3/4) Epoch 3, batch 27300, loss[loss=0.3165, simple_loss=0.3924, pruned_loss=0.1203, over 21336.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3615, pruned_loss=0.1126, over 4281990.76 frames. ], batch size: 549, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:50:28,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=529734.0, ans=0.125 2023-06-19 18:50:43,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.140e+02 3.530e+02 4.339e+02 7.752e+02, threshold=7.060e+02, percent-clipped=0.0 2023-06-19 18:50:44,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=12.0 2023-06-19 18:51:02,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=529794.0, ans=0.2 2023-06-19 18:51:09,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=529854.0, ans=0.0 2023-06-19 18:52:03,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. 
limit=15.0 2023-06-19 18:52:04,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=529974.0, ans=0.2 2023-06-19 18:52:17,057 INFO [train.py:996] (3/4) Epoch 3, batch 27350, loss[loss=0.2731, simple_loss=0.3425, pruned_loss=0.1018, over 21265.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3634, pruned_loss=0.1139, over 4275688.53 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:52:30,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=530034.0, ans=0.0 2023-06-19 18:52:30,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=530034.0, ans=0.125 2023-06-19 18:52:32,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=530094.0, ans=0.1 2023-06-19 18:53:23,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-19 18:53:45,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-19 18:53:58,592 INFO [train.py:996] (3/4) Epoch 3, batch 27400, loss[loss=0.2505, simple_loss=0.3007, pruned_loss=0.1001, over 21407.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3585, pruned_loss=0.113, over 4281320.40 frames. ], batch size: 177, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:54:09,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.048e+02 3.444e+02 4.008e+02 7.916e+02, threshold=6.888e+02, percent-clipped=1.0 2023-06-19 18:54:22,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-19 18:54:24,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530394.0, ans=0.1 2023-06-19 18:54:48,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530454.0, ans=0.1 2023-06-19 18:55:25,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530574.0, ans=0.0 2023-06-19 18:55:39,640 INFO [train.py:996] (3/4) Epoch 3, batch 27450, loss[loss=0.2999, simple_loss=0.3692, pruned_loss=0.1153, over 21707.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.352, pruned_loss=0.1112, over 4277754.46 frames. 
], batch size: 332, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:55:48,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530634.0, ans=0.125 2023-06-19 18:56:03,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=530694.0, ans=0.125 2023-06-19 18:56:42,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=530814.0, ans=0.0 2023-06-19 18:56:52,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530814.0, ans=0.125 2023-06-19 18:57:21,066 INFO [train.py:996] (3/4) Epoch 3, batch 27500, loss[loss=0.2548, simple_loss=0.31, pruned_loss=0.09983, over 21125.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.351, pruned_loss=0.1118, over 4280174.28 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:57:21,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=530934.0, ans=0.05 2023-06-19 18:57:26,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=530934.0, ans=0.0 2023-06-19 18:57:32,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.096e+02 3.719e+02 4.715e+02 7.955e+02, threshold=7.439e+02, percent-clipped=2.0 2023-06-19 18:57:57,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=531054.0, ans=0.0 2023-06-19 18:58:28,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=531114.0, ans=0.125 2023-06-19 18:58:46,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-19 18:59:01,846 INFO [train.py:996] (3/4) Epoch 3, batch 27550, loss[loss=0.2816, simple_loss=0.3259, pruned_loss=0.1186, over 21517.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3448, pruned_loss=0.1087, over 4272702.00 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:59:04,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=531234.0, ans=0.125 2023-06-19 19:00:42,095 INFO [train.py:996] (3/4) Epoch 3, batch 27600, loss[loss=0.2476, simple_loss=0.3007, pruned_loss=0.09726, over 21209.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3364, pruned_loss=0.1062, over 4264218.32 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:00:53,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.652e+02 3.386e+02 4.273e+02 7.001e+02, threshold=6.773e+02, percent-clipped=0.0 2023-06-19 19:01:51,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=531714.0, ans=0.0 2023-06-19 19:02:23,389 INFO [train.py:996] (3/4) Epoch 3, batch 27650, loss[loss=0.2993, simple_loss=0.3484, pruned_loss=0.1251, over 21407.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3312, pruned_loss=0.106, over 4258344.02 frames. 
], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:02:32,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=531834.0, ans=0.125 2023-06-19 19:02:38,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=531894.0, ans=0.2 2023-06-19 19:02:43,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=531894.0, ans=0.05 2023-06-19 19:03:12,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531954.0, ans=0.125 2023-06-19 19:04:05,891 INFO [train.py:996] (3/4) Epoch 3, batch 27700, loss[loss=0.286, simple_loss=0.374, pruned_loss=0.09905, over 19854.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3303, pruned_loss=0.1028, over 4260677.36 frames. ], batch size: 703, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:04:12,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=532134.0, ans=0.09899494936611666 2023-06-19 19:04:17,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.929e+02 3.310e+02 4.348e+02 7.080e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-19 19:05:19,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532314.0, ans=0.1 2023-06-19 19:05:47,563 INFO [train.py:996] (3/4) Epoch 3, batch 27750, loss[loss=0.2383, simple_loss=0.3044, pruned_loss=0.08614, over 21820.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3345, pruned_loss=0.1029, over 4257444.25 frames. ], batch size: 118, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:07:29,262 INFO [train.py:996] (3/4) Epoch 3, batch 27800, loss[loss=0.2849, simple_loss=0.3423, pruned_loss=0.1137, over 21758.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3339, pruned_loss=0.104, over 4271308.11 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:07:40,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.741e+02 3.231e+02 4.040e+02 7.271e+02, threshold=6.461e+02, percent-clipped=1.0 2023-06-19 19:08:52,618 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:09:12,031 INFO [train.py:996] (3/4) Epoch 3, batch 27850, loss[loss=0.2816, simple_loss=0.3481, pruned_loss=0.1075, over 21880.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3336, pruned_loss=0.106, over 4274676.93 frames. 
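[Note on loss[...] vs. tot_loss[...] in the per-batch lines: loss[...] is the current batch while tot_loss[...] appears to be a frame-weighted average accumulated over many recent batches, hence the large "over N frames" counts hovering around 4.2M-4.3M. A small sketch of frame-weighted accumulation, assuming a plain dict-based tracker rather than icefall's actual MetricsTracker:

    class Tracker(dict):
        """Frame-weighted accumulation of loss components."""
        def __add__(self, other):
            out = Tracker(self)
            for k, v in other.items():
                out[k] = out.get(k, 0.0) + v
            return out

        def summary(self):
            frames = self["frames"]
            return {k: v / frames for k, v in self.items() if k != "frames"}

    tot = Tracker()
    # Each batch contributes its summed loss and its frame count:
    for batch_loss_sum, batch_frames in [(0.28 * 21000, 21000), (0.27 * 19000, 19000)]:
        tot = tot + Tracker(loss=batch_loss_sum, frames=batch_frames)
    print(tot.summary())  # frame-weighted average loss over 40000 frames

Weighting by frames rather than by batch makes the running average insensitive to the highly variable batch sizes (98 to 703 in this log) produced by the dynamic bucketing sampler.]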
], batch size: 118, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:09:16,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533034.0, ans=0.1 2023-06-19 19:09:46,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=533094.0, ans=0.05 2023-06-19 19:10:03,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=533154.0, ans=0.05 2023-06-19 19:10:07,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533154.0, ans=0.125 2023-06-19 19:10:18,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533214.0, ans=0.125 2023-06-19 19:10:36,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533214.0, ans=0.125 2023-06-19 19:10:54,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533274.0, ans=0.1 2023-06-19 19:10:58,465 INFO [train.py:996] (3/4) Epoch 3, batch 27900, loss[loss=0.3277, simple_loss=0.4077, pruned_loss=0.1238, over 21513.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3448, pruned_loss=0.1076, over 4271990.56 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:11:08,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=533334.0, ans=0.0 2023-06-19 19:11:16,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.014e+02 3.653e+02 4.966e+02 8.433e+02, threshold=7.306e+02, percent-clipped=7.0 2023-06-19 19:12:16,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=533514.0, ans=15.0 2023-06-19 19:12:37,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533574.0, ans=0.1 2023-06-19 19:12:47,288 INFO [train.py:996] (3/4) Epoch 3, batch 27950, loss[loss=0.2777, simple_loss=0.3447, pruned_loss=0.1054, over 21428.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3436, pruned_loss=0.1031, over 4273047.44 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:13:45,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=533754.0, ans=0.125 2023-06-19 19:14:16,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=533874.0, ans=0.125 2023-06-19 19:14:30,697 INFO [train.py:996] (3/4) Epoch 3, batch 28000, loss[loss=0.2488, simple_loss=0.3201, pruned_loss=0.08877, over 21682.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3401, pruned_loss=0.1003, over 4280660.62 frames. 
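[Note on grad_scale in the batch lines: this is the dynamic loss-scaling factor of mixed-precision training; it moves between values like 16.0 and 32.0 in this stretch because the scaler halves the scale on steps with inf/nan gradients and raises it again after a run of clean steps. The standard PyTorch pattern (torch 1.10-era API, matching this run):

    import torch
    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler(init_scale=32.0)  # illustrative initial scale

    def train_step(model, batch, optimizer):
        optimizer.zero_grad()
        with autocast():
            loss = model(batch)            # forward in fp16 where safe
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)             # unscales grads; skips the step on inf/nan
        scaler.update()                    # halves the scale on overflow, else grows it
        return float(loss), scaler.get_scale()

A scale that keeps collapsing toward small values would signal fp16 instability; the gentle 16 <-> 32 oscillation seen here is normal behavior.]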
], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:14:48,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.713e+02 3.305e+02 4.071e+02 8.310e+02, threshold=6.609e+02, percent-clipped=2.0 2023-06-19 19:15:21,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534054.0, ans=0.0 2023-06-19 19:15:49,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-19 19:16:05,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534174.0, ans=0.1 2023-06-19 19:16:20,839 INFO [train.py:996] (3/4) Epoch 3, batch 28050, loss[loss=0.2323, simple_loss=0.2807, pruned_loss=0.09198, over 21304.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3385, pruned_loss=0.102, over 4290171.13 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:16:44,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-19 19:16:59,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=534294.0, ans=0.0 2023-06-19 19:17:02,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=534354.0, ans=0.0 2023-06-19 19:17:06,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=534354.0, ans=0.015 2023-06-19 19:17:24,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534414.0, ans=0.0 2023-06-19 19:17:45,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=534474.0, ans=0.125 2023-06-19 19:17:55,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=534474.0, ans=0.2 2023-06-19 19:18:03,131 INFO [train.py:996] (3/4) Epoch 3, batch 28100, loss[loss=0.2306, simple_loss=0.2861, pruned_loss=0.08749, over 21210.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3359, pruned_loss=0.1015, over 4280995.36 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:18:21,029 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.997e+02 3.623e+02 4.325e+02 7.130e+02, threshold=7.246e+02, percent-clipped=1.0 2023-06-19 19:19:33,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=534774.0, ans=0.0 2023-06-19 19:19:44,116 INFO [train.py:996] (3/4) Epoch 3, batch 28150, loss[loss=0.2442, simple_loss=0.3043, pruned_loss=0.09201, over 21511.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3298, pruned_loss=0.1019, over 4279089.43 frames. 
], batch size: 391, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:21:02,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=535014.0, ans=0.125 2023-06-19 19:21:08,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535074.0, ans=0.1 2023-06-19 19:21:27,381 INFO [train.py:996] (3/4) Epoch 3, batch 28200, loss[loss=0.311, simple_loss=0.4435, pruned_loss=0.08925, over 19776.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3312, pruned_loss=0.104, over 4270374.08 frames. ], batch size: 702, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:21:38,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-19 19:21:50,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.171e+02 3.933e+02 4.825e+02 1.002e+03, threshold=7.866e+02, percent-clipped=3.0 2023-06-19 19:21:54,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-19 19:22:30,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=535314.0, ans=10.0 2023-06-19 19:22:54,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=535374.0, ans=0.2 2023-06-19 19:23:19,289 INFO [train.py:996] (3/4) Epoch 3, batch 28250, loss[loss=0.2348, simple_loss=0.2945, pruned_loss=0.08753, over 21617.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3353, pruned_loss=0.1075, over 4267734.82 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:24:38,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=535674.0, ans=0.125 2023-06-19 19:25:00,421 INFO [train.py:996] (3/4) Epoch 3, batch 28300, loss[loss=0.19, simple_loss=0.268, pruned_loss=0.05603, over 21410.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3324, pruned_loss=0.1043, over 4268896.17 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:25:13,873 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.857e+02 3.337e+02 4.140e+02 8.167e+02, threshold=6.674e+02, percent-clipped=3.0 2023-06-19 19:25:44,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-19 19:25:48,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=535854.0, ans=0.125 2023-06-19 19:26:11,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=535914.0, ans=0.125 2023-06-19 19:26:35,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.22 vs. limit=6.0 2023-06-19 19:26:42,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=536034.0, ans=0.0 2023-06-19 19:26:43,612 INFO [train.py:996] (3/4) Epoch 3, batch 28350, loss[loss=0.2626, simple_loss=0.321, pruned_loss=0.1021, over 21848.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3288, pruned_loss=0.09833, over 4258436.21 frames. 
], batch size: 372, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:26:43,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=536034.0, ans=0.0 2023-06-19 19:26:53,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=536034.0, ans=0.0 2023-06-19 19:27:41,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=536214.0, ans=0.5 2023-06-19 19:28:25,757 INFO [train.py:996] (3/4) Epoch 3, batch 28400, loss[loss=0.3063, simple_loss=0.3466, pruned_loss=0.133, over 21305.00 frames. ], tot_loss[loss=0.262, simple_loss=0.326, pruned_loss=0.099, over 4257193.23 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:28:27,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=536334.0, ans=0.05 2023-06-19 19:28:44,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.716e+02 3.452e+02 4.220e+02 6.740e+02, threshold=6.905e+02, percent-clipped=2.0 2023-06-19 19:28:56,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=536394.0, ans=0.2 2023-06-19 19:29:04,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536454.0, ans=0.1 2023-06-19 19:29:41,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=536514.0, ans=0.2 2023-06-19 19:29:43,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=536514.0, ans=0.125 2023-06-19 19:30:00,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=536574.0, ans=0.125 2023-06-19 19:30:03,697 INFO [train.py:996] (3/4) Epoch 3, batch 28450, loss[loss=0.3754, simple_loss=0.3989, pruned_loss=0.176, over 21621.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3302, pruned_loss=0.1021, over 4254558.31 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:30:14,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 19:30:49,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=536754.0, ans=0.125 2023-06-19 19:30:53,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536754.0, ans=0.125 2023-06-19 19:31:16,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-19 19:31:42,084 INFO [train.py:996] (3/4) Epoch 3, batch 28500, loss[loss=0.2619, simple_loss=0.3145, pruned_loss=0.1046, over 21142.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3325, pruned_loss=0.1051, over 4265048.95 frames. 
], batch size: 608, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:31:43,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=536934.0, ans=0.125 2023-06-19 19:31:59,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536934.0, ans=0.1 2023-06-19 19:32:01,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.952e+02 3.630e+02 4.610e+02 9.107e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 19:32:51,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=537114.0, ans=0.035 2023-06-19 19:33:14,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=537174.0, ans=0.0 2023-06-19 19:33:30,073 INFO [train.py:996] (3/4) Epoch 3, batch 28550, loss[loss=0.4006, simple_loss=0.4623, pruned_loss=0.1694, over 21428.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3414, pruned_loss=0.1083, over 4272965.46 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:34:29,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=537354.0, ans=0.125 2023-06-19 19:34:49,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=537414.0, ans=0.125 2023-06-19 19:35:06,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-19 19:35:11,040 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:35:13,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-19 19:35:14,023 INFO [train.py:996] (3/4) Epoch 3, batch 28600, loss[loss=0.2604, simple_loss=0.3332, pruned_loss=0.09377, over 21471.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.349, pruned_loss=0.1109, over 4279200.98 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:35:38,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.072e+02 3.686e+02 4.724e+02 8.342e+02, threshold=7.372e+02, percent-clipped=3.0 2023-06-19 19:36:18,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=537714.0, ans=0.125 2023-06-19 19:36:19,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=537714.0, ans=0.0 2023-06-19 19:36:39,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-19 19:36:42,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537774.0, ans=0.1 2023-06-19 19:36:47,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=537774.0, ans=0.125 2023-06-19 19:37:00,661 INFO [train.py:996] (3/4) Epoch 3, batch 28650, loss[loss=0.3256, simple_loss=0.3524, pruned_loss=0.1494, over 21220.00 frames. 
], tot_loss[loss=0.2804, simple_loss=0.3419, pruned_loss=0.1095, over 4279389.70 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:37:57,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=538014.0, ans=0.0 2023-06-19 19:38:04,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-19 19:38:38,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-19 19:38:39,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-19 19:38:41,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-19 19:38:42,131 INFO [train.py:996] (3/4) Epoch 3, batch 28700, loss[loss=0.31, simple_loss=0.3688, pruned_loss=0.1256, over 21765.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3406, pruned_loss=0.1109, over 4271235.82 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:38:44,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=538134.0, ans=0.5 2023-06-19 19:38:59,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538134.0, ans=0.1 2023-06-19 19:39:01,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.971e+02 3.318e+02 4.254e+02 6.959e+02, threshold=6.637e+02, percent-clipped=0.0 2023-06-19 19:39:15,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=538194.0, ans=0.2 2023-06-19 19:39:54,680 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:40:24,361 INFO [train.py:996] (3/4) Epoch 3, batch 28750, loss[loss=0.2652, simple_loss=0.3358, pruned_loss=0.09726, over 21827.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3425, pruned_loss=0.1119, over 4278723.06 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:40:43,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. 
limit=5.0 2023-06-19 19:41:06,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538554.0, ans=0.0 2023-06-19 19:41:49,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=538674.0, ans=0.0 2023-06-19 19:41:50,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=538674.0, ans=0.025 2023-06-19 19:42:04,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=538674.0, ans=0.0 2023-06-19 19:42:07,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=538674.0, ans=0.0 2023-06-19 19:42:11,970 INFO [train.py:996] (3/4) Epoch 3, batch 28800, loss[loss=0.2893, simple_loss=0.3461, pruned_loss=0.1163, over 21608.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3439, pruned_loss=0.1113, over 4276427.28 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:42:31,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.000e+02 3.700e+02 5.247e+02 1.056e+03, threshold=7.400e+02, percent-clipped=15.0 2023-06-19 19:43:05,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=538854.0, ans=0.025 2023-06-19 19:43:57,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=538974.0, ans=0.09899494936611666 2023-06-19 19:43:59,882 INFO [train.py:996] (3/4) Epoch 3, batch 28850, loss[loss=0.2467, simple_loss=0.3127, pruned_loss=0.09033, over 21648.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3451, pruned_loss=0.1125, over 4285027.66 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:44:10,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539034.0, ans=0.125 2023-06-19 19:44:18,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=539094.0, ans=10.0 2023-06-19 19:44:34,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. 
limit=15.0 2023-06-19 19:44:42,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=539154.0, ans=0.125 2023-06-19 19:44:45,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=539154.0, ans=0.1 2023-06-19 19:44:52,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=539154.0, ans=0.125 2023-06-19 19:45:08,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=539214.0, ans=0.125 2023-06-19 19:45:28,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=539274.0, ans=0.0 2023-06-19 19:45:28,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539274.0, ans=0.1 2023-06-19 19:45:43,211 INFO [train.py:996] (3/4) Epoch 3, batch 28900, loss[loss=0.2876, simple_loss=0.3477, pruned_loss=0.1137, over 21731.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3492, pruned_loss=0.1149, over 4283700.40 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:45:50,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-19 19:45:58,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.259e+02 3.859e+02 4.928e+02 8.850e+02, threshold=7.718e+02, percent-clipped=4.0 2023-06-19 19:47:26,866 INFO [train.py:996] (3/4) Epoch 3, batch 28950, loss[loss=0.2507, simple_loss=0.3142, pruned_loss=0.09362, over 20962.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3477, pruned_loss=0.1131, over 4281008.74 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:48:10,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=539754.0, ans=0.125 2023-06-19 19:48:40,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.48 vs. limit=6.0 2023-06-19 19:48:44,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=539814.0, ans=0.1 2023-06-19 19:49:13,082 INFO [train.py:996] (3/4) Epoch 3, batch 29000, loss[loss=0.3032, simple_loss=0.3758, pruned_loss=0.1153, over 21362.00 frames. ], tot_loss[loss=0.289, simple_loss=0.352, pruned_loss=0.113, over 4284529.81 frames. ], batch size: 131, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:49:27,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.902e+02 3.366e+02 4.190e+02 7.172e+02, threshold=6.731e+02, percent-clipped=0.0 2023-06-19 19:49:31,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=539994.0, ans=0.125 2023-06-19 19:50:55,865 INFO [train.py:996] (3/4) Epoch 3, batch 29050, loss[loss=0.2887, simple_loss=0.3443, pruned_loss=0.1166, over 21826.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3519, pruned_loss=0.1128, over 4287010.22 frames. 
], batch size: 441, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:51:08,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=540234.0, ans=0.125 2023-06-19 19:51:35,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=540294.0, ans=0.2 2023-06-19 19:51:37,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=540294.0, ans=0.0 2023-06-19 19:52:13,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-19 19:52:25,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=540474.0, ans=0.05 2023-06-19 19:52:38,108 INFO [train.py:996] (3/4) Epoch 3, batch 29100, loss[loss=0.2558, simple_loss=0.3119, pruned_loss=0.09986, over 21252.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3417, pruned_loss=0.1094, over 4290458.81 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:52:57,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.942e+02 3.636e+02 4.444e+02 9.761e+02, threshold=7.273e+02, percent-clipped=4.0 2023-06-19 19:53:36,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. limit=15.0 2023-06-19 19:53:52,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0 2023-06-19 19:54:18,756 INFO [train.py:996] (3/4) Epoch 3, batch 29150, loss[loss=0.2644, simple_loss=0.3477, pruned_loss=0.09053, over 21780.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3389, pruned_loss=0.108, over 4283219.86 frames. ], batch size: 316, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:54:19,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=540834.0, ans=0.125 2023-06-19 19:54:45,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540894.0, ans=0.1 2023-06-19 19:55:28,591 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:55:58,476 INFO [train.py:996] (3/4) Epoch 3, batch 29200, loss[loss=0.2348, simple_loss=0.295, pruned_loss=0.08729, over 21786.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3347, pruned_loss=0.1071, over 4281978.93 frames. ], batch size: 371, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:56:18,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.074e+02 3.815e+02 4.848e+02 9.248e+02, threshold=7.630e+02, percent-clipped=3.0 2023-06-19 19:56:55,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=541254.0, ans=0.0 2023-06-19 19:57:12,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=541314.0, ans=0.0 2023-06-19 19:57:41,200 INFO [train.py:996] (3/4) Epoch 3, batch 29250, loss[loss=0.2691, simple_loss=0.3531, pruned_loss=0.09257, over 21706.00 frames. 
], tot_loss[loss=0.2703, simple_loss=0.3325, pruned_loss=0.1041, over 4284712.78 frames. ], batch size: 391, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:57:57,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=541434.0, ans=0.1 2023-06-19 19:58:16,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=541494.0, ans=15.0 2023-06-19 19:59:28,961 INFO [train.py:996] (3/4) Epoch 3, batch 29300, loss[loss=0.2477, simple_loss=0.3098, pruned_loss=0.09277, over 21316.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3337, pruned_loss=0.1027, over 4284931.49 frames. ], batch size: 549, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:59:48,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=541794.0, ans=0.07 2023-06-19 19:59:49,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.065e+02 3.693e+02 4.587e+02 7.138e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-19 20:00:01,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=541794.0, ans=0.125 2023-06-19 20:00:08,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=541794.0, ans=0.125 2023-06-19 20:00:36,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=541914.0, ans=0.0 2023-06-19 20:01:03,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=541974.0, ans=0.1 2023-06-19 20:01:11,347 INFO [train.py:996] (3/4) Epoch 3, batch 29350, loss[loss=0.2481, simple_loss=0.2907, pruned_loss=0.1028, over 21470.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3299, pruned_loss=0.1017, over 4284373.67 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:02:01,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=542154.0, ans=0.2 2023-06-19 20:02:25,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=542214.0, ans=0.05 2023-06-19 20:02:30,065 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:02:38,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=542274.0, ans=0.125 2023-06-19 20:02:59,325 INFO [train.py:996] (3/4) Epoch 3, batch 29400, loss[loss=0.2233, simple_loss=0.2832, pruned_loss=0.0817, over 21768.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3286, pruned_loss=0.09902, over 4267442.10 frames. 
], batch size: 282, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:03:21,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.918e+02 3.507e+02 4.489e+02 7.938e+02, threshold=7.015e+02, percent-clipped=2.0 2023-06-19 20:03:56,798 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:03:58,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=542514.0, ans=0.04949747468305833 2023-06-19 20:04:09,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-19 20:04:22,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-19 20:04:24,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=542574.0, ans=0.0 2023-06-19 20:04:27,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=542574.0, ans=0.07 2023-06-19 20:04:37,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=542574.0, ans=0.035 2023-06-19 20:04:43,521 INFO [train.py:996] (3/4) Epoch 3, batch 29450, loss[loss=0.2838, simple_loss=0.3452, pruned_loss=0.1112, over 21744.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3296, pruned_loss=0.09926, over 4267743.02 frames. ], batch size: 247, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:04:57,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=542634.0, ans=0.125 2023-06-19 20:05:34,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=542754.0, ans=0.0 2023-06-19 20:05:42,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542814.0, ans=0.1 2023-06-19 20:05:50,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=542814.0, ans=0.125 2023-06-19 20:05:58,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=542814.0, ans=0.125 2023-06-19 20:06:23,377 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.684e-03 2023-06-19 20:06:29,876 INFO [train.py:996] (3/4) Epoch 3, batch 29500, loss[loss=0.2751, simple_loss=0.3383, pruned_loss=0.1059, over 21797.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3333, pruned_loss=0.1027, over 4274526.98 frames. ], batch size: 351, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:06:45,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.087e+02 3.959e+02 5.251e+02 8.059e+02, threshold=7.918e+02, percent-clipped=6.0 2023-06-19 20:06:55,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=542994.0, ans=0.0 2023-06-19 20:06:58,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.81 vs. 
limit=12.0 2023-06-19 20:07:06,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=12.0 2023-06-19 20:07:45,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-19 20:07:47,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=543174.0, ans=0.125 2023-06-19 20:08:00,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=543174.0, ans=0.0 2023-06-19 20:08:10,022 INFO [train.py:996] (3/4) Epoch 3, batch 29550, loss[loss=0.2892, simple_loss=0.3419, pruned_loss=0.1183, over 21305.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3339, pruned_loss=0.1052, over 4287352.56 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:08:36,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-19 20:09:06,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=543414.0, ans=0.0 2023-06-19 20:09:54,119 INFO [train.py:996] (3/4) Epoch 3, batch 29600, loss[loss=0.3771, simple_loss=0.4423, pruned_loss=0.156, over 21517.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3454, pruned_loss=0.1097, over 4282115.55 frames. ], batch size: 471, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:10:15,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.027e+02 3.599e+02 4.338e+02 7.072e+02, threshold=7.197e+02, percent-clipped=0.0 2023-06-19 20:11:36,783 INFO [train.py:996] (3/4) Epoch 3, batch 29650, loss[loss=0.1835, simple_loss=0.2629, pruned_loss=0.05206, over 21625.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3422, pruned_loss=0.1058, over 4276514.12 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:11:44,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-19 20:12:17,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=543954.0, ans=0.125 2023-06-19 20:12:37,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-19 20:13:07,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544074.0, ans=0.1 2023-06-19 20:13:17,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544074.0, ans=0.125 2023-06-19 20:13:20,488 INFO [train.py:996] (3/4) Epoch 3, batch 29700, loss[loss=0.2732, simple_loss=0.3651, pruned_loss=0.0906, over 21628.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3433, pruned_loss=0.1054, over 4277248.31 frames. 
], batch size: 230, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:13:41,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.649e+02 2.987e+02 3.970e+02 7.304e+02, threshold=5.973e+02, percent-clipped=1.0 2023-06-19 20:13:43,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=544194.0, ans=0.125 2023-06-19 20:14:59,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=544374.0, ans=0.125 2023-06-19 20:15:01,829 INFO [train.py:996] (3/4) Epoch 3, batch 29750, loss[loss=0.2498, simple_loss=0.3316, pruned_loss=0.08397, over 21890.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3466, pruned_loss=0.105, over 4271481.15 frames. ], batch size: 316, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:15:24,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=544494.0, ans=10.0 2023-06-19 20:15:28,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544494.0, ans=0.1 2023-06-19 20:15:33,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544494.0, ans=0.125 2023-06-19 20:15:38,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=544554.0, ans=0.0 2023-06-19 20:15:49,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-19 20:16:06,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544614.0, ans=0.1 2023-06-19 20:16:30,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-19 20:16:35,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-19 20:16:36,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544674.0, ans=0.1 2023-06-19 20:16:47,575 INFO [train.py:996] (3/4) Epoch 3, batch 29800, loss[loss=0.2815, simple_loss=0.3535, pruned_loss=0.1048, over 21861.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3488, pruned_loss=0.1053, over 4270281.49 frames. ], batch size: 371, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:16:59,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=544734.0, ans=0.125 2023-06-19 20:17:03,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=544794.0, ans=0.0 2023-06-19 20:17:05,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.342e+02 4.045e+02 4.978e+02 1.039e+03, threshold=8.090e+02, percent-clipped=10.0 2023-06-19 20:18:04,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=544974.0, ans=0.125 2023-06-19 20:18:22,290 INFO [train.py:996] (3/4) Epoch 3, batch 29850, loss[loss=0.2255, simple_loss=0.2988, pruned_loss=0.07609, over 21820.00 frames. 
], tot_loss[loss=0.273, simple_loss=0.3424, pruned_loss=0.1018, over 4271909.14 frames. ], batch size: 282, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:18:42,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=545094.0, ans=0.125 2023-06-19 20:19:04,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=545154.0, ans=0.0 2023-06-19 20:20:08,631 INFO [train.py:996] (3/4) Epoch 3, batch 29900, loss[loss=0.3548, simple_loss=0.3948, pruned_loss=0.1574, over 21530.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3402, pruned_loss=0.1034, over 4276561.15 frames. ], batch size: 471, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:20:26,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.681e+02 3.110e+02 3.688e+02 5.256e+02, threshold=6.220e+02, percent-clipped=0.0 2023-06-19 20:20:30,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-19 20:21:25,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=545574.0, ans=0.125 2023-06-19 20:21:34,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-19 20:21:46,656 INFO [train.py:996] (3/4) Epoch 3, batch 29950, loss[loss=0.317, simple_loss=0.3725, pruned_loss=0.1308, over 21614.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3441, pruned_loss=0.1077, over 4281238.31 frames. ], batch size: 415, lr: 9.99e-03, grad_scale: 16.0 2023-06-19 20:21:52,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=545634.0, ans=0.2 2023-06-19 20:22:37,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-19 20:22:47,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545754.0, ans=0.1 2023-06-19 20:23:29,416 INFO [train.py:996] (3/4) Epoch 3, batch 30000, loss[loss=0.2466, simple_loss=0.3306, pruned_loss=0.08132, over 21781.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.346, pruned_loss=0.1075, over 4279488.95 frames. ], batch size: 332, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:23:29,416 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 20:23:45,897 INFO [train.py:1028] (3/4) Epoch 3, validation: loss=0.254, simple_loss=0.3581, pruned_loss=0.075, over 1796401.00 frames. 2023-06-19 20:23:45,897 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-19 20:24:05,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=545934.0, ans=0.05 2023-06-19 20:24:06,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.02 vs. 
limit=15.0 2023-06-19 20:24:15,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.901e+02 3.447e+02 4.272e+02 9.118e+02, threshold=6.893e+02, percent-clipped=6.0 2023-06-19 20:25:00,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=546114.0, ans=0.0 2023-06-19 20:25:09,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=546114.0, ans=0.0 2023-06-19 20:25:43,251 INFO [train.py:996] (3/4) Epoch 3, batch 30050, loss[loss=0.298, simple_loss=0.3788, pruned_loss=0.1086, over 21672.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3479, pruned_loss=0.1032, over 4264043.98 frames. ], batch size: 247, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:26:13,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=546294.0, ans=22.5 2023-06-19 20:26:21,238 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:27:23,640 INFO [train.py:996] (3/4) Epoch 3, batch 30100, loss[loss=0.282, simple_loss=0.3263, pruned_loss=0.1188, over 21496.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3463, pruned_loss=0.1027, over 4261938.80 frames. ], batch size: 132, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:27:46,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.970e+02 3.475e+02 4.229e+02 7.609e+02, threshold=6.950e+02, percent-clipped=3.0 2023-06-19 20:28:14,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5 2023-06-19 20:29:04,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=546774.0, ans=0.125 2023-06-19 20:29:11,561 INFO [train.py:996] (3/4) Epoch 3, batch 30150, loss[loss=0.3297, simple_loss=0.378, pruned_loss=0.1407, over 21947.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3433, pruned_loss=0.1055, over 4269746.32 frames. ], batch size: 372, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:29:39,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=546894.0, ans=0.125 2023-06-19 20:30:45,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=547074.0, ans=0.125 2023-06-19 20:30:58,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-19 20:31:01,075 INFO [train.py:996] (3/4) Epoch 3, batch 30200, loss[loss=0.2861, simple_loss=0.3346, pruned_loss=0.1188, over 20114.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3466, pruned_loss=0.1054, over 4271287.91 frames. 
], batch size: 702, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:31:05,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=547134.0, ans=0.125 2023-06-19 20:31:14,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547134.0, ans=0.1 2023-06-19 20:31:18,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=547194.0, ans=0.125 2023-06-19 20:31:20,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.884e+02 3.477e+02 4.360e+02 6.992e+02, threshold=6.953e+02, percent-clipped=1.0 2023-06-19 20:32:17,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-19 20:32:45,958 INFO [train.py:996] (3/4) Epoch 3, batch 30250, loss[loss=0.3586, simple_loss=0.4362, pruned_loss=0.1405, over 21684.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3555, pruned_loss=0.1085, over 4273334.88 frames. ], batch size: 441, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:32:48,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=22.5 2023-06-19 20:33:16,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=547494.0, ans=0.0 2023-06-19 20:34:29,298 INFO [train.py:996] (3/4) Epoch 3, batch 30300, loss[loss=0.2153, simple_loss=0.2775, pruned_loss=0.07656, over 21262.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3528, pruned_loss=0.1085, over 4264710.73 frames. ], batch size: 177, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:34:33,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-19 20:34:34,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=547734.0, ans=0.1 2023-06-19 20:34:51,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=547794.0, ans=0.125 2023-06-19 20:34:52,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.189e+02 3.746e+02 4.977e+02 8.102e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-19 20:35:07,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-19 20:35:13,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.22 vs. 
limit=15.0 2023-06-19 20:35:15,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547854.0, ans=0.1 2023-06-19 20:35:15,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=547854.0, ans=0.125 2023-06-19 20:35:27,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=547854.0, ans=0.125 2023-06-19 20:35:42,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=547914.0, ans=0.0 2023-06-19 20:36:04,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=547974.0, ans=0.125 2023-06-19 20:36:11,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.73 vs. limit=10.0 2023-06-19 20:36:13,941 INFO [train.py:996] (3/4) Epoch 3, batch 30350, loss[loss=0.3407, simple_loss=0.3993, pruned_loss=0.141, over 21855.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3548, pruned_loss=0.111, over 4269666.96 frames. ], batch size: 372, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:36:21,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=548034.0, ans=0.125 2023-06-19 20:37:05,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=548154.0, ans=0.0 2023-06-19 20:37:43,097 INFO [train.py:996] (3/4) Epoch 3, batch 30400, loss[loss=0.2848, simple_loss=0.3174, pruned_loss=0.1261, over 20341.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3499, pruned_loss=0.1093, over 4247330.78 frames. ], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:37:59,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.466e+02 4.166e+02 5.135e+02 9.055e+02, threshold=8.331e+02, percent-clipped=4.0 2023-06-19 20:38:25,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548454.0, ans=0.1 2023-06-19 20:39:04,605 INFO [train.py:996] (3/4) Epoch 3, batch 30450, loss[loss=0.3919, simple_loss=0.4914, pruned_loss=0.1462, over 19825.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3527, pruned_loss=0.1103, over 4190954.47 frames. ], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:41:58,153 INFO [train.py:996] (3/4) Epoch 4, batch 0, loss[loss=0.3242, simple_loss=0.3622, pruned_loss=0.1431, over 21353.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3622, pruned_loss=0.1431, over 21353.00 frames. ], batch size: 473, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:41:58,154 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-19 20:42:15,975 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3698, pruned_loss=0.07632, over 1796401.00 frames. 
2023-06-19 20:42:15,976 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-19 20:42:45,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 5.518e+02 8.293e+02 1.240e+03 3.012e+03, threshold=1.659e+03, percent-clipped=49.0 2023-06-19 20:42:48,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-19 20:42:56,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-19 20:43:16,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=549024.0, ans=0.125 2023-06-19 20:43:24,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=549084.0, ans=0.0 2023-06-19 20:43:43,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=549144.0, ans=0.04949747468305833 2023-06-19 20:43:51,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=549204.0, ans=0.0 2023-06-19 20:43:52,596 INFO [train.py:996] (3/4) Epoch 4, batch 50, loss[loss=0.2686, simple_loss=0.3564, pruned_loss=0.09043, over 21737.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3457, pruned_loss=0.1099, over 942531.60 frames. ], batch size: 332, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:44:23,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=549264.0, ans=0.07 2023-06-19 20:44:24,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-06-19 20:45:05,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=549384.0, ans=0.2 2023-06-19 20:45:33,214 INFO [train.py:996] (3/4) Epoch 4, batch 100, loss[loss=0.3046, simple_loss=0.383, pruned_loss=0.1131, over 21674.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3669, pruned_loss=0.1123, over 1683424.78 frames. ], batch size: 389, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:45:51,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-19 20:46:08,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=15.0 2023-06-19 20:46:08,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.893e+02 3.441e+02 3.943e+02 7.428e+02, threshold=6.883e+02, percent-clipped=0.0 2023-06-19 20:46:43,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=549684.0, ans=0.0 2023-06-19 20:46:59,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=549744.0, ans=0.04949747468305833 2023-06-19 20:47:03,230 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:47:09,261 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:47:13,438 INFO [train.py:996] (3/4) Epoch 4, batch 150, loss[loss=0.2879, simple_loss=0.3655, pruned_loss=0.1052, over 21229.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3667, pruned_loss=0.1092, over 2244396.93 frames. ], batch size: 549, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:48:37,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=550044.0, ans=0.05 2023-06-19 20:48:53,398 INFO [train.py:996] (3/4) Epoch 4, batch 200, loss[loss=0.311, simple_loss=0.3679, pruned_loss=0.127, over 21849.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3623, pruned_loss=0.109, over 2691847.81 frames. ], batch size: 124, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:49:29,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.787e+02 3.303e+02 4.395e+02 6.398e+02, threshold=6.606e+02, percent-clipped=0.0 2023-06-19 20:49:40,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-19 20:49:45,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=550224.0, ans=0.2 2023-06-19 20:49:58,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 20:50:28,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-19 20:50:35,677 INFO [train.py:996] (3/4) Epoch 4, batch 250, loss[loss=0.2698, simple_loss=0.363, pruned_loss=0.08828, over 21640.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3595, pruned_loss=0.1083, over 3043260.34 frames. 
], batch size: 414, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:50:37,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=550404.0, ans=0.2 2023-06-19 20:50:52,731 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:50:54,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=550404.0, ans=0.035 2023-06-19 20:50:56,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=550464.0, ans=0.125 2023-06-19 20:51:36,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=550524.0, ans=0.0 2023-06-19 20:52:19,234 INFO [train.py:996] (3/4) Epoch 4, batch 300, loss[loss=0.3863, simple_loss=0.421, pruned_loss=0.1757, over 21429.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3536, pruned_loss=0.1066, over 3307608.74 frames. ], batch size: 471, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:52:57,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.088e+02 3.665e+02 5.063e+02 1.079e+03, threshold=7.330e+02, percent-clipped=8.0 2023-06-19 20:54:05,654 INFO [train.py:996] (3/4) Epoch 4, batch 350, loss[loss=0.2769, simple_loss=0.3517, pruned_loss=0.1011, over 21748.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3454, pruned_loss=0.105, over 3520958.11 frames. ], batch size: 351, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:54:45,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-19 20:54:48,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-19 20:54:49,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=551124.0, ans=0.0 2023-06-19 20:55:07,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=551124.0, ans=15.0 2023-06-19 20:55:54,932 INFO [train.py:996] (3/4) Epoch 4, batch 400, loss[loss=0.2392, simple_loss=0.3057, pruned_loss=0.08637, over 21662.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3361, pruned_loss=0.103, over 3682512.33 frames. 
], batch size: 298, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:56:03,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=551304.0, ans=0.035 2023-06-19 20:56:12,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=551304.0, ans=0.0 2023-06-19 20:56:26,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.883e+02 3.575e+02 4.503e+02 7.615e+02, threshold=7.149e+02, percent-clipped=2.0 2023-06-19 20:57:04,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=551484.0, ans=0.2 2023-06-19 20:57:15,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=551544.0, ans=0.2 2023-06-19 20:57:24,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=551544.0, ans=0.2 2023-06-19 20:57:37,278 INFO [train.py:996] (3/4) Epoch 4, batch 450, loss[loss=0.2408, simple_loss=0.2921, pruned_loss=0.0947, over 21303.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3338, pruned_loss=0.1009, over 3824031.92 frames. ], batch size: 177, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:59:11,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551844.0, ans=0.1 2023-06-19 20:59:19,191 INFO [train.py:996] (3/4) Epoch 4, batch 500, loss[loss=0.2014, simple_loss=0.3013, pruned_loss=0.0507, over 21644.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3359, pruned_loss=0.09954, over 3927553.18 frames. ], batch size: 263, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:59:53,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 2.948e+02 3.424e+02 4.506e+02 6.960e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-19 20:59:55,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551964.0, ans=0.1 2023-06-19 21:00:21,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=552024.0, ans=0.05 2023-06-19 21:00:48,119 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:01:02,297 INFO [train.py:996] (3/4) Epoch 4, batch 550, loss[loss=0.2718, simple_loss=0.3357, pruned_loss=0.1039, over 21385.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3369, pruned_loss=0.09896, over 4013195.12 frames. ], batch size: 194, lr: 8.58e-03, grad_scale: 16.0 2023-06-19 21:01:58,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-19 21:02:45,449 INFO [train.py:996] (3/4) Epoch 4, batch 600, loss[loss=0.2448, simple_loss=0.3052, pruned_loss=0.09214, over 21813.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3411, pruned_loss=0.1004, over 4076250.66 frames. ], batch size: 118, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:02:47,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=552504.0, ans=0.0 2023-06-19 21:02:47,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0
2023-06-19 21:03:17,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.276e+02 3.981e+02 4.951e+02 8.718e+02, threshold=7.962e+02, percent-clipped=3.0
2023-06-19 21:03:47,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=552684.0, ans=0.125
2023-06-19 21:03:49,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=552684.0, ans=0.125
2023-06-19 21:03:56,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552684.0, ans=0.1
2023-06-19 21:03:58,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5
2023-06-19 21:04:28,049 INFO [train.py:996] (3/4) Epoch 4, batch 650, loss[loss=0.2448, simple_loss=0.3044, pruned_loss=0.09258, over 21683.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3419, pruned_loss=0.1011, over 4130645.11 frames. ], batch size: 333, lr: 8.57e-03, grad_scale: 16.0
2023-06-19 21:04:34,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=552804.0, ans=0.125
2023-06-19 21:04:38,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=552804.0, ans=0.125
2023-06-19 21:04:48,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=552864.0, ans=0.09899494936611666
2023-06-19 21:05:03,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=552864.0, ans=0.0
2023-06-19 21:05:23,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=552924.0, ans=0.125
2023-06-19 21:05:42,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0
2023-06-19 21:05:51,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=553044.0, ans=0.125
2023-06-19 21:05:57,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=553044.0, ans=0.015
2023-06-19 21:06:04,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0
2023-06-19 21:06:10,723 INFO [train.py:996] (3/4) Epoch 4, batch 700, loss[loss=0.2395, simple_loss=0.3142, pruned_loss=0.08245, over 21761.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3396, pruned_loss=0.1008, over 4161181.89 frames. ], batch size: 351, lr: 8.57e-03, grad_scale: 16.0
2023-06-19 21:06:43,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.407e+02 4.015e+02 5.310e+02 1.031e+03, threshold=8.030e+02, percent-clipped=3.0
2023-06-19 21:06:43,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=553164.0, ans=0.0
2023-06-19 21:06:50,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=553164.0, ans=0.125
2023-06-19 21:07:09,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=553224.0, ans=15.0
2023-06-19 21:07:32,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553344.0, ans=0.1
2023-06-19 21:07:43,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=553344.0, ans=0.0
2023-06-19 21:07:52,966 INFO [train.py:996] (3/4) Epoch 4, batch 750, loss[loss=0.2611, simple_loss=0.322, pruned_loss=0.1001, over 21727.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3387, pruned_loss=0.1008, over 4189021.44 frames. ], batch size: 298, lr: 8.57e-03, grad_scale: 16.0
2023-06-19 21:08:15,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=553464.0, ans=0.125
2023-06-19 21:08:17,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=553464.0, ans=0.2
2023-06-19 21:08:23,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=553464.0, ans=0.07
2023-06-19 21:08:38,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=15.0
2023-06-19 21:08:45,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0
2023-06-19 21:09:23,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=553644.0, ans=0.125
2023-06-19 21:09:25,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553644.0, ans=0.1
2023-06-19 21:09:34,448 INFO [train.py:996] (3/4) Epoch 4, batch 800, loss[loss=0.3112, simple_loss=0.3901, pruned_loss=0.1162, over 21414.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3366, pruned_loss=0.1013, over 4198284.41 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0
2023-06-19 21:10:04,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=553764.0, ans=0.125
2023-06-19 21:10:07,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.089e+02 3.541e+02 4.418e+02 8.046e+02, threshold=7.083e+02, percent-clipped=1.0
2023-06-19 21:10:09,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=553764.0, ans=0.0
2023-06-19 21:10:19,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=553824.0, ans=0.125
2023-06-19 21:10:24,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553824.0, ans=0.1
2023-06-19 21:11:11,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=553944.0, ans=0.125
2023-06-19 21:11:18,246 INFO [train.py:996] (3/4) Epoch 4, batch 850, loss[loss=0.2357, simple_loss=0.2848, pruned_loss=0.09332, over 21196.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3338, pruned_loss=0.1005, over 4215053.25 frames. ], batch size: 159, lr: 8.56e-03, grad_scale: 32.0
2023-06-19 21:11:27,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=554004.0, ans=0.0
2023-06-19 21:12:38,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=554184.0, ans=0.04949747468305833
2023-06-19 21:12:51,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=554244.0, ans=0.1
2023-06-19 21:13:02,642 INFO [train.py:996] (3/4) Epoch 4, batch 900, loss[loss=0.2512, simple_loss=0.3143, pruned_loss=0.0941, over 21301.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3329, pruned_loss=0.1008, over 4229085.26 frames. ], batch size: 159, lr: 8.56e-03, grad_scale: 32.0
2023-06-19 21:13:19,262 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:13:40,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.017e+02 3.559e+02 4.118e+02 8.031e+02, threshold=7.118e+02, percent-clipped=1.0
2023-06-19 21:14:01,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=554424.0, ans=10.0
2023-06-19 21:14:01,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0
2023-06-19 21:14:18,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=554484.0, ans=0.125
2023-06-19 21:14:34,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=554544.0, ans=0.125
2023-06-19 21:14:45,087 INFO [train.py:996] (3/4) Epoch 4, batch 950, loss[loss=0.2311, simple_loss=0.3126, pruned_loss=0.07479, over 21771.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3315, pruned_loss=0.09986, over 4240095.68 frames. ], batch size: 282, lr: 8.56e-03, grad_scale: 32.0
2023-06-19 21:14:59,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=554604.0, ans=0.2
2023-06-19 21:15:42,791 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:16:21,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=554844.0, ans=0.125
2023-06-19 21:16:21,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554844.0, ans=0.1
2023-06-19 21:16:27,597 INFO [train.py:996] (3/4) Epoch 4, batch 1000, loss[loss=0.2643, simple_loss=0.3378, pruned_loss=0.09538, over 21895.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3306, pruned_loss=0.0995, over 4247164.62 frames. ], batch size: 316, lr: 8.56e-03, grad_scale: 32.0
2023-06-19 21:16:49,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0
2023-06-19 21:17:12,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.951e+02 3.502e+02 4.133e+02 7.133e+02, threshold=7.004e+02, percent-clipped=1.0
2023-06-19 21:17:57,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=555144.0, ans=0.125
2023-06-19 21:18:15,437 INFO [train.py:996] (3/4) Epoch 4, batch 1050, loss[loss=0.3551, simple_loss=0.4464, pruned_loss=0.1319, over 20831.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3337, pruned_loss=0.1015, over 4257253.63 frames. ], batch size: 608, lr: 8.55e-03, grad_scale: 16.0
2023-06-19 21:18:57,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=555264.0, ans=0.0
2023-06-19 21:19:26,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=555384.0, ans=10.0
2023-06-19 21:19:46,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555444.0, ans=0.1
2023-06-19 21:19:58,899 INFO [train.py:996] (3/4) Epoch 4, batch 1100, loss[loss=0.2603, simple_loss=0.3337, pruned_loss=0.09349, over 21801.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3356, pruned_loss=0.1009, over 4258890.56 frames. ], batch size: 247, lr: 8.55e-03, grad_scale: 16.0
2023-06-19 21:20:39,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.086e+02 3.737e+02 4.742e+02 7.537e+02, threshold=7.473e+02, percent-clipped=2.0
2023-06-19 21:20:46,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=555624.0, ans=0.05
2023-06-19 21:21:43,848 INFO [train.py:996] (3/4) Epoch 4, batch 1150, loss[loss=0.2078, simple_loss=0.2874, pruned_loss=0.06407, over 21401.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3361, pruned_loss=0.1008, over 4268000.04 frames. ], batch size: 211, lr: 8.55e-03, grad_scale: 16.0
2023-06-19 21:22:13,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5
2023-06-19 21:22:58,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555984.0, ans=0.1
2023-06-19 21:23:15,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0
2023-06-19 21:23:27,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=556044.0, ans=0.0
2023-06-19 21:23:33,568 INFO [train.py:996] (3/4) Epoch 4, batch 1200, loss[loss=0.2913, simple_loss=0.3614, pruned_loss=0.1106, over 21741.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3365, pruned_loss=0.101, over 4275771.03 frames. ], batch size: 282, lr: 8.55e-03, grad_scale: 32.0
2023-06-19 21:23:39,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=556104.0, ans=0.125
2023-06-19 21:24:08,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.755e+02 3.087e+02 3.854e+02 6.716e+02, threshold=6.173e+02, percent-clipped=0.0
2023-06-19 21:24:15,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=556224.0, ans=0.0
2023-06-19 21:25:13,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=556344.0, ans=15.0
2023-06-19 21:25:17,515 INFO [train.py:996] (3/4) Epoch 4, batch 1250, loss[loss=0.2847, simple_loss=0.34, pruned_loss=0.1146, over 21888.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3407, pruned_loss=0.1039, over 4282742.47 frames. ], batch size: 118, lr: 8.54e-03, grad_scale: 32.0
2023-06-19 21:26:01,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556524.0, ans=0.125
2023-06-19 21:26:35,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=556584.0, ans=0.0
2023-06-19 21:26:57,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0
2023-06-19 21:27:02,097 INFO [train.py:996] (3/4) Epoch 4, batch 1300, loss[loss=0.2574, simple_loss=0.3383, pruned_loss=0.08824, over 21632.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3411, pruned_loss=0.1047, over 4285395.68 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0
2023-06-19 21:27:36,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.941e+02 3.345e+02 4.151e+02 1.109e+03, threshold=6.689e+02, percent-clipped=6.0
2023-06-19 21:27:39,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0
2023-06-19 21:27:47,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=12.0
2023-06-19 21:28:10,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=556884.0, ans=0.0
2023-06-19 21:28:14,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=556884.0, ans=0.125
2023-06-19 21:28:14,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556884.0, ans=0.125
2023-06-19 21:28:17,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556884.0, ans=0.1
2023-06-19 21:28:17,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=556884.0, ans=0.0
2023-06-19 21:28:44,709 INFO [train.py:996] (3/4) Epoch 4, batch 1350, loss[loss=0.2909, simple_loss=0.3473, pruned_loss=0.1173, over 21855.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3428, pruned_loss=0.1055, over 4289428.73 frames. ], batch size: 124, lr: 8.54e-03, grad_scale: 32.0
2023-06-19 21:29:08,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=557064.0, ans=0.0
2023-06-19 21:29:55,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0
2023-06-19 21:30:03,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=557184.0, ans=0.125
2023-06-19 21:30:13,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=557244.0, ans=0.125
2023-06-19 21:30:27,824 INFO [train.py:996] (3/4) Epoch 4, batch 1400, loss[loss=0.2354, simple_loss=0.2934, pruned_loss=0.08873, over 21897.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3398, pruned_loss=0.1046, over 4277377.31 frames. ], batch size: 373, lr: 8.54e-03, grad_scale: 32.0
2023-06-19 21:30:41,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0
2023-06-19 21:31:03,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.007e+02 3.409e+02 4.154e+02 6.851e+02, threshold=6.817e+02, percent-clipped=4.0
2023-06-19 21:32:18,680 INFO [train.py:996] (3/4) Epoch 4, batch 1450, loss[loss=0.3036, simple_loss=0.3645, pruned_loss=0.1214, over 21579.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3397, pruned_loss=0.1051, over 4267471.50 frames. ], batch size: 471, lr: 8.54e-03, grad_scale: 32.0
2023-06-19 21:32:32,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=557604.0, ans=0.125
2023-06-19 21:32:34,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=557664.0, ans=0.125
2023-06-19 21:32:52,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=557724.0, ans=0.125
2023-06-19 21:33:03,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=557724.0, ans=0.125
2023-06-19 21:33:39,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=557844.0, ans=0.035
2023-06-19 21:33:48,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=557844.0, ans=0.125
2023-06-19 21:34:02,920 INFO [train.py:996] (3/4) Epoch 4, batch 1500, loss[loss=0.324, simple_loss=0.3951, pruned_loss=0.1265, over 21562.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.343, pruned_loss=0.1069, over 4272719.00 frames. ], batch size: 471, lr: 8.53e-03, grad_scale: 32.0
2023-06-19 21:34:33,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.980e+02 3.543e+02 4.143e+02 6.339e+02, threshold=7.086e+02, percent-clipped=0.0
2023-06-19 21:34:33,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=557964.0, ans=0.125
2023-06-19 21:35:03,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0
2023-06-19 21:35:04,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=558084.0, ans=0.125
2023-06-19 21:35:31,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558144.0, ans=0.1
2023-06-19 21:35:49,402 INFO [train.py:996] (3/4) Epoch 4, batch 1550, loss[loss=0.2188, simple_loss=0.3072, pruned_loss=0.06518, over 21791.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3394, pruned_loss=0.1045, over 4275917.82 frames. ], batch size: 371, lr: 8.53e-03, grad_scale: 32.0
2023-06-19 21:35:58,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=558204.0, ans=0.07
2023-06-19 21:36:34,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=558324.0, ans=0.125
2023-06-19 21:36:41,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=558324.0, ans=0.125
2023-06-19 21:37:34,540 INFO [train.py:996] (3/4) Epoch 4, batch 1600, loss[loss=0.316, simple_loss=0.3678, pruned_loss=0.1321, over 21315.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3374, pruned_loss=0.1026, over 4282051.95 frames. ], batch size: 176, lr: 8.53e-03, grad_scale: 32.0
2023-06-19 21:37:44,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=558504.0, ans=0.2
2023-06-19 21:38:13,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=558564.0, ans=0.04949747468305833
2023-06-19 21:38:15,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 2.993e+02 3.386e+02 4.443e+02 8.016e+02, threshold=6.773e+02, percent-clipped=2.0
2023-06-19 21:38:15,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=558564.0, ans=0.125
2023-06-19 21:38:21,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0
2023-06-19 21:38:35,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0
2023-06-19 21:38:44,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=558684.0, ans=0.07
2023-06-19 21:39:10,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=558744.0, ans=0.125
2023-06-19 21:39:19,159 INFO [train.py:996] (3/4) Epoch 4, batch 1650, loss[loss=0.2422, simple_loss=0.3045, pruned_loss=0.08996, over 21297.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3349, pruned_loss=0.101, over 4266606.37 frames. ], batch size: 159, lr: 8.53e-03, grad_scale: 32.0
2023-06-19 21:40:18,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=558924.0, ans=0.0
2023-06-19 21:40:34,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=558984.0, ans=0.125
2023-06-19 21:40:40,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0
2023-06-19 21:40:43,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=558984.0, ans=0.125
2023-06-19 21:41:04,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=559104.0, ans=0.125
2023-06-19 21:41:05,508 INFO [train.py:996] (3/4) Epoch 4, batch 1700, loss[loss=0.2781, simple_loss=0.3393, pruned_loss=0.1085, over 21389.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.339, pruned_loss=0.1031, over 4273751.18 frames. ], batch size: 131, lr: 8.52e-03, grad_scale: 16.0
2023-06-19 21:41:12,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559104.0, ans=0.1
2023-06-19 21:41:37,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0
2023-06-19 21:41:53,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.877e+02 3.357e+02 4.119e+02 6.244e+02, threshold=6.713e+02, percent-clipped=0.0
2023-06-19 21:41:55,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559224.0, ans=0.1
2023-06-19 21:41:58,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=559224.0, ans=0.0
2023-06-19 21:41:59,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=559224.0, ans=0.125
2023-06-19 21:42:38,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=559344.0, ans=0.125
2023-06-19 21:42:41,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=559344.0, ans=0.0
2023-06-19 21:42:56,581 INFO [train.py:996] (3/4) Epoch 4, batch 1750, loss[loss=0.2377, simple_loss=0.3336, pruned_loss=0.07091, over 21609.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3385, pruned_loss=0.1017, over 4268132.50 frames. ], batch size: 441, lr: 8.52e-03, grad_scale: 16.0
2023-06-19 21:43:29,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=559464.0, ans=0.125
2023-06-19 21:43:35,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0
2023-06-19 21:43:59,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=559584.0, ans=0.0
2023-06-19 21:44:44,120 INFO [train.py:996] (3/4) Epoch 4, batch 1800, loss[loss=0.2773, simple_loss=0.3467, pruned_loss=0.104, over 21535.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3345, pruned_loss=0.09814, over 4270243.02 frames. ], batch size: 441, lr: 8.52e-03, grad_scale: 16.0
2023-06-19 21:45:27,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 3.069e+02 3.500e+02 4.481e+02 7.550e+02, threshold=6.999e+02, percent-clipped=2.0
2023-06-19 21:45:32,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=559824.0, ans=0.1
2023-06-19 21:46:06,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=559884.0, ans=0.07
2023-06-19 21:46:34,183 INFO [train.py:996] (3/4) Epoch 4, batch 1850, loss[loss=0.2874, simple_loss=0.3699, pruned_loss=0.1025, over 21669.00 frames. ], tot_loss[loss=0.265, simple_loss=0.336, pruned_loss=0.09701, over 4267809.11 frames. ], batch size: 389, lr: 8.52e-03, grad_scale: 16.0
2023-06-19 21:46:35,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0
2023-06-19 21:46:37,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=560004.0, ans=0.025
2023-06-19 21:46:39,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. limit=10.0
2023-06-19 21:47:02,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0
2023-06-19 21:47:16,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0
2023-06-19 21:47:19,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0
2023-06-19 21:48:17,776 INFO [train.py:996] (3/4) Epoch 4, batch 1900, loss[loss=0.2494, simple_loss=0.3178, pruned_loss=0.09048, over 21460.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3348, pruned_loss=0.09666, over 4271141.26 frames. ], batch size: 194, lr: 8.51e-03, grad_scale: 16.0
2023-06-19 21:48:24,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=560304.0, ans=0.2
2023-06-19 21:48:38,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=560364.0, ans=0.05
2023-06-19 21:48:47,908 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:48:53,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.385e+02 4.219e+02 8.098e+02, threshold=6.770e+02, percent-clipped=2.0
2023-06-19 21:49:35,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=560484.0, ans=0.0
2023-06-19 21:49:51,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0
2023-06-19 21:50:02,250 INFO [train.py:996] (3/4) Epoch 4, batch 1950, loss[loss=0.2462, simple_loss=0.299, pruned_loss=0.09672, over 21449.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3322, pruned_loss=0.09651, over 4272739.77 frames. ], batch size: 389, lr: 8.51e-03, grad_scale: 16.0
2023-06-19 21:50:16,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=560604.0, ans=0.0
2023-06-19 21:51:20,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=560784.0, ans=0.2
2023-06-19 21:51:29,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560844.0, ans=0.1
2023-06-19 21:51:38,917 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:51:39,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0
2023-06-19 21:51:46,738 INFO [train.py:996] (3/4) Epoch 4, batch 2000, loss[loss=0.2458, simple_loss=0.3188, pruned_loss=0.08643, over 21566.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3285, pruned_loss=0.09476, over 4277431.37 frames. ], batch size: 212, lr: 8.51e-03, grad_scale: 32.0
2023-06-19 21:52:24,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.002e+02 3.642e+02 4.364e+02 7.369e+02, threshold=7.284e+02, percent-clipped=1.0
2023-06-19 21:53:13,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0
2023-06-19 21:53:30,404 INFO [train.py:996] (3/4) Epoch 4, batch 2050, loss[loss=0.2706, simple_loss=0.3433, pruned_loss=0.09889, over 17137.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3284, pruned_loss=0.09445, over 4263741.76 frames. ], batch size: 60, lr: 8.51e-03, grad_scale: 16.0
2023-06-19 21:54:17,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561324.0, ans=0.1
2023-06-19 21:54:35,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0
2023-06-19 21:55:20,870 INFO [train.py:996] (3/4) Epoch 4, batch 2100, loss[loss=0.3114, simple_loss=0.3671, pruned_loss=0.1278, over 20702.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3325, pruned_loss=0.09674, over 4272182.13 frames. ], batch size: 607, lr: 8.51e-03, grad_scale: 16.0
2023-06-19 21:55:59,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.198e+02 3.847e+02 4.816e+02 7.420e+02, threshold=7.693e+02, percent-clipped=1.0
2023-06-19 21:57:06,041 INFO [train.py:996] (3/4) Epoch 4, batch 2150, loss[loss=0.2617, simple_loss=0.321, pruned_loss=0.1012, over 21611.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3314, pruned_loss=0.09752, over 4268824.18 frames. ], batch size: 247, lr: 8.50e-03, grad_scale: 16.0
2023-06-19 21:57:18,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561804.0, ans=0.1
2023-06-19 21:57:41,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=561864.0, ans=0.125
2023-06-19 21:57:54,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0
2023-06-19 21:58:22,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=562044.0, ans=0.0
2023-06-19 21:58:26,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=562044.0, ans=15.0
2023-06-19 21:58:46,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=562044.0, ans=0.125
2023-06-19 21:58:50,866 INFO [train.py:996] (3/4) Epoch 4, batch 2200, loss[loss=0.2664, simple_loss=0.3241, pruned_loss=0.1043, over 21787.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3337, pruned_loss=0.09753, over 4270339.95 frames. ], batch size: 112, lr: 8.50e-03, grad_scale: 16.0
2023-06-19 21:58:51,290 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-19 21:59:01,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=562104.0, ans=15.0
2023-06-19 21:59:28,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.061e+02 3.534e+02 4.711e+02 8.653e+02, threshold=7.068e+02, percent-clipped=2.0
2023-06-19 21:59:57,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=562284.0, ans=0.0
2023-06-19 22:00:29,279 INFO [train.py:996] (3/4) Epoch 4, batch 2250, loss[loss=0.2061, simple_loss=0.2669, pruned_loss=0.07262, over 21724.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3305, pruned_loss=0.09608, over 4274210.45 frames. ], batch size: 124, lr: 8.50e-03, grad_scale: 16.0
2023-06-19 22:01:43,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=562644.0, ans=0.125
2023-06-19 22:01:53,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=562644.0, ans=0.2
2023-06-19 22:02:08,498 INFO [train.py:996] (3/4) Epoch 4, batch 2300, loss[loss=0.248, simple_loss=0.2947, pruned_loss=0.1007, over 21636.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3274, pruned_loss=0.09587, over 4258273.80 frames. ], batch size: 416, lr: 8.50e-03, grad_scale: 16.0
2023-06-19 22:02:12,025 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:02:28,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0
2023-06-19 22:02:51,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.061e+02 3.548e+02 4.710e+02 1.046e+03, threshold=7.097e+02, percent-clipped=5.0
2023-06-19 22:03:01,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=562824.0, ans=0.125
2023-06-19 22:03:55,564 INFO [train.py:996] (3/4) Epoch 4, batch 2350, loss[loss=0.3086, simple_loss=0.3619, pruned_loss=0.1277, over 21915.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3229, pruned_loss=0.09618, over 4252676.09 frames. ], batch size: 372, lr: 8.49e-03, grad_scale: 16.0
2023-06-19 22:03:57,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=563004.0, ans=0.125
2023-06-19 22:04:19,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=563064.0, ans=0.125
2023-06-19 22:05:39,325 INFO [train.py:996] (3/4) Epoch 4, batch 2400, loss[loss=0.3104, simple_loss=0.3729, pruned_loss=0.124, over 21456.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.33, pruned_loss=0.1004, over 4261573.85 frames. ], batch size: 131, lr: 8.49e-03, grad_scale: 32.0
2023-06-19 22:06:23,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.093e+02 3.486e+02 4.537e+02 7.539e+02, threshold=6.972e+02, percent-clipped=1.0
2023-06-19 22:06:35,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=563424.0, ans=0.09899494936611666
2023-06-19 22:07:23,972 INFO [train.py:996] (3/4) Epoch 4, batch 2450, loss[loss=0.2241, simple_loss=0.2948, pruned_loss=0.07677, over 21649.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.336, pruned_loss=0.1033, over 4267362.19 frames. ], batch size: 263, lr: 8.49e-03, grad_scale: 32.0
2023-06-19 22:07:59,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=563664.0, ans=0.0
2023-06-19 22:08:04,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=563664.0, ans=0.125
2023-06-19 22:08:04,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=563664.0, ans=0.0
2023-06-19 22:08:09,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0
2023-06-19 22:09:02,297 INFO [train.py:996] (3/4) Epoch 4, batch 2500, loss[loss=0.2772, simple_loss=0.349, pruned_loss=0.1027, over 21670.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3313, pruned_loss=0.1014, over 4257262.08 frames. ], batch size: 298, lr: 8.49e-03, grad_scale: 32.0
2023-06-19 22:09:44,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=564024.0, ans=0.2
2023-06-19 22:09:45,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.863e+02 3.660e+02 4.293e+02 8.660e+02, threshold=7.321e+02, percent-clipped=2.0
2023-06-19 22:09:52,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5
2023-06-19 22:09:59,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5
2023-06-19 22:10:02,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=564084.0, ans=6.0
2023-06-19 22:10:45,486 INFO [train.py:996] (3/4) Epoch 4, batch 2550, loss[loss=0.2481, simple_loss=0.3144, pruned_loss=0.09094, over 21449.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3312, pruned_loss=0.09969, over 4262564.40 frames. ], batch size: 389, lr: 8.49e-03, grad_scale: 32.0
2023-06-19 22:11:36,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0
2023-06-19 22:11:44,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=564384.0, ans=0.125
2023-06-19 22:11:53,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=564384.0, ans=0.125
2023-06-19 22:12:29,206 INFO [train.py:996] (3/4) Epoch 4, batch 2600, loss[loss=0.3154, simple_loss=0.3816, pruned_loss=0.1246, over 21412.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3353, pruned_loss=0.103, over 4265809.89 frames. ], batch size: 131, lr: 8.48e-03, grad_scale: 32.0
2023-06-19 22:12:56,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=564564.0, ans=0.0
2023-06-19 22:13:12,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.048e+02 3.693e+02 4.515e+02 8.330e+02, threshold=7.386e+02, percent-clipped=1.0
2023-06-19 22:13:16,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=564624.0, ans=0.0
2023-06-19 22:13:58,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=564744.0, ans=0.035
2023-06-19 22:14:11,621 INFO [train.py:996] (3/4) Epoch 4, batch 2650, loss[loss=0.2393, simple_loss=0.3262, pruned_loss=0.0762, over 21832.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3366, pruned_loss=0.1045, over 4271207.20 frames. ], batch size: 351, lr: 8.48e-03, grad_scale: 32.0
2023-06-19 22:15:18,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=564984.0, ans=0.125
2023-06-19 22:15:34,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=564984.0, ans=0.125
2023-06-19 22:15:43,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0
2023-06-19 22:15:56,997 INFO [train.py:996] (3/4) Epoch 4, batch 2700, loss[loss=0.2331, simple_loss=0.3059, pruned_loss=0.08013, over 21813.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3346, pruned_loss=0.103, over 4275298.01 frames. ], batch size: 333, lr: 8.48e-03, grad_scale: 32.0
2023-06-19 22:15:58,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0
2023-06-19 22:15:59,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=565104.0, ans=0.5
2023-06-19 22:16:31,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=565164.0, ans=0.125
2023-06-19 22:16:39,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.006e+02 3.494e+02 4.497e+02 9.129e+02, threshold=6.988e+02, percent-clipped=4.0
2023-06-19 22:17:34,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=565344.0, ans=0.0
2023-06-19 22:17:37,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=565344.0, ans=0.0
2023-06-19 22:17:40,793 INFO [train.py:996] (3/4) Epoch 4, batch 2750, loss[loss=0.2191, simple_loss=0.267, pruned_loss=0.0856, over 21216.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.336, pruned_loss=0.1046, over 4282200.40 frames. ], batch size: 159, lr: 8.48e-03, grad_scale: 32.0
2023-06-19 22:18:26,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0
2023-06-19 22:18:36,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=565524.0, ans=0.0
2023-06-19 22:18:57,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0
2023-06-19 22:19:32,207 INFO [train.py:996] (3/4) Epoch 4, batch 2800, loss[loss=0.4023, simple_loss=0.464, pruned_loss=0.1703, over 21426.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3396, pruned_loss=0.1054, over 4277962.75 frames. ], batch size: 507, lr: 8.47e-03, grad_scale: 32.0
2023-06-19 22:20:16,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=565824.0, ans=0.0
2023-06-19 22:20:17,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.042e+02 3.463e+02 4.341e+02 7.810e+02, threshold=6.926e+02, percent-clipped=4.0
2023-06-19 22:20:27,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0
2023-06-19 22:20:30,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=565824.0, ans=0.125
2023-06-19 22:20:38,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=565884.0, ans=0.0
2023-06-19 22:21:13,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=565944.0, ans=0.0
2023-06-19 22:21:16,495 INFO [train.py:996] (3/4) Epoch 4, batch 2850, loss[loss=0.2679, simple_loss=0.3339, pruned_loss=0.101, over 21719.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3397, pruned_loss=0.1058, over 4284495.10 frames. ], batch size: 351, lr: 8.47e-03, grad_scale: 32.0
2023-06-19 22:21:43,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566064.0, ans=0.1
2023-06-19 22:22:04,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=566124.0, ans=0.125
2023-06-19 22:22:12,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566124.0, ans=0.1
2023-06-19 22:22:16,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566184.0, ans=0.1
2023-06-19 22:22:25,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=566184.0, ans=0.125
2023-06-19 22:22:59,606 INFO [train.py:996] (3/4) Epoch 4, batch 2900, loss[loss=0.2662, simple_loss=0.3648, pruned_loss=0.08379, over 20737.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.336, pruned_loss=0.1044, over 4282752.10 frames. ], batch size: 607, lr: 8.47e-03, grad_scale: 16.0
2023-06-19 22:23:06,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=566304.0, ans=0.0
2023-06-19 22:23:38,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0
2023-06-19 22:23:45,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.998e+02 3.695e+02 4.530e+02 8.664e+02, threshold=7.390e+02, percent-clipped=3.0
2023-06-19 22:23:52,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=566424.0, ans=0.125
2023-06-19 22:24:00,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=566484.0, ans=0.125
2023-06-19 22:24:04,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0
2023-06-19 22:24:42,824 INFO [train.py:996] (3/4) Epoch 4, batch 2950, loss[loss=0.2655, simple_loss=0.3089, pruned_loss=0.1111, over 20195.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3375, pruned_loss=0.1042, over 4280214.86 frames. ], batch size: 703, lr: 8.47e-03, grad_scale: 16.0
2023-06-19 22:25:32,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566724.0, ans=0.1
2023-06-19 22:26:25,932 INFO [train.py:996] (3/4) Epoch 4, batch 3000, loss[loss=0.3311, simple_loss=0.3948, pruned_loss=0.1338, over 21454.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3408, pruned_loss=0.1042, over 4287180.01 frames. ], batch size: 131, lr: 8.47e-03, grad_scale: 16.0
2023-06-19 22:26:25,933 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-19 22:26:43,397 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486, over 1796401.00 frames.
2023-06-19 22:26:43,398 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-19 22:26:43,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=566904.0, ans=0.0
2023-06-19 22:27:29,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.065e+02 3.685e+02 4.308e+02 7.209e+02, threshold=7.369e+02, percent-clipped=0.0
2023-06-19 22:27:34,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=567024.0, ans=0.125
2023-06-19 22:27:53,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567084.0, ans=0.1
2023-06-19 22:28:07,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=567084.0, ans=0.04949747468305833
2023-06-19 22:28:27,654 INFO [train.py:996] (3/4) Epoch 4, batch 3050, loss[loss=0.2288, simple_loss=0.3062, pruned_loss=0.07573, over 21417.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3398, pruned_loss=0.1018, over 4284940.33 frames. ], batch size: 194, lr: 8.46e-03, grad_scale: 16.0
2023-06-19 22:29:06,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=567324.0, ans=0.2
2023-06-19 22:29:06,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=567324.0, ans=0.1
2023-06-19 22:29:11,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5
2023-06-19 22:29:45,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=567384.0, ans=0.125
2023-06-19 22:29:51,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=567444.0, ans=0.125
2023-06-19 22:30:12,614 INFO [train.py:996] (3/4) Epoch 4, batch 3100, loss[loss=0.2671, simple_loss=0.3504, pruned_loss=0.0919, over 21773.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3377, pruned_loss=0.1001, over 4283982.22 frames. ], batch size: 414, lr: 8.46e-03, grad_scale: 16.0
2023-06-19 22:30:52,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 3.250e+02 3.985e+02 4.690e+02 7.522e+02, threshold=7.970e+02, percent-clipped=1.0
2023-06-19 22:31:14,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=567624.0, ans=0.1
2023-06-19 22:31:20,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=567684.0, ans=0.125
2023-06-19 22:31:29,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=567684.0, ans=15.0
2023-06-19 22:31:58,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=567744.0, ans=0.0
2023-06-19 22:32:03,243 INFO [train.py:996] (3/4) Epoch 4, batch 3150, loss[loss=0.3301, simple_loss=0.3943, pruned_loss=0.133, over 21588.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3399, pruned_loss=0.1005, over 4275024.51 frames. ], batch size: 414, lr: 8.46e-03, grad_scale: 16.0
2023-06-19 22:32:22,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=567864.0, ans=0.0
2023-06-19 22:33:36,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=568044.0, ans=0.125
2023-06-19 22:33:47,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=568104.0, ans=0.125
2023-06-19 22:33:48,458 INFO [train.py:996] (3/4) Epoch 4, batch 3200, loss[loss=0.2998, simple_loss=0.3609, pruned_loss=0.1193, over 21563.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3437, pruned_loss=0.1018, over 4278161.23 frames. ], batch size: 471, lr: 8.46e-03, grad_scale: 32.0
2023-06-19 22:34:22,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=568164.0, ans=0.0
2023-06-19 22:34:34,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.110e+02 3.486e+02 4.566e+02 1.016e+03, threshold=6.972e+02, percent-clipped=1.0
2023-06-19 22:34:46,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=568224.0, ans=0.125
2023-06-19 22:35:07,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=568284.0, ans=0.125
2023-06-19 22:35:27,844 INFO [train.py:996] (3/4) Epoch 4, batch 3250, loss[loss=0.2225, simple_loss=0.285, pruned_loss=0.07994, over 21465.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3442, pruned_loss=0.1038, over 4283313.25 frames. ], batch size: 230, lr: 8.45e-03, grad_scale: 32.0
2023-06-19 22:35:33,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=568404.0, ans=0.05
2023-06-19 22:35:54,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0
2023-06-19 22:36:03,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=568464.0, ans=0.125
2023-06-19 22:36:55,025 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:37:12,283 INFO [train.py:996] (3/4) Epoch 4, batch 3300, loss[loss=0.3065, simple_loss=0.3595, pruned_loss=0.1268, over 21563.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3401, pruned_loss=0.1043, over 4273775.20 frames. ], batch size: 414, lr: 8.45e-03, grad_scale: 32.0
2023-06-19 22:37:42,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0
2023-06-19 22:37:57,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.879e+02 3.455e+02 4.524e+02 7.307e+02, threshold=6.909e+02, percent-clipped=1.0
2023-06-19 22:38:13,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=568824.0, ans=0.125
2023-06-19 22:38:21,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=568884.0, ans=0.125
2023-06-19 22:38:31,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=568884.0, ans=0.0
2023-06-19 22:38:46,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=568944.0, ans=0.125
2023-06-19 22:38:55,624 INFO [train.py:996] (3/4) Epoch 4, batch 3350, loss[loss=0.2822, simple_loss=0.3405, pruned_loss=0.112, over 21391.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3437, pruned_loss=0.1047, over 4268600.87 frames. ], batch size: 159, lr: 8.45e-03, grad_scale: 32.0
2023-06-19 22:39:21,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=569064.0, ans=0.125
2023-06-19 22:39:57,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5
2023-06-19 22:40:39,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=569244.0, ans=0.0
2023-06-19 22:40:50,337 INFO [train.py:996] (3/4) Epoch 4, batch 3400, loss[loss=0.267, simple_loss=0.3527, pruned_loss=0.09072, over 20912.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.342, pruned_loss=0.1052, over 4269030.78 frames. ], batch size: 607, lr: 8.45e-03, grad_scale: 16.0
2023-06-19 22:41:34,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=569424.0, ans=0.0
2023-06-19 22:41:35,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=569424.0, ans=0.125
2023-06-19 22:41:36,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 3.071e+02 3.735e+02 4.641e+02 6.693e+02, threshold=7.470e+02, percent-clipped=0.0
2023-06-19 22:41:39,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=569424.0, ans=0.035
2023-06-19 22:41:39,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=569424.0, ans=0.2
2023-06-19 22:42:11,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0
2023-06-19 22:42:29,553 INFO [train.py:996] (3/4) Epoch 4, batch 3450, loss[loss=0.318, simple_loss=0.3921, pruned_loss=0.122, over 21775.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3375, pruned_loss=0.1046, over 4267003.88 frames. ], batch size: 316, lr: 8.45e-03, grad_scale: 16.0
2023-06-19 22:43:31,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=569784.0, ans=0.125
2023-06-19 22:43:58,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=569844.0, ans=10.0
2023-06-19 22:44:15,139 INFO [train.py:996] (3/4) Epoch 4, batch 3500, loss[loss=0.3722, simple_loss=0.4138, pruned_loss=0.1653, over 21353.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3476, pruned_loss=0.1089, over 4268690.89 frames. ], batch size: 507, lr: 8.44e-03, grad_scale: 16.0
2023-06-19 22:44:15,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=569904.0, ans=0.0
2023-06-19 22:44:56,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0
2023-06-19 22:44:57,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=569964.0, ans=0.2
2023-06-19 22:45:03,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 3.084e+02 3.677e+02 4.361e+02 8.360e+02, threshold=7.354e+02, percent-clipped=5.0
2023-06-19 22:46:00,030 INFO [train.py:996] (3/4) Epoch 4, batch 3550, loss[loss=0.2702, simple_loss=0.3286, pruned_loss=0.1059, over 21737.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3511, pruned_loss=0.1105, over 4274882.24 frames. ], batch size: 316, lr: 8.44e-03, grad_scale: 16.0
2023-06-19 22:46:20,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0
2023-06-19 22:46:33,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=570264.0, ans=0.2
2023-06-19 22:46:53,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0
2023-06-19 22:47:15,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0
2023-06-19 22:47:16,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=570384.0, ans=0.125
2023-06-19 22:47:51,314 INFO [train.py:996] (3/4) Epoch 4, batch 3600, loss[loss=0.235, simple_loss=0.3183, pruned_loss=0.07586, over 20034.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3439, pruned_loss=0.1089, over 4271726.18 frames. ], batch size: 702, lr: 8.44e-03, grad_scale: 32.0
2023-06-19 22:47:56,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0
2023-06-19 22:48:02,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=570504.0, ans=0.125
2023-06-19 22:48:08,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570564.0, ans=0.1
2023-06-19 22:48:18,588 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-19 22:48:29,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.242e+02 3.839e+02 4.789e+02 9.292e+02, threshold=7.677e+02, percent-clipped=2.0
2023-06-19 22:48:30,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570624.0, ans=0.1
2023-06-19 22:49:01,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=570684.0, ans=0.125
2023-06-19 22:49:34,888 INFO [train.py:996] (3/4) Epoch 4, batch 3650, loss[loss=0.3173, simple_loss=0.3871, pruned_loss=0.1238, over 21698.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3447, pruned_loss=0.1086, over 4271592.96 frames. ], batch size: 441, lr: 8.44e-03, grad_scale: 16.0
2023-06-19 22:49:37,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=570804.0, ans=0.125
2023-06-19 22:50:42,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=570984.0, ans=0.1
2023-06-19 22:50:45,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=571044.0, ans=0.0
2023-06-19 22:51:14,657 INFO [train.py:996] (3/4) Epoch 4, batch 3700, loss[loss=0.2749, simple_loss=0.3401, pruned_loss=0.1048, over 21856.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3434, pruned_loss=0.1076, over 4280653.61 frames. ], batch size: 371, lr: 8.43e-03, grad_scale: 16.0
2023-06-19 22:51:19,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=571104.0, ans=0.125
2023-06-19 22:51:52,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.752e+02 3.200e+02 3.601e+02 6.077e+02, threshold=6.399e+02, percent-clipped=0.0
2023-06-19 22:52:00,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=571224.0, ans=0.2
2023-06-19 22:52:08,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=571224.0, ans=0.125
2023-06-19 22:52:12,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=571284.0, ans=0.125
2023-06-19 22:52:57,315 INFO [train.py:996] (3/4) Epoch 4, batch 3750, loss[loss=0.2976, simple_loss=0.3632, pruned_loss=0.116, over 20770.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3416, pruned_loss=0.1068, over 4283250.33 frames. ], batch size: 607, lr: 8.43e-03, grad_scale: 16.0
2023-06-19 22:53:03,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=571404.0, ans=0.0
2023-06-19 22:54:36,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=571644.0, ans=0.125
2023-06-19 22:54:40,563 INFO [train.py:996] (3/4) Epoch 4, batch 3800, loss[loss=0.2824, simple_loss=0.3445, pruned_loss=0.1102, over 21940.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3407, pruned_loss=0.1065, over 4287961.20 frames. ], batch size: 316, lr: 8.43e-03, grad_scale: 16.0
2023-06-19 22:54:45,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=571704.0, ans=0.1
2023-06-19 22:54:47,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0
2023-06-19 22:54:52,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0
2023-06-19 22:55:27,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.805e+02 3.314e+02 3.828e+02 7.886e+02, threshold=6.628e+02, percent-clipped=5.0
2023-06-19 22:56:23,666 INFO [train.py:996] (3/4) Epoch 4, batch 3850, loss[loss=0.259, simple_loss=0.3127, pruned_loss=0.1027, over 21794.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3393, pruned_loss=0.1068, over 4283818.13 frames. ], batch size: 352, lr: 8.43e-03, grad_scale: 16.0
2023-06-19 22:56:46,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0
2023-06-19 22:56:46,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=572064.0, ans=0.125
2023-06-19 22:57:11,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.71 vs. limit=22.5
2023-06-19 22:57:29,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0
2023-06-19 22:57:29,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0
2023-06-19 22:57:49,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=572244.0, ans=0.0
2023-06-19 22:58:06,820 INFO [train.py:996] (3/4) Epoch 4, batch 3900, loss[loss=0.2851, simple_loss=0.3341, pruned_loss=0.118, over 21856.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3348, pruned_loss=0.1063, over 4286336.93 frames. ], batch size: 414, lr: 8.43e-03, grad_scale: 16.0
2023-06-19 22:58:24,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=572364.0, ans=0.125
2023-06-19 22:58:55,576 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.958e+02 3.677e+02 4.804e+02 9.279e+02, threshold=7.354e+02, percent-clipped=7.0
2023-06-19 22:59:51,706 INFO [train.py:996] (3/4) Epoch 4, batch 3950, loss[loss=0.1969, simple_loss=0.2667, pruned_loss=0.0635, over 21195.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3372, pruned_loss=0.1047, over 4285353.00 frames. ], batch size: 159, lr: 8.42e-03, grad_scale: 16.0
2023-06-19 23:00:29,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=572664.0, ans=0.0
2023-06-19 23:00:47,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=572724.0, ans=0.125
2023-06-19 23:01:01,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=572784.0, ans=0.04949747468305833
2023-06-19 23:01:18,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=572844.0, ans=0.04949747468305833
2023-06-19 23:01:26,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=572844.0, ans=0.0
2023-06-19 23:01:26,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=572844.0, ans=0.125
2023-06-19 23:01:34,291 INFO [train.py:996] (3/4) Epoch 4, batch 4000, loss[loss=0.1716, simple_loss=0.2514, pruned_loss=0.04584, over 21584.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3286, pruned_loss=0.09975, over 4281869.17 frames.
], batch size: 230, lr: 8.42e-03, grad_scale: 32.0 2023-06-19 23:01:38,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=572904.0, ans=0.125 2023-06-19 23:01:47,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=572904.0, ans=15.0 2023-06-19 23:02:22,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.603e+02 3.194e+02 3.964e+02 9.151e+02, threshold=6.387e+02, percent-clipped=1.0 2023-06-19 23:02:24,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=573024.0, ans=0.2 2023-06-19 23:02:36,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=573084.0, ans=0.125 2023-06-19 23:03:18,124 INFO [train.py:996] (3/4) Epoch 4, batch 4050, loss[loss=0.22, simple_loss=0.2981, pruned_loss=0.07091, over 21764.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3277, pruned_loss=0.09757, over 4283218.72 frames. ], batch size: 247, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:03:18,491 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:04:26,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-19 23:04:38,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=573444.0, ans=0.125 2023-06-19 23:04:57,137 INFO [train.py:996] (3/4) Epoch 4, batch 4100, loss[loss=0.249, simple_loss=0.3228, pruned_loss=0.08759, over 21625.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3293, pruned_loss=0.09857, over 4290706.52 frames. ], batch size: 230, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:04:57,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=573504.0, ans=0.1 2023-06-19 23:05:46,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.845e+02 3.334e+02 4.002e+02 7.963e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-19 23:06:40,756 INFO [train.py:996] (3/4) Epoch 4, batch 4150, loss[loss=0.1955, simple_loss=0.2857, pruned_loss=0.05266, over 21360.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3268, pruned_loss=0.09447, over 4284986.28 frames. ], batch size: 194, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:06:56,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=573804.0, ans=0.125 2023-06-19 23:07:12,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=8.0 2023-06-19 23:08:03,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=573984.0, ans=0.05 2023-06-19 23:08:03,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573984.0, ans=0.1 2023-06-19 23:08:25,537 INFO [train.py:996] (3/4) Epoch 4, batch 4200, loss[loss=0.3142, simple_loss=0.393, pruned_loss=0.1177, over 21539.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3285, pruned_loss=0.09487, over 4290182.14 frames. 
], batch size: 441, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:08:25,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=574104.0, ans=0.125 2023-06-19 23:08:49,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574104.0, ans=0.1 2023-06-19 23:08:54,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-19 23:09:26,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.689e+02 3.288e+02 4.795e+02 7.055e+02, threshold=6.577e+02, percent-clipped=3.0 2023-06-19 23:09:38,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574284.0, ans=0.125 2023-06-19 23:09:49,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=574284.0, ans=0.0 2023-06-19 23:09:53,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-19 23:10:16,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-19 23:10:19,736 INFO [train.py:996] (3/4) Epoch 4, batch 4250, loss[loss=0.3758, simple_loss=0.4295, pruned_loss=0.1611, over 21450.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3345, pruned_loss=0.09689, over 4287103.64 frames. ], batch size: 471, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:10:22,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=574404.0, ans=0.0 2023-06-19 23:11:06,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=574524.0, ans=0.2 2023-06-19 23:11:22,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574584.0, ans=0.1 2023-06-19 23:11:37,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=574644.0, ans=0.1 2023-06-19 23:11:57,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=574644.0, ans=0.0 2023-06-19 23:12:06,299 INFO [train.py:996] (3/4) Epoch 4, batch 4300, loss[loss=0.2408, simple_loss=0.3234, pruned_loss=0.07913, over 21420.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3414, pruned_loss=0.09961, over 4279924.28 frames. ], batch size: 131, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:12:30,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=574704.0, ans=0.0 2023-06-19 23:12:57,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.886e+02 3.415e+02 4.755e+02 8.316e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 23:14:00,194 INFO [train.py:996] (3/4) Epoch 4, batch 4350, loss[loss=0.248, simple_loss=0.294, pruned_loss=0.101, over 21692.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3393, pruned_loss=0.09805, over 4275535.21 frames. 
], batch size: 232, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:14:25,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=575064.0, ans=0.5 2023-06-19 23:14:55,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=575184.0, ans=0.125 2023-06-19 23:15:15,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=575244.0, ans=0.125 2023-06-19 23:15:20,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=575244.0, ans=0.125 2023-06-19 23:15:40,575 INFO [train.py:996] (3/4) Epoch 4, batch 4400, loss[loss=0.2384, simple_loss=0.3196, pruned_loss=0.07858, over 21283.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3353, pruned_loss=0.09783, over 4272875.24 frames. ], batch size: 176, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:15:43,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-19 23:16:01,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=12.0 2023-06-19 23:16:06,651 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:16:08,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=575364.0, ans=0.0 2023-06-19 23:16:25,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=575424.0, ans=0.035 2023-06-19 23:16:26,297 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.820e+02 3.325e+02 4.010e+02 7.079e+02, threshold=6.649e+02, percent-clipped=1.0 2023-06-19 23:16:50,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=575484.0, ans=0.125 2023-06-19 23:16:52,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=575484.0, ans=0.0 2023-06-19 23:17:02,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=575544.0, ans=0.125 2023-06-19 23:17:25,238 INFO [train.py:996] (3/4) Epoch 4, batch 4450, loss[loss=0.2584, simple_loss=0.3447, pruned_loss=0.08607, over 21613.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3422, pruned_loss=0.09957, over 4268579.47 frames. ], batch size: 230, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:17:34,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. 
limit=15.0 2023-06-19 23:17:38,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=575604.0, ans=0.015 2023-06-19 23:17:50,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=575664.0, ans=0.1 2023-06-19 23:18:24,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=575784.0, ans=0.125 2023-06-19 23:19:07,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.07 vs. limit=5.0 2023-06-19 23:19:08,155 INFO [train.py:996] (3/4) Epoch 4, batch 4500, loss[loss=0.2751, simple_loss=0.3497, pruned_loss=0.1003, over 21257.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3432, pruned_loss=0.102, over 4276840.46 frames. ], batch size: 176, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:19:09,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-19 23:20:01,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.953e+02 3.681e+02 4.394e+02 8.500e+02, threshold=7.362e+02, percent-clipped=5.0 2023-06-19 23:20:03,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2023-06-19 23:20:37,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-19 23:20:53,465 INFO [train.py:996] (3/4) Epoch 4, batch 4550, loss[loss=0.273, simple_loss=0.3467, pruned_loss=0.09971, over 21704.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3465, pruned_loss=0.1026, over 4275901.38 frames. ], batch size: 298, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:21:41,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=576324.0, ans=0.125 2023-06-19 23:22:33,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-19 23:22:38,807 INFO [train.py:996] (3/4) Epoch 4, batch 4600, loss[loss=0.2632, simple_loss=0.3389, pruned_loss=0.09376, over 21725.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3485, pruned_loss=0.104, over 4279184.46 frames. 
], batch size: 414, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:22:50,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=576504.0, ans=0.5 2023-06-19 23:23:11,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=576564.0, ans=0.125 2023-06-19 23:23:33,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=576624.0, ans=0.04949747468305833 2023-06-19 23:23:34,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.974e+02 3.353e+02 4.220e+02 8.842e+02, threshold=6.706e+02, percent-clipped=3.0 2023-06-19 23:23:53,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576684.0, ans=0.1 2023-06-19 23:24:08,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=576744.0, ans=0.125 2023-06-19 23:24:21,978 INFO [train.py:996] (3/4) Epoch 4, batch 4650, loss[loss=0.3006, simple_loss=0.3745, pruned_loss=0.1133, over 19885.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3423, pruned_loss=0.103, over 4279293.84 frames. ], batch size: 702, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:25:19,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-19 23:25:25,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=576984.0, ans=0.125 2023-06-19 23:26:00,205 INFO [train.py:996] (3/4) Epoch 4, batch 4700, loss[loss=0.2224, simple_loss=0.2788, pruned_loss=0.08301, over 21594.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3333, pruned_loss=0.1001, over 4282554.67 frames. ], batch size: 247, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:26:27,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-19 23:26:56,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.084e+02 3.825e+02 4.515e+02 8.128e+02, threshold=7.651e+02, percent-clipped=5.0 2023-06-19 23:27:19,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=577344.0, ans=0.125 2023-06-19 23:27:42,039 INFO [train.py:996] (3/4) Epoch 4, batch 4750, loss[loss=0.2795, simple_loss=0.3267, pruned_loss=0.1162, over 21470.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3272, pruned_loss=0.0997, over 4274481.01 frames. 
], batch size: 144, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:27:44,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=577404.0, ans=0.125 2023-06-19 23:28:06,991 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:28:54,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=577584.0, ans=0.0 2023-06-19 23:29:06,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=577644.0, ans=0.125 2023-06-19 23:29:27,849 INFO [train.py:996] (3/4) Epoch 4, batch 4800, loss[loss=0.2531, simple_loss=0.3023, pruned_loss=0.1019, over 21512.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3268, pruned_loss=0.09915, over 4276218.14 frames. ], batch size: 194, lr: 8.39e-03, grad_scale: 32.0 2023-06-19 23:29:43,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=577704.0, ans=0.125 2023-06-19 23:30:08,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=577764.0, ans=0.0 2023-06-19 23:30:25,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.016e+02 3.604e+02 4.520e+02 9.140e+02, threshold=7.207e+02, percent-clipped=2.0 2023-06-19 23:31:11,084 INFO [train.py:996] (3/4) Epoch 4, batch 4850, loss[loss=0.288, simple_loss=0.3917, pruned_loss=0.09213, over 21289.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3272, pruned_loss=0.09816, over 4275682.72 frames. ], batch size: 548, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:31:32,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=578064.0, ans=0.02 2023-06-19 23:31:37,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578064.0, ans=0.1 2023-06-19 23:32:16,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-19 23:32:50,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=578244.0, ans=0.2 2023-06-19 23:32:53,795 INFO [train.py:996] (3/4) Epoch 4, batch 4900, loss[loss=0.3147, simple_loss=0.3858, pruned_loss=0.1218, over 21477.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.33, pruned_loss=0.1008, over 4282957.41 frames. ], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:33:14,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=578364.0, ans=0.2 2023-06-19 23:33:50,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 3.075e+02 3.679e+02 4.552e+02 8.349e+02, threshold=7.359e+02, percent-clipped=3.0 2023-06-19 23:33:52,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578424.0, ans=0.1 2023-06-19 23:34:04,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. 
limit=6.0 2023-06-19 23:34:10,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=578484.0, ans=0.125 2023-06-19 23:34:37,026 INFO [train.py:996] (3/4) Epoch 4, batch 4950, loss[loss=0.2659, simple_loss=0.3635, pruned_loss=0.08422, over 21447.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.332, pruned_loss=0.09844, over 4276085.29 frames. ], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:35:41,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578784.0, ans=0.1 2023-06-19 23:35:43,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.80 vs. limit=22.5 2023-06-19 23:35:44,482 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:35:58,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=578784.0, ans=0.125 2023-06-19 23:36:05,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-19 23:36:16,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=578844.0, ans=0.125 2023-06-19 23:36:19,052 INFO [train.py:996] (3/4) Epoch 4, batch 5000, loss[loss=0.2491, simple_loss=0.3197, pruned_loss=0.0892, over 21521.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3303, pruned_loss=0.09384, over 4280417.49 frames. ], batch size: 212, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:36:33,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=578904.0, ans=0.125 2023-06-19 23:37:15,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.766e+02 3.352e+02 4.422e+02 7.725e+02, threshold=6.703e+02, percent-clipped=2.0 2023-06-19 23:37:48,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=579144.0, ans=0.125 2023-06-19 23:37:51,660 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:38:00,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-19 23:38:01,087 INFO [train.py:996] (3/4) Epoch 4, batch 5050, loss[loss=0.2557, simple_loss=0.3211, pruned_loss=0.09516, over 21887.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3329, pruned_loss=0.09691, over 4288787.79 frames. ], batch size: 351, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:38:08,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-19 23:38:53,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. 
limit=15.0 2023-06-19 23:38:54,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=579324.0, ans=0.2 2023-06-19 23:39:43,694 INFO [train.py:996] (3/4) Epoch 4, batch 5100, loss[loss=0.2622, simple_loss=0.3275, pruned_loss=0.09846, over 21437.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3312, pruned_loss=0.09761, over 4284563.92 frames. ], batch size: 131, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:40:05,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=579564.0, ans=0.2 2023-06-19 23:40:06,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=579564.0, ans=0.125 2023-06-19 23:40:39,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.860e+02 3.323e+02 3.950e+02 6.797e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 23:41:04,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-19 23:41:17,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=579744.0, ans=0.1 2023-06-19 23:41:26,660 INFO [train.py:996] (3/4) Epoch 4, batch 5150, loss[loss=0.278, simple_loss=0.3312, pruned_loss=0.1124, over 21900.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3316, pruned_loss=0.09927, over 4289574.06 frames. ], batch size: 316, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:41:58,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-19 23:42:26,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=579924.0, ans=0.125 2023-06-19 23:43:16,510 INFO [train.py:996] (3/4) Epoch 4, batch 5200, loss[loss=0.2976, simple_loss=0.3897, pruned_loss=0.1027, over 21781.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3329, pruned_loss=0.09967, over 4280895.47 frames. ], batch size: 332, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:44:10,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.847e+02 3.708e+02 4.367e+02 7.934e+02, threshold=7.417e+02, percent-clipped=2.0 2023-06-19 23:44:11,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=580224.0, ans=0.0 2023-06-19 23:44:15,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-19 23:44:21,378 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.533e-03 2023-06-19 23:44:26,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580284.0, ans=0.1 2023-06-19 23:44:39,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=580344.0, ans=0.125 2023-06-19 23:45:01,059 INFO [train.py:996] (3/4) Epoch 4, batch 5250, loss[loss=0.2299, simple_loss=0.3111, pruned_loss=0.07435, over 21601.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3359, pruned_loss=0.09756, over 4271513.74 frames. 
], batch size: 230, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:45:04,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=580404.0, ans=0.2 2023-06-19 23:46:07,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=580584.0, ans=0.125 2023-06-19 23:46:41,703 INFO [train.py:996] (3/4) Epoch 4, batch 5300, loss[loss=0.2891, simple_loss=0.3378, pruned_loss=0.1203, over 21368.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3362, pruned_loss=0.09821, over 4271560.98 frames. ], batch size: 144, lr: 8.36e-03, grad_scale: 32.0 2023-06-19 23:46:53,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=580704.0, ans=0.035 2023-06-19 23:46:55,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=580704.0, ans=0.125 2023-06-19 23:46:56,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=580704.0, ans=0.125 2023-06-19 23:47:00,212 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:47:07,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=580764.0, ans=0.125 2023-06-19 23:47:33,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=580824.0, ans=0.0 2023-06-19 23:47:34,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.860e+02 3.383e+02 4.031e+02 8.552e+02, threshold=6.767e+02, percent-clipped=2.0 2023-06-19 23:48:02,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=580944.0, ans=0.125 2023-06-19 23:48:23,140 INFO [train.py:996] (3/4) Epoch 4, batch 5350, loss[loss=0.2761, simple_loss=0.3354, pruned_loss=0.1084, over 21780.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3378, pruned_loss=0.09975, over 4281375.73 frames. ], batch size: 112, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:48:43,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=581064.0, ans=0.0 2023-06-19 23:48:43,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=581064.0, ans=0.125 2023-06-19 23:48:43,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=581064.0, ans=0.0 2023-06-19 23:48:49,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=581064.0, ans=0.125 2023-06-19 23:49:02,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. 
limit=6.0 2023-06-19 23:49:27,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581184.0, ans=0.1 2023-06-19 23:49:34,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=581184.0, ans=0.125 2023-06-19 23:50:10,522 INFO [train.py:996] (3/4) Epoch 4, batch 5400, loss[loss=0.2326, simple_loss=0.3021, pruned_loss=0.08157, over 21397.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3363, pruned_loss=0.101, over 4279173.05 frames. ], batch size: 194, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:50:10,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=581304.0, ans=0.1 2023-06-19 23:50:47,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-19 23:51:03,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-19 23:51:04,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.139e+02 3.601e+02 4.345e+02 9.321e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-19 23:51:26,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-19 23:51:47,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=581544.0, ans=0.0 2023-06-19 23:51:54,848 INFO [train.py:996] (3/4) Epoch 4, batch 5450, loss[loss=0.2716, simple_loss=0.3601, pruned_loss=0.09156, over 21019.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3351, pruned_loss=0.09825, over 4282511.39 frames. ], batch size: 143, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:52:44,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=581724.0, ans=0.125 2023-06-19 23:53:34,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=581844.0, ans=0.125 2023-06-19 23:53:44,849 INFO [train.py:996] (3/4) Epoch 4, batch 5500, loss[loss=0.2382, simple_loss=0.332, pruned_loss=0.07217, over 21727.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.34, pruned_loss=0.09462, over 4272400.59 frames. ], batch size: 332, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:53:47,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-19 23:54:23,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=582024.0, ans=0.2 2023-06-19 23:54:33,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.703e+02 3.148e+02 3.931e+02 6.952e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-19 23:55:22,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=582144.0, ans=0.5 2023-06-19 23:55:30,365 INFO [train.py:996] (3/4) Epoch 4, batch 5550, loss[loss=0.2712, simple_loss=0.3782, pruned_loss=0.0821, over 21161.00 frames. 
], tot_loss[loss=0.2618, simple_loss=0.3394, pruned_loss=0.09207, over 4276487.63 frames. ], batch size: 548, lr: 8.35e-03, grad_scale: 16.0 2023-06-19 23:57:00,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=582444.0, ans=0.2 2023-06-19 23:57:19,412 INFO [train.py:996] (3/4) Epoch 4, batch 5600, loss[loss=0.2945, simple_loss=0.3794, pruned_loss=0.1048, over 21785.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3369, pruned_loss=0.08851, over 4276759.27 frames. ], batch size: 316, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:58:10,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=582624.0, ans=0.2 2023-06-19 23:58:12,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.737e+02 3.310e+02 4.006e+02 7.274e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-19 23:58:31,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=582684.0, ans=0.125 2023-06-19 23:58:42,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=582744.0, ans=0.2 2023-06-19 23:58:50,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=582744.0, ans=0.125 2023-06-19 23:58:55,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582744.0, ans=0.1 2023-06-19 23:58:55,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-19 23:59:00,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-19 23:59:01,241 INFO [train.py:996] (3/4) Epoch 4, batch 5650, loss[loss=0.3158, simple_loss=0.367, pruned_loss=0.1323, over 21862.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.341, pruned_loss=0.09269, over 4277940.19 frames. ], batch size: 371, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:59:28,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-19 23:59:54,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=582924.0, ans=0.2 2023-06-20 00:00:23,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=582984.0, ans=0.0 2023-06-20 00:00:44,940 INFO [train.py:996] (3/4) Epoch 4, batch 5700, loss[loss=0.3153, simple_loss=0.3784, pruned_loss=0.1261, over 21491.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3399, pruned_loss=0.0952, over 4283863.15 frames. ], batch size: 471, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:01:03,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=583104.0, ans=0.2 2023-06-20 00:01:09,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.42 vs. 
limit=22.5 2023-06-20 00:01:22,029 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:01:38,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.072e+02 3.794e+02 4.480e+02 7.487e+02, threshold=7.588e+02, percent-clipped=5.0 2023-06-20 00:02:14,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=583344.0, ans=0.0 2023-06-20 00:02:29,586 INFO [train.py:996] (3/4) Epoch 4, batch 5750, loss[loss=0.2407, simple_loss=0.3182, pruned_loss=0.08164, over 21250.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3348, pruned_loss=0.09132, over 4287292.33 frames. ], batch size: 159, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:02:30,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=583404.0, ans=0.125 2023-06-20 00:03:14,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-20 00:03:18,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-20 00:04:13,588 INFO [train.py:996] (3/4) Epoch 4, batch 5800, loss[loss=0.2916, simple_loss=0.3789, pruned_loss=0.1022, over 21708.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3348, pruned_loss=0.08951, over 4289002.17 frames. ], batch size: 351, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:04:25,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=583704.0, ans=0.025 2023-06-20 00:04:25,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=583704.0, ans=0.125 2023-06-20 00:04:33,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=583764.0, ans=0.125 2023-06-20 00:05:02,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 2.603e+02 3.108e+02 3.966e+02 5.463e+02, threshold=6.216e+02, percent-clipped=0.0 2023-06-20 00:05:08,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-20 00:05:21,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=583884.0, ans=0.125 2023-06-20 00:05:50,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=583944.0, ans=0.0 2023-06-20 00:05:53,785 INFO [train.py:996] (3/4) Epoch 4, batch 5850, loss[loss=0.1772, simple_loss=0.2563, pruned_loss=0.04907, over 21109.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3307, pruned_loss=0.08493, over 4289411.43 frames. ], batch size: 143, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:06:04,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=584004.0, ans=0.125 2023-06-20 00:06:25,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. 
limit=15.0 2023-06-20 00:06:26,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=584064.0, ans=0.0 2023-06-20 00:06:44,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584124.0, ans=0.1 2023-06-20 00:07:10,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-20 00:07:36,916 INFO [train.py:996] (3/4) Epoch 4, batch 5900, loss[loss=0.1959, simple_loss=0.2804, pruned_loss=0.05574, over 21783.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.322, pruned_loss=0.07876, over 4291042.57 frames. ], batch size: 298, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:08:12,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=584364.0, ans=0.0 2023-06-20 00:08:29,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 2.549e+02 3.049e+02 3.679e+02 6.495e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 00:09:02,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-20 00:09:28,635 INFO [train.py:996] (3/4) Epoch 4, batch 5950, loss[loss=0.2748, simple_loss=0.3235, pruned_loss=0.113, over 21540.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3231, pruned_loss=0.08433, over 4291555.38 frames. ], batch size: 441, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:09:38,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584604.0, ans=0.1 2023-06-20 00:09:40,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584604.0, ans=0.1 2023-06-20 00:09:57,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584664.0, ans=0.125 2023-06-20 00:11:04,357 INFO [train.py:996] (3/4) Epoch 4, batch 6000, loss[loss=0.2274, simple_loss=0.2859, pruned_loss=0.08441, over 21881.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3197, pruned_loss=0.08755, over 4276656.72 frames. ], batch size: 98, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:11:04,358 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 00:11:13,497 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.3379, 3.1054, 1.6396, 1.5013], device='cuda:3') 2023-06-20 00:11:19,173 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5318, 1.8738, 1.4833, 2.2129, 0.9245, 2.1518, 1.7394, 1.6161], device='cuda:3') 2023-06-20 00:11:26,260 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2686, simple_loss=0.3646, pruned_loss=0.08628, over 1796401.00 frames. 2023-06-20 00:11:26,261 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 00:12:19,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.849e+02 3.273e+02 3.960e+02 7.085e+02, threshold=6.546e+02, percent-clipped=4.0 2023-06-20 00:13:09,979 INFO [train.py:996] (3/4) Epoch 4, batch 6050, loss[loss=0.231, simple_loss=0.2829, pruned_loss=0.08953, over 21621.00 frames. 
], tot_loss[loss=0.2461, simple_loss=0.3142, pruned_loss=0.08901, over 4276242.85 frames. ], batch size: 247, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:13:12,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=585204.0, ans=0.125 2023-06-20 00:13:39,576 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:14:50,518 INFO [train.py:996] (3/4) Epoch 4, batch 6100, loss[loss=0.2703, simple_loss=0.3366, pruned_loss=0.102, over 21768.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.314, pruned_loss=0.08789, over 4281060.90 frames. ], batch size: 389, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:14:57,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-20 00:15:02,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=585504.0, ans=0.0 2023-06-20 00:15:07,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=585564.0, ans=0.0 2023-06-20 00:15:42,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=585624.0, ans=0.125 2023-06-20 00:15:43,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.539e+02 3.084e+02 3.751e+02 6.044e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:16:18,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585744.0, ans=0.1 2023-06-20 00:16:18,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=585744.0, ans=0.125 2023-06-20 00:16:18,565 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:16:32,402 INFO [train.py:996] (3/4) Epoch 4, batch 6150, loss[loss=0.2444, simple_loss=0.3128, pruned_loss=0.08804, over 21642.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3175, pruned_loss=0.09112, over 4277175.30 frames. ], batch size: 391, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:16:36,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=8.0 2023-06-20 00:17:11,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=585924.0, ans=0.125 2023-06-20 00:17:24,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=585924.0, ans=0.125 2023-06-20 00:17:44,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=585984.0, ans=0.0 2023-06-20 00:17:59,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=586044.0, ans=0.1 2023-06-20 00:18:02,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586044.0, ans=0.1 2023-06-20 00:18:08,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. 
limit=15.0 2023-06-20 00:18:09,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=586044.0, ans=0.1 2023-06-20 00:18:14,445 INFO [train.py:996] (3/4) Epoch 4, batch 6200, loss[loss=0.2825, simple_loss=0.3372, pruned_loss=0.1139, over 21336.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3205, pruned_loss=0.09203, over 4280619.20 frames. ], batch size: 159, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:19:08,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.711e+02 3.284e+02 3.994e+02 6.399e+02, threshold=6.568e+02, percent-clipped=2.0 2023-06-20 00:19:59,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-20 00:19:59,685 INFO [train.py:996] (3/4) Epoch 4, batch 6250, loss[loss=0.3232, simple_loss=0.4112, pruned_loss=0.1176, over 21520.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3238, pruned_loss=0.09061, over 4273900.55 frames. ], batch size: 471, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:20:48,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-20 00:21:08,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=586584.0, ans=0.0 2023-06-20 00:21:43,499 INFO [train.py:996] (3/4) Epoch 4, batch 6300, loss[loss=0.3026, simple_loss=0.3527, pruned_loss=0.1263, over 21849.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3275, pruned_loss=0.09039, over 4272083.48 frames. ], batch size: 124, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:21:46,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-20 00:22:45,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.688e+02 3.149e+02 3.968e+02 6.842e+02, threshold=6.299e+02, percent-clipped=2.0 2023-06-20 00:22:49,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=586824.0, ans=0.0 2023-06-20 00:22:51,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=586884.0, ans=0.125 2023-06-20 00:23:17,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=586944.0, ans=0.125 2023-06-20 00:23:18,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0 2023-06-20 00:23:26,094 INFO [train.py:996] (3/4) Epoch 4, batch 6350, loss[loss=0.2501, simple_loss=0.3168, pruned_loss=0.09164, over 21930.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3337, pruned_loss=0.09674, over 4275921.83 frames. ], batch size: 351, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:23:46,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.43 vs. 
limit=15.0 2023-06-20 00:24:03,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=587064.0, ans=0.125 2023-06-20 00:24:34,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=587184.0, ans=0.125 2023-06-20 00:24:41,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=587184.0, ans=0.125 2023-06-20 00:24:52,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=587244.0, ans=0.0 2023-06-20 00:24:55,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=587244.0, ans=0.0 2023-06-20 00:25:00,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=587244.0, ans=0.125 2023-06-20 00:25:16,750 INFO [train.py:996] (3/4) Epoch 4, batch 6400, loss[loss=0.2862, simple_loss=0.3959, pruned_loss=0.08828, over 19725.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3423, pruned_loss=0.1017, over 4275788.99 frames. ], batch size: 703, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:25:36,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-20 00:25:55,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-20 00:26:04,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=587424.0, ans=0.95 2023-06-20 00:26:11,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.295e+02 3.771e+02 4.525e+02 8.192e+02, threshold=7.543e+02, percent-clipped=2.0 2023-06-20 00:27:05,302 INFO [train.py:996] (3/4) Epoch 4, batch 6450, loss[loss=0.2673, simple_loss=0.3305, pruned_loss=0.102, over 21330.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3432, pruned_loss=0.1007, over 4280902.81 frames. ], batch size: 176, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:27:18,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=587604.0, ans=0.0 2023-06-20 00:27:18,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587604.0, ans=0.1 2023-06-20 00:28:22,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=587844.0, ans=0.125 2023-06-20 00:28:22,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=587844.0, ans=0.02 2023-06-20 00:28:36,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-20 00:28:48,562 INFO [train.py:996] (3/4) Epoch 4, batch 6500, loss[loss=0.2302, simple_loss=0.288, pruned_loss=0.08616, over 21606.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.335, pruned_loss=0.09872, over 4272545.11 frames. 
], batch size: 247, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:29:08,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=587964.0, ans=0.125 2023-06-20 00:29:10,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=587964.0, ans=10.0 2023-06-20 00:29:14,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-20 00:29:35,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.671e+02 3.231e+02 3.777e+02 5.375e+02, threshold=6.462e+02, percent-clipped=0.0 2023-06-20 00:29:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=588024.0, ans=12.0 2023-06-20 00:29:41,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=588084.0, ans=0.2 2023-06-20 00:30:15,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=588144.0, ans=0.125 2023-06-20 00:30:30,232 INFO [train.py:996] (3/4) Epoch 4, batch 6550, loss[loss=0.284, simple_loss=0.3451, pruned_loss=0.1115, over 21752.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3342, pruned_loss=0.09779, over 4274446.19 frames. ], batch size: 441, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:31:26,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=588384.0, ans=10.0 2023-06-20 00:31:41,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=588384.0, ans=0.125 2023-06-20 00:32:13,059 INFO [train.py:996] (3/4) Epoch 4, batch 6600, loss[loss=0.2254, simple_loss=0.2852, pruned_loss=0.08277, over 21831.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3281, pruned_loss=0.09681, over 4278486.51 frames. ], batch size: 107, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:32:45,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=588564.0, ans=0.125 2023-06-20 00:33:01,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.719e+02 3.222e+02 3.782e+02 6.837e+02, threshold=6.444e+02, percent-clipped=2.0 2023-06-20 00:33:05,420 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:33:30,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=588684.0, ans=0.125 2023-06-20 00:33:34,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-20 00:33:54,745 INFO [train.py:996] (3/4) Epoch 4, batch 6650, loss[loss=0.2428, simple_loss=0.2969, pruned_loss=0.09437, over 21746.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3224, pruned_loss=0.09309, over 4271967.92 frames. 
], batch size: 112, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:34:29,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=588864.0, ans=0.0 2023-06-20 00:35:28,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-20 00:35:37,426 INFO [train.py:996] (3/4) Epoch 4, batch 6700, loss[loss=0.2194, simple_loss=0.2829, pruned_loss=0.07795, over 21850.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3171, pruned_loss=0.09218, over 4268434.09 frames. ], batch size: 107, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:35:54,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=589104.0, ans=0.125 2023-06-20 00:35:58,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589164.0, ans=0.1 2023-06-20 00:36:10,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=12.0 2023-06-20 00:36:23,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=589224.0, ans=0.0 2023-06-20 00:36:26,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.795e+02 3.323e+02 4.034e+02 6.039e+02, threshold=6.647e+02, percent-clipped=0.0 2023-06-20 00:36:26,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=589224.0, ans=0.125 2023-06-20 00:36:43,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589284.0, ans=0.1 2023-06-20 00:37:18,640 INFO [train.py:996] (3/4) Epoch 4, batch 6750, loss[loss=0.2529, simple_loss=0.3082, pruned_loss=0.09875, over 21865.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3151, pruned_loss=0.09151, over 4264204.71 frames. ], batch size: 283, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:37:22,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=589404.0, ans=0.0 2023-06-20 00:37:26,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-20 00:38:06,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-20 00:38:12,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=589584.0, ans=0.125 2023-06-20 00:38:28,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589584.0, ans=0.1 2023-06-20 00:38:54,400 INFO [train.py:996] (3/4) Epoch 4, batch 6800, loss[loss=0.2628, simple_loss=0.3085, pruned_loss=0.1085, over 21597.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3172, pruned_loss=0.09398, over 4257409.55 frames. 
], batch size: 414, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:39:40,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589824.0, ans=0.1 2023-06-20 00:39:43,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.844e+02 3.168e+02 3.952e+02 7.008e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-20 00:40:13,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=589944.0, ans=0.0 2023-06-20 00:40:28,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=589944.0, ans=0.125 2023-06-20 00:40:34,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=590004.0, ans=0.0 2023-06-20 00:40:35,921 INFO [train.py:996] (3/4) Epoch 4, batch 6850, loss[loss=0.23, simple_loss=0.2883, pruned_loss=0.08585, over 21674.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3158, pruned_loss=0.0956, over 4261863.90 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:40:36,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-20 00:41:10,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=590064.0, ans=0.125 2023-06-20 00:41:57,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-20 00:42:20,510 INFO [train.py:996] (3/4) Epoch 4, batch 6900, loss[loss=0.2399, simple_loss=0.3031, pruned_loss=0.08834, over 21790.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3181, pruned_loss=0.09556, over 4270677.02 frames. ], batch size: 112, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:42:22,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=590304.0, ans=0.125 2023-06-20 00:42:30,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590304.0, ans=0.1 2023-06-20 00:42:44,861 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:42:59,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=590424.0, ans=0.2 2023-06-20 00:43:22,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.145e+02 3.688e+02 5.056e+02 7.443e+02, threshold=7.376e+02, percent-clipped=5.0 2023-06-20 00:44:03,274 INFO [train.py:996] (3/4) Epoch 4, batch 6950, loss[loss=0.3086, simple_loss=0.3755, pruned_loss=0.1209, over 21808.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3203, pruned_loss=0.09329, over 4273045.26 frames. ], batch size: 118, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:44:04,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=15.0 2023-06-20 00:45:26,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=590784.0, ans=0.2 2023-06-20 00:45:50,587 INFO [train.py:996] (3/4) Epoch 4, batch 7000, loss[loss=0.341, simple_loss=0.3848, pruned_loss=0.1486, over 21290.00 frames. ], tot_loss[loss=0.259, simple_loss=0.324, pruned_loss=0.09703, over 4276295.56 frames. ], batch size: 507, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:46:09,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=590964.0, ans=0.2 2023-06-20 00:46:46,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.062e+02 3.465e+02 4.392e+02 8.171e+02, threshold=6.929e+02, percent-clipped=2.0 2023-06-20 00:46:50,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=591084.0, ans=0.125 2023-06-20 00:47:30,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=591144.0, ans=0.2 2023-06-20 00:47:33,145 INFO [train.py:996] (3/4) Epoch 4, batch 7050, loss[loss=0.2488, simple_loss=0.3326, pruned_loss=0.08247, over 21662.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3216, pruned_loss=0.0955, over 4271777.43 frames. ], batch size: 414, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:47:55,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=591264.0, ans=0.0 2023-06-20 00:48:14,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=591324.0, ans=0.1 2023-06-20 00:48:20,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=591324.0, ans=0.2 2023-06-20 00:49:03,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=591444.0, ans=0.125 2023-06-20 00:49:11,726 INFO [train.py:996] (3/4) Epoch 4, batch 7100, loss[loss=0.3431, simple_loss=0.3875, pruned_loss=0.1494, over 21408.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3275, pruned_loss=0.09675, over 4271737.85 frames. ], batch size: 471, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:50:00,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=591624.0, ans=0.0 2023-06-20 00:50:07,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.326e+02 4.324e+02 6.991e+02, threshold=6.652e+02, percent-clipped=1.0 2023-06-20 00:50:10,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-20 00:50:53,159 INFO [train.py:996] (3/4) Epoch 4, batch 7150, loss[loss=0.2649, simple_loss=0.3312, pruned_loss=0.09926, over 21773.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.322, pruned_loss=0.09318, over 4278754.16 frames. ], batch size: 247, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:52:24,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=592044.0, ans=0.0 2023-06-20 00:52:30,742 INFO [train.py:996] (3/4) Epoch 4, batch 7200, loss[loss=0.3309, simple_loss=0.3798, pruned_loss=0.1409, over 21780.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3263, pruned_loss=0.09679, over 4274765.02 frames. 
], batch size: 441, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:52:59,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=592164.0, ans=0.0 2023-06-20 00:53:17,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=592224.0, ans=0.0 2023-06-20 00:53:24,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-20 00:53:31,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.719e+02 3.108e+02 3.931e+02 6.174e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 00:53:37,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592284.0, ans=0.1 2023-06-20 00:53:47,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592284.0, ans=0.1 2023-06-20 00:53:58,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=592344.0, ans=0.125 2023-06-20 00:54:12,853 INFO [train.py:996] (3/4) Epoch 4, batch 7250, loss[loss=0.2484, simple_loss=0.3037, pruned_loss=0.0965, over 21744.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3235, pruned_loss=0.09649, over 4259204.49 frames. ], batch size: 300, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:54:54,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=592524.0, ans=0.125 2023-06-20 00:55:08,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592524.0, ans=0.1 2023-06-20 00:55:11,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=592524.0, ans=0.125 2023-06-20 00:55:24,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=592584.0, ans=0.05 2023-06-20 00:55:33,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=12.0 2023-06-20 00:55:36,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=592584.0, ans=0.125 2023-06-20 00:55:41,801 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:55:44,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=592644.0, ans=0.125 2023-06-20 00:55:55,965 INFO [train.py:996] (3/4) Epoch 4, batch 7300, loss[loss=0.2449, simple_loss=0.3006, pruned_loss=0.09461, over 21578.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3166, pruned_loss=0.09522, over 4262596.29 frames. 
], batch size: 298, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:56:16,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=592764.0, ans=0.125 2023-06-20 00:56:41,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=592824.0, ans=0.2 2023-06-20 00:56:54,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=592824.0, ans=0.0 2023-06-20 00:56:58,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.938e+02 3.597e+02 4.532e+02 8.618e+02, threshold=7.193e+02, percent-clipped=4.0 2023-06-20 00:57:19,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592944.0, ans=0.1 2023-06-20 00:57:44,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=593004.0, ans=0.0 2023-06-20 00:57:45,775 INFO [train.py:996] (3/4) Epoch 4, batch 7350, loss[loss=0.2462, simple_loss=0.2938, pruned_loss=0.09929, over 21752.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3133, pruned_loss=0.09585, over 4255823.54 frames. ], batch size: 300, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:57:46,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593004.0, ans=0.1 2023-06-20 00:58:15,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=593064.0, ans=0.125 2023-06-20 00:58:48,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=593184.0, ans=0.2 2023-06-20 00:59:22,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=12.0 2023-06-20 00:59:27,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=593244.0, ans=0.125 2023-06-20 00:59:31,746 INFO [train.py:996] (3/4) Epoch 4, batch 7400, loss[loss=0.197, simple_loss=0.2253, pruned_loss=0.08432, over 16704.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3198, pruned_loss=0.09879, over 4250450.02 frames. ], batch size: 60, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 01:00:02,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=593364.0, ans=0.125 2023-06-20 01:00:17,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593424.0, ans=0.1 2023-06-20 01:00:19,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=593424.0, ans=0.2 2023-06-20 01:00:28,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.006e+02 3.626e+02 4.126e+02 7.462e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-20 01:01:12,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593544.0, ans=0.1 2023-06-20 01:01:15,451 INFO [train.py:996] (3/4) Epoch 4, batch 7450, loss[loss=0.2378, simple_loss=0.293, pruned_loss=0.09133, over 21574.00 frames. 
], tot_loss[loss=0.2557, simple_loss=0.3183, pruned_loss=0.09661, over 4256874.73 frames. ], batch size: 247, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:01:50,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-20 01:02:17,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-20 01:02:43,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=593844.0, ans=0.125 2023-06-20 01:03:05,763 INFO [train.py:996] (3/4) Epoch 4, batch 7500, loss[loss=0.2909, simple_loss=0.3991, pruned_loss=0.09135, over 21302.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3252, pruned_loss=0.09874, over 4257419.21 frames. ], batch size: 549, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:03:41,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593964.0, ans=0.1 2023-06-20 01:03:54,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=594024.0, ans=0.125 2023-06-20 01:04:09,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.163e+02 3.635e+02 4.766e+02 7.864e+02, threshold=7.270e+02, percent-clipped=2.0 2023-06-20 01:04:27,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-20 01:04:50,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=594204.0, ans=0.125 2023-06-20 01:04:51,491 INFO [train.py:996] (3/4) Epoch 4, batch 7550, loss[loss=0.2494, simple_loss=0.3376, pruned_loss=0.08054, over 21607.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3335, pruned_loss=0.09733, over 4260520.46 frames. ], batch size: 263, lr: 8.27e-03, grad_scale: 16.0 2023-06-20 01:04:55,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=594204.0, ans=0.0 2023-06-20 01:05:39,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-20 01:05:49,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=594384.0, ans=0.125 2023-06-20 01:06:24,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=594444.0, ans=0.0 2023-06-20 01:06:26,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=594444.0, ans=0.125 2023-06-20 01:06:33,659 INFO [train.py:996] (3/4) Epoch 4, batch 7600, loss[loss=0.2806, simple_loss=0.3347, pruned_loss=0.1132, over 21800.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3325, pruned_loss=0.09567, over 4270178.92 frames. 
], batch size: 282, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:07:13,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=594624.0, ans=0.125 2023-06-20 01:07:25,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=594624.0, ans=0.0 2023-06-20 01:07:27,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.700e+02 3.107e+02 3.746e+02 5.626e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-20 01:07:52,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=594684.0, ans=0.0 2023-06-20 01:07:52,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=594684.0, ans=0.0 2023-06-20 01:08:17,615 INFO [train.py:996] (3/4) Epoch 4, batch 7650, loss[loss=0.2663, simple_loss=0.3208, pruned_loss=0.106, over 21408.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3304, pruned_loss=0.09762, over 4279722.40 frames. ], batch size: 159, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:08:48,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=594864.0, ans=0.125 2023-06-20 01:09:24,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=594984.0, ans=0.015 2023-06-20 01:10:03,404 INFO [train.py:996] (3/4) Epoch 4, batch 7700, loss[loss=0.3167, simple_loss=0.3743, pruned_loss=0.1296, over 21780.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3326, pruned_loss=0.1004, over 4282475.35 frames. ], batch size: 441, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:10:21,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=595104.0, ans=0.05 2023-06-20 01:10:34,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=595164.0, ans=0.015 2023-06-20 01:10:42,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=595224.0, ans=0.0 2023-06-20 01:10:43,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=595224.0, ans=0.0 2023-06-20 01:11:04,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-20 01:11:08,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.825e+02 3.572e+02 4.383e+02 7.085e+02, threshold=7.144e+02, percent-clipped=3.0 2023-06-20 01:11:50,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=595344.0, ans=0.0 2023-06-20 01:11:54,627 INFO [train.py:996] (3/4) Epoch 4, batch 7750, loss[loss=0.3223, simple_loss=0.4183, pruned_loss=0.1132, over 21860.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.339, pruned_loss=0.101, over 4283985.66 frames. 
], batch size: 372, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:12:03,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=595404.0, ans=0.125 2023-06-20 01:12:35,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=595464.0, ans=10.0 2023-06-20 01:12:38,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=595524.0, ans=0.1 2023-06-20 01:12:59,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=595524.0, ans=0.125 2023-06-20 01:13:04,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=595584.0, ans=0.95 2023-06-20 01:13:38,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=595644.0, ans=0.2 2023-06-20 01:13:41,000 INFO [train.py:996] (3/4) Epoch 4, batch 7800, loss[loss=0.339, simple_loss=0.393, pruned_loss=0.1425, over 21395.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3391, pruned_loss=0.1005, over 4283266.41 frames. ], batch size: 507, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:13:54,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=595704.0, ans=0.1 2023-06-20 01:13:58,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.96 vs. limit=22.5 2023-06-20 01:14:19,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=595764.0, ans=0.0 2023-06-20 01:14:26,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=595824.0, ans=0.125 2023-06-20 01:14:44,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.079e+02 3.630e+02 4.586e+02 7.709e+02, threshold=7.261e+02, percent-clipped=1.0 2023-06-20 01:15:24,356 INFO [train.py:996] (3/4) Epoch 4, batch 7850, loss[loss=0.241, simple_loss=0.2969, pruned_loss=0.09254, over 21793.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3313, pruned_loss=0.0994, over 4273396.97 frames. ], batch size: 112, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:15:24,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=596004.0, ans=0.2 2023-06-20 01:16:09,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-20 01:16:23,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=596124.0, ans=0.0 2023-06-20 01:16:30,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-20 01:16:49,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=596244.0, ans=0.2 2023-06-20 01:17:10,733 INFO [train.py:996] (3/4) Epoch 4, batch 7900, loss[loss=0.2198, simple_loss=0.2684, pruned_loss=0.08555, over 20016.00 frames. 
], tot_loss[loss=0.2622, simple_loss=0.3269, pruned_loss=0.09878, over 4262633.95 frames. ], batch size: 704, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:18:08,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=596424.0, ans=0.1 2023-06-20 01:18:14,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.214e+02 3.696e+02 4.914e+02 8.338e+02, threshold=7.393e+02, percent-clipped=4.0 2023-06-20 01:18:23,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.28 vs. limit=22.5 2023-06-20 01:18:44,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=596544.0, ans=0.035 2023-06-20 01:18:56,155 INFO [train.py:996] (3/4) Epoch 4, batch 7950, loss[loss=0.2603, simple_loss=0.3251, pruned_loss=0.09781, over 21344.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.334, pruned_loss=0.09941, over 4268445.52 frames. ], batch size: 159, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:19:22,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=596664.0, ans=0.125 2023-06-20 01:20:53,154 INFO [train.py:996] (3/4) Epoch 4, batch 8000, loss[loss=0.2602, simple_loss=0.3202, pruned_loss=0.1, over 21601.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3393, pruned_loss=0.1026, over 4265578.74 frames. ], batch size: 112, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:21:13,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=596964.0, ans=0.05 2023-06-20 01:21:13,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=596964.0, ans=0.125 2023-06-20 01:21:55,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.970e+02 3.290e+02 4.047e+02 5.946e+02, threshold=6.580e+02, percent-clipped=0.0 2023-06-20 01:22:46,646 INFO [train.py:996] (3/4) Epoch 4, batch 8050, loss[loss=0.2258, simple_loss=0.2924, pruned_loss=0.07961, over 21466.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3412, pruned_loss=0.1025, over 4255253.15 frames. ], batch size: 211, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:23:02,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=597264.0, ans=0.125 2023-06-20 01:23:34,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=597324.0, ans=0.125 2023-06-20 01:23:55,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=597384.0, ans=0.0 2023-06-20 01:24:32,694 INFO [train.py:996] (3/4) Epoch 4, batch 8100, loss[loss=0.3199, simple_loss=0.3694, pruned_loss=0.1352, over 21605.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3422, pruned_loss=0.1044, over 4267825.64 frames. 
], batch size: 471, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:24:36,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597504.0, ans=0.1 2023-06-20 01:25:39,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.127e+02 3.761e+02 5.016e+02 1.103e+03, threshold=7.523e+02, percent-clipped=9.0 2023-06-20 01:25:53,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=597684.0, ans=0.125 2023-06-20 01:26:13,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=597744.0, ans=0.0 2023-06-20 01:26:20,058 INFO [train.py:996] (3/4) Epoch 4, batch 8150, loss[loss=0.2661, simple_loss=0.3669, pruned_loss=0.08265, over 21823.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3483, pruned_loss=0.105, over 4259047.37 frames. ], batch size: 372, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:26:20,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=597804.0, ans=0.0 2023-06-20 01:26:54,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=597864.0, ans=0.125 2023-06-20 01:27:08,087 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:27:15,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-20 01:27:22,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597924.0, ans=0.1 2023-06-20 01:27:32,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=597984.0, ans=0.0 2023-06-20 01:27:40,060 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:27:40,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2023-06-20 01:28:10,471 INFO [train.py:996] (3/4) Epoch 4, batch 8200, loss[loss=0.2748, simple_loss=0.3234, pruned_loss=0.1131, over 21439.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3396, pruned_loss=0.1023, over 4255567.85 frames. ], batch size: 389, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:29:13,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.961e+02 3.420e+02 4.432e+02 7.003e+02, threshold=6.840e+02, percent-clipped=0.0 2023-06-20 01:29:18,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=598284.0, ans=0.0 2023-06-20 01:29:38,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=598344.0, ans=0.125 2023-06-20 01:29:53,762 INFO [train.py:996] (3/4) Epoch 4, batch 8250, loss[loss=0.3301, simple_loss=0.3999, pruned_loss=0.1302, over 21542.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3409, pruned_loss=0.1027, over 4265208.44 frames. 
], batch size: 508, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:29:54,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-20 01:29:56,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=598404.0, ans=0.035 2023-06-20 01:30:11,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=598404.0, ans=0.125 2023-06-20 01:30:11,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598404.0, ans=0.1 2023-06-20 01:30:30,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=598464.0, ans=0.125 2023-06-20 01:30:42,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=598524.0, ans=0.05 2023-06-20 01:30:43,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598524.0, ans=0.1 2023-06-20 01:31:16,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598644.0, ans=0.1 2023-06-20 01:31:22,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-20 01:31:34,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-20 01:31:38,805 INFO [train.py:996] (3/4) Epoch 4, batch 8300, loss[loss=0.2269, simple_loss=0.3117, pruned_loss=0.07105, over 21714.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3384, pruned_loss=0.09952, over 4266172.87 frames. 
], batch size: 298, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:32:25,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=598764.0, ans=0.125 2023-06-20 01:32:28,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=598824.0, ans=0.07 2023-06-20 01:32:28,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=598824.0, ans=0.125 2023-06-20 01:32:40,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=598824.0, ans=0.0 2023-06-20 01:32:43,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598884.0, ans=0.1 2023-06-20 01:32:43,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=598884.0, ans=0.2 2023-06-20 01:32:45,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.814e+02 3.368e+02 3.938e+02 8.477e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-20 01:32:48,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=598884.0, ans=0.125 2023-06-20 01:33:10,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=598944.0, ans=0.125 2023-06-20 01:33:23,474 INFO [train.py:996] (3/4) Epoch 4, batch 8350, loss[loss=0.248, simple_loss=0.3155, pruned_loss=0.09027, over 21632.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3362, pruned_loss=0.0969, over 4275640.56 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:33:26,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-20 01:34:54,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=599244.0, ans=0.0 2023-06-20 01:35:08,067 INFO [train.py:996] (3/4) Epoch 4, batch 8400, loss[loss=0.1999, simple_loss=0.2706, pruned_loss=0.06464, over 21273.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3324, pruned_loss=0.09344, over 4275387.48 frames. ], batch size: 131, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:36:14,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.656e+02 3.140e+02 3.908e+02 6.671e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-20 01:36:19,966 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:36:50,915 INFO [train.py:996] (3/4) Epoch 4, batch 8450, loss[loss=0.307, simple_loss=0.3454, pruned_loss=0.1343, over 21662.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3316, pruned_loss=0.09335, over 4281696.67 frames. 
], batch size: 508, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:37:15,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=599604.0, ans=0.125 2023-06-20 01:37:41,926 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:37:44,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=599724.0, ans=0.125 2023-06-20 01:37:48,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=599724.0, ans=0.2 2023-06-20 01:38:01,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=599784.0, ans=0.035 2023-06-20 01:38:16,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=599844.0, ans=0.0 2023-06-20 01:38:26,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=599844.0, ans=0.125 2023-06-20 01:38:34,731 INFO [train.py:996] (3/4) Epoch 4, batch 8500, loss[loss=0.2645, simple_loss=0.3186, pruned_loss=0.1052, over 21658.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3282, pruned_loss=0.09493, over 4287030.84 frames. ], batch size: 332, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:39:07,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=599964.0, ans=0.1 2023-06-20 01:39:44,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.032e+02 3.480e+02 4.088e+02 6.738e+02, threshold=6.960e+02, percent-clipped=1.0 2023-06-20 01:40:02,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=600144.0, ans=0.0 2023-06-20 01:40:04,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600144.0, ans=0.1 2023-06-20 01:40:21,723 INFO [train.py:996] (3/4) Epoch 4, batch 8550, loss[loss=0.3358, simple_loss=0.4249, pruned_loss=0.1234, over 21275.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.334, pruned_loss=0.09831, over 4290153.79 frames. 
], batch size: 548, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:40:51,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=600264.0, ans=0.125 2023-06-20 01:41:13,143 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:41:16,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=600324.0, ans=0.125 2023-06-20 01:41:16,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=600324.0, ans=0.2 2023-06-20 01:41:35,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=600384.0, ans=0.0 2023-06-20 01:41:38,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=600384.0, ans=0.125 2023-06-20 01:41:54,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=600444.0, ans=0.125 2023-06-20 01:42:17,772 INFO [train.py:996] (3/4) Epoch 4, batch 8600, loss[loss=0.3356, simple_loss=0.3908, pruned_loss=0.1402, over 21433.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3398, pruned_loss=0.1002, over 4288126.90 frames. ], batch size: 471, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:42:30,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=600504.0, ans=0.125 2023-06-20 01:42:41,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=600564.0, ans=0.0 2023-06-20 01:42:56,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=600624.0, ans=0.2 2023-06-20 01:43:15,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.082e+02 3.829e+02 4.661e+02 1.059e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-20 01:43:27,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=600684.0, ans=0.2 2023-06-20 01:44:00,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=600804.0, ans=0.125 2023-06-20 01:44:06,498 INFO [train.py:996] (3/4) Epoch 4, batch 8650, loss[loss=0.219, simple_loss=0.3318, pruned_loss=0.05311, over 21237.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3459, pruned_loss=0.1005, over 4285333.26 frames. ], batch size: 548, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:44:18,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=600804.0, ans=0.0 2023-06-20 01:44:42,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=600924.0, ans=0.125 2023-06-20 01:45:40,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-20 01:45:44,857 INFO [train.py:996] (3/4) Epoch 4, batch 8700, loss[loss=0.2351, simple_loss=0.293, pruned_loss=0.08857, over 21802.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3363, pruned_loss=0.09667, over 4281989.10 frames. 
], batch size: 118, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:46:18,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-20 01:46:36,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=601224.0, ans=0.125 2023-06-20 01:46:42,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.831e+02 3.437e+02 4.356e+02 1.035e+03, threshold=6.874e+02, percent-clipped=3.0 2023-06-20 01:46:56,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=601284.0, ans=0.0 2023-06-20 01:47:26,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-20 01:47:35,531 INFO [train.py:996] (3/4) Epoch 4, batch 8750, loss[loss=0.2923, simple_loss=0.3486, pruned_loss=0.1179, over 21586.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3325, pruned_loss=0.09751, over 4285856.32 frames. ], batch size: 471, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:48:35,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=601584.0, ans=0.2 2023-06-20 01:49:22,800 INFO [train.py:996] (3/4) Epoch 4, batch 8800, loss[loss=0.2875, simple_loss=0.3547, pruned_loss=0.1102, over 21512.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.341, pruned_loss=0.1013, over 4289246.29 frames. ], batch size: 194, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:49:48,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=601764.0, ans=0.0 2023-06-20 01:49:49,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-20 01:49:57,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-20 01:50:20,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.907e+02 3.416e+02 4.301e+02 7.142e+02, threshold=6.833e+02, percent-clipped=3.0 2023-06-20 01:50:53,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=601944.0, ans=0.04949747468305833 2023-06-20 01:51:03,475 INFO [train.py:996] (3/4) Epoch 4, batch 8850, loss[loss=0.2989, simple_loss=0.3458, pruned_loss=0.126, over 21349.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3479, pruned_loss=0.104, over 4287649.57 frames. ], batch size: 508, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:51:10,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=602004.0, ans=0.125 2023-06-20 01:52:26,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=602244.0, ans=0.125 2023-06-20 01:52:44,277 INFO [train.py:996] (3/4) Epoch 4, batch 8900, loss[loss=0.2527, simple_loss=0.3067, pruned_loss=0.09937, over 21600.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3418, pruned_loss=0.102, over 4287960.23 frames. 
], batch size: 298, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:53:09,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-20 01:53:34,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-20 01:53:54,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-20 01:54:00,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.914e+02 3.498e+02 4.067e+02 9.619e+02, threshold=6.997e+02, percent-clipped=2.0 2023-06-20 01:54:03,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=602484.0, ans=0.0 2023-06-20 01:54:18,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=602544.0, ans=0.0 2023-06-20 01:54:31,817 INFO [train.py:996] (3/4) Epoch 4, batch 8950, loss[loss=0.2189, simple_loss=0.2674, pruned_loss=0.08518, over 20760.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3419, pruned_loss=0.101, over 4283447.05 frames. ], batch size: 609, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:54:43,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=602604.0, ans=0.0 2023-06-20 01:55:59,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=602844.0, ans=0.0 2023-06-20 01:56:03,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.04 vs. limit=15.0 2023-06-20 01:56:06,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-20 01:56:15,319 INFO [train.py:996] (3/4) Epoch 4, batch 9000, loss[loss=0.2508, simple_loss=0.3052, pruned_loss=0.09817, over 21057.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3368, pruned_loss=0.1005, over 4277392.08 frames. ], batch size: 143, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:56:15,319 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 01:56:37,869 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2701, simple_loss=0.3695, pruned_loss=0.08531, over 1796401.00 frames. 2023-06-20 01:56:37,870 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 01:57:07,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=602964.0, ans=0.125 2023-06-20 01:57:36,277 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:57:40,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.934e+02 3.477e+02 4.426e+02 7.521e+02, threshold=6.955e+02, percent-clipped=2.0 2023-06-20 01:58:24,269 INFO [train.py:996] (3/4) Epoch 4, batch 9050, loss[loss=0.188, simple_loss=0.2685, pruned_loss=0.05377, over 21556.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.331, pruned_loss=0.09628, over 4273914.35 frames. 
], batch size: 212, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:58:27,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-20 01:58:34,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5 2023-06-20 01:58:43,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=603204.0, ans=0.125 2023-06-20 01:58:58,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=603264.0, ans=0.2 2023-06-20 01:59:06,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-20 01:59:42,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=603444.0, ans=0.125 2023-06-20 02:00:15,319 INFO [train.py:996] (3/4) Epoch 4, batch 9100, loss[loss=0.2666, simple_loss=0.3592, pruned_loss=0.08696, over 21664.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3388, pruned_loss=0.1002, over 4268373.38 frames. ], batch size: 441, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 02:00:17,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=603504.0, ans=0.125 2023-06-20 02:00:29,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=603504.0, ans=0.07 2023-06-20 02:00:43,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=603564.0, ans=0.125 2023-06-20 02:00:46,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=603564.0, ans=0.125 2023-06-20 02:01:08,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.683e+02 3.374e+02 4.242e+02 6.313e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-20 02:01:15,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=603684.0, ans=0.125 2023-06-20 02:01:20,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=603684.0, ans=0.0 2023-06-20 02:01:29,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603744.0, ans=0.1 2023-06-20 02:01:56,015 INFO [train.py:996] (3/4) Epoch 4, batch 9150, loss[loss=0.275, simple_loss=0.3808, pruned_loss=0.08456, over 21218.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3399, pruned_loss=0.09669, over 4270070.46 frames. ], batch size: 548, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:02:42,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-06-20 02:03:37,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=604044.0, ans=0.125 2023-06-20 02:03:41,429 INFO [train.py:996] (3/4) Epoch 4, batch 9200, loss[loss=0.2159, simple_loss=0.306, pruned_loss=0.06289, over 21731.00 frames. 
2023-06-20 02:03:41,429 INFO [train.py:996] (3/4) Epoch 4, batch 9200, loss[loss=0.2159, simple_loss=0.306, pruned_loss=0.06289, over 21731.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3409, pruned_loss=0.09532, over 4263672.55 frames. ], batch size: 298, lr: 8.20e-03, grad_scale: 32.0
2023-06-20 02:03:52,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=604104.0, ans=0.0
2023-06-20 02:04:19,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=604224.0, ans=0.0
2023-06-20 02:04:45,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.901e+02 3.630e+02 4.447e+02 7.984e+02, threshold=7.260e+02, percent-clipped=1.0
2023-06-20 02:05:10,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=604344.0, ans=0.0
2023-06-20 02:05:14,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0
2023-06-20 02:05:17,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=604344.0, ans=0.125
2023-06-20 02:05:24,864 INFO [train.py:996] (3/4) Epoch 4, batch 9250, loss[loss=0.2405, simple_loss=0.3001, pruned_loss=0.09043, over 21912.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3449, pruned_loss=0.09996, over 4266151.07 frames. ], batch size: 113, lr: 8.20e-03, grad_scale: 16.0
2023-06-20 02:05:52,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=604464.0, ans=0.125
2023-06-20 02:05:52,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=604464.0, ans=0.0
2023-06-20 02:06:54,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0
2023-06-20 02:07:06,130 INFO [train.py:996] (3/4) Epoch 4, batch 9300, loss[loss=0.2322, simple_loss=0.2935, pruned_loss=0.08542, over 21881.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3379, pruned_loss=0.09893, over 4274208.40 frames. ], batch size: 107, lr: 8.20e-03, grad_scale: 16.0
2023-06-20 02:07:14,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=604704.0, ans=0.1
2023-06-20 02:07:27,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5
2023-06-20 02:07:56,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=604824.0, ans=0.125
2023-06-20 02:08:18,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.021e+02 3.641e+02 4.393e+02 8.139e+02, threshold=7.281e+02, percent-clipped=1.0
2023-06-20 02:08:26,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=604884.0, ans=0.125
2023-06-20 02:08:30,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=604944.0, ans=10.0
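The grad_scale printed with each batch is the dynamic loss-scaling factor for mixed-precision (fp16) training: it grows while steps succeed and is halved when the scaled gradients overflow, which is presumably why it drops from 32.0 at batch 9200 to 16.0 at batch 9250 above. A minimal sketch of the mechanism using PyTorch's own GradScaler; the update policy shown is PyTorch's standard one, not necessarily the exact policy of this training script:

import torch

model = torch.nn.Linear(80, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.045)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(x, y):
    # x, y are expected to be CUDA tensors here
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # multiply loss by the current scale
    scaler.step(opt)               # unscales grads; skips the step on inf/nan
    scaler.update()                # halves the scale on overflow, else grows it
    return loss.item(), scaler.get_scale()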
2023-06-20 02:08:35,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0
2023-06-20 02:08:43,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=604944.0, ans=0.125
2023-06-20 02:08:46,467 INFO [train.py:996] (3/4) Epoch 4, batch 9350, loss[loss=0.3131, simple_loss=0.3752, pruned_loss=0.1256, over 21566.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3446, pruned_loss=0.1008, over 4276130.50 frames. ], batch size: 263, lr: 8.20e-03, grad_scale: 16.0
2023-06-20 02:09:08,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0
2023-06-20 02:09:19,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=605064.0, ans=0.035
2023-06-20 02:09:53,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=605124.0, ans=0.04949747468305833
2023-06-20 02:10:04,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=605184.0, ans=0.0
2023-06-20 02:10:30,855 INFO [train.py:996] (3/4) Epoch 4, batch 9400, loss[loss=0.2478, simple_loss=0.3072, pruned_loss=0.09416, over 21134.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3472, pruned_loss=0.1022, over 4277281.89 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0
2023-06-20 02:11:44,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=605484.0, ans=0.125
2023-06-20 02:11:46,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.026e+02 3.567e+02 4.359e+02 8.563e+02, threshold=7.134e+02, percent-clipped=2.0
2023-06-20 02:11:49,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=605484.0, ans=0.125
2023-06-20 02:12:13,919 INFO [train.py:996] (3/4) Epoch 4, batch 9450, loss[loss=0.2176, simple_loss=0.2804, pruned_loss=0.07734, over 21725.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3375, pruned_loss=0.1004, over 4282679.27 frames. ], batch size: 300, lr: 8.19e-03, grad_scale: 16.0
2023-06-20 02:12:52,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5
2023-06-20 02:13:26,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=605784.0, ans=0.2
2023-06-20 02:13:52,766 INFO [train.py:996] (3/4) Epoch 4, batch 9500, loss[loss=0.2868, simple_loss=0.3436, pruned_loss=0.115, over 21186.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3299, pruned_loss=0.09796, over 4262129.28 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0
2023-06-20 02:15:09,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.876e+02 3.483e+02 4.277e+02 8.627e+02, threshold=6.965e+02, percent-clipped=2.0
2023-06-20 02:15:11,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=606084.0, ans=0.0
2023-06-20 02:15:28,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0
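The scaling.py:962 lines fire when a Whiten module measures how far a layer's output covariance is from white (isotropic) and finds the metric above its allowed limit, at which point a whitening penalty kicks in. A toy version of such a metric, assuming it is the ratio of the largest to the mean covariance eigenvalue computed per channel group; the real module's metric and its penalty schedule differ in detail:

import torch

def whitening_metric(x, num_groups=1):
    """x: (num_frames, num_channels). Returns the max/mean covariance
    eigenvalue ratio averaged over channel groups; 1.0 = perfectly white."""
    x = x - x.mean(dim=0)
    ratios = []
    for c in x.chunk(num_groups, dim=1):
        cov = (c.T @ c) / c.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        ratios.append(eigs.max() / eigs.mean())
    return torch.stack(ratios).mean()

feats = torch.randn(1000, 256)
metric = whitening_metric(feats)  # compare against the logged "limit"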
2023-06-20 02:15:37,582 INFO [train.py:996] (3/4) Epoch 4, batch 9550, loss[loss=0.3157, simple_loss=0.3776, pruned_loss=0.1269, over 21195.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3338, pruned_loss=0.09915, over 4267774.97 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 16.0
2023-06-20 02:15:51,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=606204.0, ans=0.125
2023-06-20 02:16:30,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=606324.0, ans=0.125
2023-06-20 02:17:12,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.91 vs. limit=10.0
2023-06-20 02:17:21,027 INFO [train.py:996] (3/4) Epoch 4, batch 9600, loss[loss=0.294, simple_loss=0.3598, pruned_loss=0.1141, over 21423.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3383, pruned_loss=0.1027, over 4272269.26 frames. ], batch size: 211, lr: 8.19e-03, grad_scale: 32.0
2023-06-20 02:17:35,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=606504.0, ans=0.125
2023-06-20 02:17:40,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0
2023-06-20 02:17:58,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=606564.0, ans=0.0
2023-06-20 02:17:59,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=606564.0, ans=0.125
2023-06-20 02:18:36,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.961e+02 3.442e+02 3.920e+02 7.478e+02, threshold=6.885e+02, percent-clipped=1.0
2023-06-20 02:18:48,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=606744.0, ans=0.125
2023-06-20 02:18:54,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0
2023-06-20 02:18:58,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=606744.0, ans=0.09899494936611666
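The three numbers in every loss[...] record are related deterministically: the printed loss is the pruned-transducer objective combined as loss = 0.5 * simple_loss + pruned_loss, i.e. the simple (trivial-joiner) loss enters with the simple_loss_scale of 0.5 and the pruned loss with weight 1.0. For the batch 9600 tot_loss above: 0.5 * 0.3383 + 0.1027 = 0.2719. A one-line check (pure arithmetic, no assumptions beyond the scale just stated; note the warm-up phase early in training blends differently):

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.3383, 0.1027) - 0.2719) < 5e-4  # batch 9600 tot_loss
assert abs(combined_loss(0.3419, 0.1010) - 0.2719) < 5e-4  # batch 8950 tot_loss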
2023-06-20 02:19:09,537 INFO [train.py:996] (3/4) Epoch 4, batch 9650, loss[loss=0.2834, simple_loss=0.3399, pruned_loss=0.1135, over 21503.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3367, pruned_loss=0.1015, over 4273514.17 frames. ], batch size: 508, lr: 8.18e-03, grad_scale: 16.0
2023-06-20 02:19:15,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=606804.0, ans=0.125
2023-06-20 02:19:23,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=606804.0, ans=0.09899494936611666
2023-06-20 02:19:26,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=606804.0, ans=0.125
2023-06-20 02:19:50,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=606864.0, ans=0.125
2023-06-20 02:19:52,114 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:20:34,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=607044.0, ans=0.95
2023-06-20 02:20:47,472 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:20:54,111 INFO [train.py:996] (3/4) Epoch 4, batch 9700, loss[loss=0.2472, simple_loss=0.332, pruned_loss=0.08118, over 21636.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3391, pruned_loss=0.1011, over 4274739.73 frames. ], batch size: 263, lr: 8.18e-03, grad_scale: 16.0
2023-06-20 02:21:24,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=607104.0, ans=0.125
2023-06-20 02:21:38,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=607164.0, ans=0.0
2023-06-20 02:21:41,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=607164.0, ans=0.07
2023-06-20 02:21:54,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=607224.0, ans=0.05
2023-06-20 02:21:59,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=607284.0, ans=0.2
2023-06-20 02:22:07,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.962e+02 3.413e+02 3.970e+02 9.096e+02, threshold=6.826e+02, percent-clipped=3.0
2023-06-20 02:22:12,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=607284.0, ans=0.125
2023-06-20 02:22:17,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=607344.0, ans=0.125
2023-06-20 02:22:38,372 INFO [train.py:996] (3/4) Epoch 4, batch 9750, loss[loss=0.2488, simple_loss=0.3038, pruned_loss=0.09688, over 15416.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3336, pruned_loss=0.1001, over 4272050.09 frames. ], batch size: 60, lr: 8.18e-03, grad_scale: 16.0
2023-06-20 02:22:38,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=607404.0, ans=0.0
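The scaling.py:1052 WithLoss lines (two appear just above) report modules that attach an auxiliary penalty to an intermediate tensor, here the attention weights; loss-sum=0.000e+00 means the constraint is currently satisfied, so nothing flows back. A minimal sketch of the attach-a-penalty pattern, where the forward value is untouched and the penalty's gradient is injected in backward; this is an illustrative reconstruction with a toy constraint, not the exact scaling.py code:

import torch

class AttachLoss(torch.autograd.Function):
    """Pass x through unchanged, but report and backprop a penalty on it."""

    @staticmethod
    def forward(ctx, x, name):
        penalty = torch.relu(x.abs() - 10.0)        # toy constraint: |x| <= 10
        ctx.save_for_backward(torch.sign(x) * (x.abs() > 10.0).float())
        print(f"WithLoss: name={name}, loss-sum={penalty.sum().item():.3e}")
        return x                                    # forward value unchanged

    @staticmethod
    def backward(ctx, grad_out):
        (penalty_grad,) = ctx.saved_tensors
        return grad_out + penalty_grad, None        # inject the penalty gradient

Usage would look like y = AttachLoss.apply(attn_weights, "encoder...self_attn_weights"); a healthy module prints loss-sum=0.000e+00, exactly as in this log.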
2023-06-20 02:23:22,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0
2023-06-20 02:23:28,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607524.0, ans=0.1
2023-06-20 02:23:29,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=607524.0, ans=0.0
2023-06-20 02:24:03,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=607644.0, ans=0.125
2023-06-20 02:24:13,820 INFO [train.py:996] (3/4) Epoch 4, batch 9800, loss[loss=0.2486, simple_loss=0.3299, pruned_loss=0.0837, over 21891.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3339, pruned_loss=0.1006, over 4274963.17 frames. ], batch size: 124, lr: 8.18e-03, grad_scale: 16.0
2023-06-20 02:24:59,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=607764.0, ans=0.125
2023-06-20 02:25:00,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=607764.0, ans=0.0
2023-06-20 02:25:25,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=607884.0, ans=0.0
2023-06-20 02:25:29,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.802e+02 3.178e+02 3.680e+02 5.783e+02, threshold=6.355e+02, percent-clipped=0.0
2023-06-20 02:25:56,197 INFO [train.py:996] (3/4) Epoch 4, batch 9850, loss[loss=0.2594, simple_loss=0.3076, pruned_loss=0.1056, over 21243.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3306, pruned_loss=0.1002, over 4278810.40 frames. ], batch size: 159, lr: 8.18e-03, grad_scale: 16.0
2023-06-20 02:26:05,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=608004.0, ans=0.0
2023-06-20 02:26:47,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=608064.0, ans=0.125
2023-06-20 02:27:20,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=608184.0, ans=0.0
2023-06-20 02:27:41,043 INFO [train.py:996] (3/4) Epoch 4, batch 9900, loss[loss=0.2821, simple_loss=0.3708, pruned_loss=0.09671, over 20740.00 frames. ], tot_loss[loss=0.262, simple_loss=0.326, pruned_loss=0.09895, over 4267418.67 frames. ], batch size: 607, lr: 8.17e-03, grad_scale: 16.0
2023-06-20 02:28:23,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=608364.0, ans=0.0
2023-06-20 02:28:47,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0
2023-06-20 02:29:00,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.957e+02 3.478e+02 4.825e+02 8.249e+02, threshold=6.956e+02, percent-clipped=2.0
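The lr printed with each batch creeps smoothly from 8.21e-03 toward 8.17e-03 over these few hundred batches because the schedule decays continuously in both the batch and the epoch dimension (icefall calls this the Eden schedule). A sketch of the formula as best I recall it from optim.py; treat the exact exponents and normalization as an assumption:

def eden_lr(base_lr, batch, epoch, lr_batches=7500, lr_epochs=1.5):
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With base_lr=0.045, epoch=4 and a cumulative batch index near 79k this
# gives roughly 8.2e-03, consistent with the values printed here.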
2023-06-20 02:29:31,820 INFO [train.py:996] (3/4) Epoch 4, batch 9950, loss[loss=0.2706, simple_loss=0.3312, pruned_loss=0.105, over 19955.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3297, pruned_loss=0.1013, over 4269902.85 frames. ], batch size: 702, lr: 8.17e-03, grad_scale: 16.0
2023-06-20 02:29:44,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=608604.0, ans=0.1
2023-06-20 02:30:29,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=608724.0, ans=0.0
2023-06-20 02:31:01,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=608844.0, ans=0.025
2023-06-20 02:31:23,718 INFO [train.py:996] (3/4) Epoch 4, batch 10000, loss[loss=0.2249, simple_loss=0.2855, pruned_loss=0.08218, over 21580.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3256, pruned_loss=0.1008, over 4263106.33 frames. ], batch size: 230, lr: 8.17e-03, grad_scale: 32.0
2023-06-20 02:31:24,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=608904.0, ans=0.0
2023-06-20 02:32:32,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.674e+02 3.222e+02 3.735e+02 9.123e+02, threshold=6.443e+02, percent-clipped=2.0
2023-06-20 02:32:43,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0
2023-06-20 02:33:03,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609144.0, ans=0.1
2023-06-20 02:33:10,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=609144.0, ans=0.125
2023-06-20 02:33:14,568 INFO [train.py:996] (3/4) Epoch 4, batch 10050, loss[loss=0.2092, simple_loss=0.2841, pruned_loss=0.06721, over 21663.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3279, pruned_loss=0.1018, over 4268927.55 frames. ], batch size: 332, lr: 8.17e-03, grad_scale: 32.0
2023-06-20 02:33:42,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=609264.0, ans=0.125
2023-06-20 02:35:07,516 INFO [train.py:996] (3/4) Epoch 4, batch 10100, loss[loss=0.2113, simple_loss=0.2802, pruned_loss=0.0712, over 21627.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.325, pruned_loss=0.09922, over 4261918.30 frames. ], batch size: 230, lr: 8.17e-03, grad_scale: 16.0
2023-06-20 02:35:24,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=609564.0, ans=0.125
2023-06-20 02:35:57,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=609684.0, ans=0.0
2023-06-20 02:36:10,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=609684.0, ans=0.07
2023-06-20 02:36:11,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.027e+02 3.628e+02 4.360e+02 7.943e+02, threshold=7.256e+02, percent-clipped=2.0
2023-06-20 02:36:47,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=609744.0, ans=0.0
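In each train.py:996 record, loss[...] is the value for the single batch being logged while tot_loss[...] is a frame-weighted running average, which is why it is accompanied by a much larger "over N frames" count. A minimal sketch of that bookkeeping, assuming a plain weighted average; in the real script the tracker is periodically reset, which is why the tot_loss frame count hovers around 4.27M here instead of growing without bound:

class LossTracker:
    def __init__(self):
        self.frames = 0.0
        self.weighted = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, batch_frames, **losses):
        self.frames += batch_frames
        for k, v in losses.items():
            self.weighted[k] += v * batch_frames   # weight each batch by its frames

    def summary(self):
        avg = {k: v / self.frames for k, v in self.weighted.items()}
        return (f"tot_loss[loss={avg['loss']:.4}, simple_loss={avg['simple_loss']:.4}, "
                f"pruned_loss={avg['pruned_loss']:.4}, over {self.frames:.2f} frames. ]")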
2023-06-20 02:36:53,346 INFO [train.py:996] (3/4) Epoch 4, batch 10150, loss[loss=0.2501, simple_loss=0.3247, pruned_loss=0.08774, over 21261.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3313, pruned_loss=0.1012, over 4270294.12 frames. ], batch size: 176, lr: 8.16e-03, grad_scale: 16.0
2023-06-20 02:36:58,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=609804.0, ans=0.125
2023-06-20 02:36:58,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=609804.0, ans=0.125
2023-06-20 02:37:09,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609864.0, ans=0.1
2023-06-20 02:37:18,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=609864.0, ans=0.125
2023-06-20 02:37:23,599 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 02:37:31,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=609924.0, ans=0.0
2023-06-20 02:38:02,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=609984.0, ans=0.035
2023-06-20 02:38:04,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=609984.0, ans=0.125
2023-06-20 02:38:06,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=609984.0, ans=0.0
2023-06-20 02:38:38,458 INFO [train.py:996] (3/4) Epoch 4, batch 10200, loss[loss=0.2482, simple_loss=0.3316, pruned_loss=0.08243, over 21702.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.331, pruned_loss=0.09854, over 4275758.89 frames. ], batch size: 415, lr: 8.16e-03, grad_scale: 16.0
2023-06-20 02:39:07,649 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0
2023-06-20 02:39:09,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5
2023-06-20 02:39:46,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=610284.0, ans=0.035
2023-06-20 02:39:52,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.503e+02 3.181e+02 4.095e+02 8.895e+02, threshold=6.363e+02, percent-clipped=3.0
2023-06-20 02:39:52,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=610284.0, ans=0.0
2023-06-20 02:40:23,048 INFO [train.py:996] (3/4) Epoch 4, batch 10250, loss[loss=0.2787, simple_loss=0.3501, pruned_loss=0.1036, over 21604.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3244, pruned_loss=0.09222, over 4279651.54 frames. ], batch size: 389, lr: 8.16e-03, grad_scale: 16.0
2023-06-20 02:40:57,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=610524.0, ans=0.0
2023-06-20 02:41:16,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=610524.0, ans=0.2
2023-06-20 02:41:20,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=610524.0, ans=0.125
2023-06-20 02:42:09,906 INFO [train.py:996] (3/4) Epoch 4, batch 10300, loss[loss=0.3192, simple_loss=0.4003, pruned_loss=0.119, over 21907.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3288, pruned_loss=0.09371, over 4278230.24 frames. ], batch size: 372, lr: 8.16e-03, grad_scale: 16.0
2023-06-20 02:42:25,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=610764.0, ans=0.125
2023-06-20 02:42:36,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=610764.0, ans=0.2
2023-06-20 02:43:25,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 3.082e+02 3.756e+02 4.509e+02 8.129e+02, threshold=7.512e+02, percent-clipped=5.0
2023-06-20 02:43:40,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0
2023-06-20 02:43:51,313 INFO [train.py:996] (3/4) Epoch 4, batch 10350, loss[loss=0.2541, simple_loss=0.329, pruned_loss=0.08962, over 21819.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3301, pruned_loss=0.09481, over 4267616.00 frames. ], batch size: 372, lr: 8.16e-03, grad_scale: 16.0
2023-06-20 02:43:55,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=611004.0, ans=0.0
2023-06-20 02:44:28,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=611064.0, ans=0.125
2023-06-20 02:44:38,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=611124.0, ans=0.125
2023-06-20 02:45:03,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0
2023-06-20 02:45:35,493 INFO [train.py:996] (3/4) Epoch 4, batch 10400, loss[loss=0.2474, simple_loss=0.3196, pruned_loss=0.08766, over 21686.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.322, pruned_loss=0.09172, over 4266060.43 frames. ], batch size: 391, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 02:46:30,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0
2023-06-20 02:46:56,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.060e+02 3.521e+02 4.304e+02 7.584e+02, threshold=7.042e+02, percent-clipped=1.0
2023-06-20 02:47:21,557 INFO [train.py:996] (3/4) Epoch 4, batch 10450, loss[loss=0.2961, simple_loss=0.3667, pruned_loss=0.1128, over 21613.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3281, pruned_loss=0.096, over 4259546.50 frames. ], batch size: 263, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 02:47:48,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=611664.0, ans=0.125
2023-06-20 02:48:30,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=611784.0, ans=0.0
2023-06-20 02:48:39,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5
2023-06-20 02:49:17,067 INFO [train.py:996] (3/4) Epoch 4, batch 10500, loss[loss=0.292, simple_loss=0.4054, pruned_loss=0.08932, over 19846.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3264, pruned_loss=0.09451, over 4257274.59 frames. ], batch size: 703, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 02:50:02,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0
2023-06-20 02:50:11,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=612024.0, ans=0.0
2023-06-20 02:50:25,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.864e+02 3.538e+02 4.749e+02 1.100e+03, threshold=7.075e+02, percent-clipped=4.0
2023-06-20 02:50:44,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=612144.0, ans=0.125
2023-06-20 02:51:01,150 INFO [train.py:996] (3/4) Epoch 4, batch 10550, loss[loss=0.2671, simple_loss=0.2997, pruned_loss=0.1172, over 14886.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3245, pruned_loss=0.09572, over 4247408.76 frames. ], batch size: 60, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 02:51:19,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612264.0, ans=0.125
2023-06-20 02:51:51,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612324.0, ans=0.1
2023-06-20 02:52:41,237 INFO [train.py:996] (3/4) Epoch 4, batch 10600, loss[loss=0.2375, simple_loss=0.2892, pruned_loss=0.09292, over 21658.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3198, pruned_loss=0.09351, over 4245808.50 frames. ], batch size: 282, lr: 8.15e-03, grad_scale: 32.0
2023-06-20 02:53:19,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=612564.0, ans=0.2
2023-06-20 02:53:33,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=612624.0, ans=0.025
2023-06-20 02:53:53,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.887e+02 3.753e+02 5.124e+02 8.898e+02, threshold=7.506e+02, percent-clipped=7.0
2023-06-20 02:54:17,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=612744.0, ans=0.035
2023-06-20 02:54:28,373 INFO [train.py:996] (3/4) Epoch 4, batch 10650, loss[loss=0.1834, simple_loss=0.2633, pruned_loss=0.05177, over 21395.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3213, pruned_loss=0.09234, over 4251674.72 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 02:55:10,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=612864.0, ans=0.5
2023-06-20 02:55:15,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=612924.0, ans=0.125
2023-06-20 02:56:18,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=613044.0, ans=0.125
2023-06-20 02:56:26,505 INFO [train.py:996] (3/4) Epoch 4, batch 10700, loss[loss=0.3213, simple_loss=0.3742, pruned_loss=0.1342, over 21598.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3214, pruned_loss=0.09232, over 4242816.52 frames. ], batch size: 263, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 02:56:33,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=613104.0, ans=0.0
2023-06-20 02:56:51,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=613164.0, ans=0.125
2023-06-20 02:57:20,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=613284.0, ans=0.1
2023-06-20 02:57:36,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.228e+02 3.667e+02 4.550e+02 7.977e+02, threshold=7.334e+02, percent-clipped=1.0
2023-06-20 02:57:38,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=613284.0, ans=0.125
2023-06-20 02:57:41,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=613284.0, ans=0.125
2023-06-20 02:57:43,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=613344.0, ans=0.0
2023-06-20 02:58:11,813 INFO [train.py:996] (3/4) Epoch 4, batch 10750, loss[loss=0.2908, simple_loss=0.3538, pruned_loss=0.1139, over 20669.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3325, pruned_loss=0.09805, over 4255257.86 frames. ], batch size: 607, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 02:58:20,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=613404.0, ans=0.125
2023-06-20 02:59:17,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=613584.0, ans=0.09899494936611666
2023-06-20 02:59:57,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=613704.0, ans=0.0
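The balancer fields scattered through these records (prob=0.125, min_positive=0.025, max_positive=0.95, min_abs=0.5, max_abs=10.0) belong to activation balancers: with probability prob a layer checks, per channel, what fraction of its activations are positive and how large they are on average, and nudges gradients when a channel drifts outside the configured band. A toy check of just the statistics involved; the gradient-nudging itself is omitted and the names are illustrative:

import torch

def balancer_stats(x, min_positive=0.025, max_positive=0.95,
                   min_abs=0.2, max_abs=10.0):
    """x: (frames, channels). Flags channels outside the allowed band."""
    frac_pos = (x > 0).float().mean(dim=0)   # fraction positive per channel
    mean_abs = x.abs().mean(dim=0)           # mean magnitude per channel
    too_negative = frac_pos < min_positive   # channel almost never positive
    too_positive = frac_pos > max_positive   # channel almost never negative
    too_small = mean_abs < min_abs
    too_large = mean_abs > max_abs
    return too_negative | too_positive | too_small | too_large

x = torch.randn(4000, 256)
print("channels needing correction:", int(balancer_stats(x).sum()))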
2023-06-20 02:59:58,069 INFO [train.py:996] (3/4) Epoch 4, batch 10800, loss[loss=0.2492, simple_loss=0.324, pruned_loss=0.08716, over 21903.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3378, pruned_loss=0.09873, over 4260430.89 frames. ], batch size: 316, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 03:00:11,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=613704.0, ans=0.125
2023-06-20 03:00:15,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=613764.0, ans=0.1
2023-06-20 03:00:15,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=613764.0, ans=0.0
2023-06-20 03:00:20,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=613764.0, ans=0.1
2023-06-20 03:00:43,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=613824.0, ans=0.05
2023-06-20 03:00:55,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=613824.0, ans=0.2
2023-06-20 03:00:56,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0
2023-06-20 03:01:15,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=613884.0, ans=0.125
2023-06-20 03:01:16,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.918e+02 3.193e+02 3.822e+02 6.360e+02, threshold=6.386e+02, percent-clipped=0.0
2023-06-20 03:01:41,160 INFO [train.py:996] (3/4) Epoch 4, batch 10850, loss[loss=0.2894, simple_loss=0.3483, pruned_loss=0.1153, over 21356.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3367, pruned_loss=0.09875, over 4259517.75 frames. ], batch size: 471, lr: 8.14e-03, grad_scale: 32.0
2023-06-20 03:01:54,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=614004.0, ans=0.1
2023-06-20 03:02:04,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614064.0, ans=0.1
2023-06-20 03:02:25,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.96 vs. limit=10.0
2023-06-20 03:02:36,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614124.0, ans=0.1
2023-06-20 03:03:13,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=614244.0, ans=0.125
2023-06-20 03:03:24,889 INFO [train.py:996] (3/4) Epoch 4, batch 10900, loss[loss=0.2541, simple_loss=0.3293, pruned_loss=0.0895, over 21591.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3293, pruned_loss=0.09626, over 4258223.33 frames. ], batch size: 414, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 03:03:51,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=614364.0, ans=0.125
2023-06-20 03:04:00,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=614364.0, ans=0.1
2023-06-20 03:04:43,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.644e+02 3.102e+02 4.157e+02 6.653e+02, threshold=6.203e+02, percent-clipped=4.0
2023-06-20 03:04:48,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=614484.0, ans=0.0
2023-06-20 03:04:55,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0
2023-06-20 03:05:07,344 INFO [train.py:996] (3/4) Epoch 4, batch 10950, loss[loss=0.2262, simple_loss=0.2959, pruned_loss=0.07826, over 21754.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3262, pruned_loss=0.09414, over 4254788.35 frames. ], batch size: 351, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 03:05:14,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=22.5
2023-06-20 03:06:09,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614724.0, ans=0.1
2023-06-20 03:06:24,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=614784.0, ans=0.125
2023-06-20 03:06:31,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=614784.0, ans=0.035
2023-06-20 03:06:31,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0
2023-06-20 03:06:49,619 INFO [train.py:996] (3/4) Epoch 4, batch 11000, loss[loss=0.2272, simple_loss=0.294, pruned_loss=0.08027, over 21502.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3238, pruned_loss=0.09504, over 4263054.29 frames. ], batch size: 212, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 03:06:57,335 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:08:07,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.675e+02 3.019e+02 3.528e+02 7.831e+02, threshold=6.038e+02, percent-clipped=1.0
2023-06-20 03:08:31,482 INFO [train.py:996] (3/4) Epoch 4, batch 11050, loss[loss=0.2175, simple_loss=0.2744, pruned_loss=0.08029, over 21775.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3225, pruned_loss=0.09674, over 4266828.27 frames. ], batch size: 351, lr: 8.13e-03, grad_scale: 32.0
2023-06-20 03:08:40,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=615204.0, ans=0.0
2023-06-20 03:09:44,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=615384.0, ans=0.0
2023-06-20 03:10:13,722 INFO [train.py:996] (3/4) Epoch 4, batch 11100, loss[loss=0.2806, simple_loss=0.3361, pruned_loss=0.1125, over 21351.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3203, pruned_loss=0.09667, over 4266632.39 frames. ], batch size: 471, lr: 8.13e-03, grad_scale: 16.0
2023-06-20 03:10:25,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=615504.0, ans=0.125
2023-06-20 03:11:00,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=615624.0, ans=0.2
2023-06-20 03:11:30,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=615684.0, ans=0.125
2023-06-20 03:11:33,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.025e+02 3.574e+02 4.736e+02 7.982e+02, threshold=7.148e+02, percent-clipped=11.0
2023-06-20 03:11:45,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0
2023-06-20 03:11:48,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=615744.0, ans=0.09899494936611666
2023-06-20 03:11:56,221 INFO [train.py:996] (3/4) Epoch 4, batch 11150, loss[loss=0.2677, simple_loss=0.3427, pruned_loss=0.09633, over 21770.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3172, pruned_loss=0.09596, over 4252284.42 frames. ], batch size: 371, lr: 8.12e-03, grad_scale: 16.0
2023-06-20 03:11:56,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=615804.0, ans=0.125
2023-06-20 03:12:29,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=615864.0, ans=0.05
2023-06-20 03:13:06,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0
2023-06-20 03:13:17,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=616044.0, ans=0.125
2023-06-20 03:13:33,706 INFO [train.py:996] (3/4) Epoch 4, batch 11200, loss[loss=0.2418, simple_loss=0.2936, pruned_loss=0.09504, over 21329.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3148, pruned_loss=0.09531, over 4255689.84 frames. ], batch size: 211, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 03:13:48,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=616164.0, ans=0.125
2023-06-20 03:14:09,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616164.0, ans=0.1
2023-06-20 03:14:16,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=616224.0, ans=0.2
2023-06-20 03:14:20,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0
2023-06-20 03:14:21,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=616224.0, ans=0.2
2023-06-20 03:14:54,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.789e+02 3.395e+02 4.282e+02 6.875e+02, threshold=6.790e+02, percent-clipped=0.0
2023-06-20 03:15:13,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=616344.0, ans=0.125
2023-06-20 03:15:17,026 INFO [train.py:996] (3/4) Epoch 4, batch 11250, loss[loss=0.2387, simple_loss=0.3269, pruned_loss=0.07522, over 21796.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.314, pruned_loss=0.09426, over 4260305.83 frames. ], batch size: 124, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 03:15:27,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=616404.0, ans=0.0
2023-06-20 03:16:16,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=616584.0, ans=0.0
2023-06-20 03:16:50,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=616644.0, ans=0.0
2023-06-20 03:17:00,604 INFO [train.py:996] (3/4) Epoch 4, batch 11300, loss[loss=0.2245, simple_loss=0.293, pruned_loss=0.07796, over 21813.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3166, pruned_loss=0.09454, over 4272658.32 frames. ], batch size: 118, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 03:17:48,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616824.0, ans=0.1
2023-06-20 03:18:07,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0
2023-06-20 03:18:21,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.889e+02 3.588e+02 4.478e+02 6.707e+02, threshold=7.176e+02, percent-clipped=0.0
2023-06-20 03:18:29,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=616944.0, ans=0.125
2023-06-20 03:18:34,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=616944.0, ans=0.0
2023-06-20 03:18:44,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=617004.0, ans=0.0
2023-06-20 03:18:45,389 INFO [train.py:996] (3/4) Epoch 4, batch 11350, loss[loss=0.3091, simple_loss=0.3723, pruned_loss=0.1229, over 21889.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3183, pruned_loss=0.09398, over 4263426.13 frames. ], batch size: 372, lr: 8.12e-03, grad_scale: 32.0
2023-06-20 03:19:05,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=617004.0, ans=0.125
2023-06-20 03:19:18,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617064.0, ans=0.1
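Names such as attention_skip_rate, conv_skip_rate, ff2_skip_rate and bypass.skip_rate in these records control stochastic-depth style training: with the scheduled probability a sub-module's contribution is dropped, or the whole layer is bypassed, for a batch. By 600k+ batches most of these schedules have decayed to 0.0, while the bypass rates sit at endpoint-derived values (0.09899494936611666 is apparently 0.07 * sqrt(2), presumably an artifact of how the schedule endpoints are defined; that reading is my inference). A minimal sketch of the skipping itself, under an assumed skip_rate:

import torch

def maybe_skip(module, x, skip_rate, training=True):
    """Stochastic-depth style skipping: with probability skip_rate the
    sub-module's residual contribution is dropped for the whole batch."""
    if training and torch.rand(()) < skip_rate:
        return x                    # bypass: identity
    return x + module(x)            # normal residual contribution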
2023-06-20 03:20:41,057 INFO [train.py:996] (3/4) Epoch 4, batch 11400, loss[loss=0.2267, simple_loss=0.3079, pruned_loss=0.07271, over 21410.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3271, pruned_loss=0.09804, over 4267836.04 frames. ], batch size: 194, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 03:20:41,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0
2023-06-20 03:20:48,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=617304.0, ans=0.125
2023-06-20 03:21:05,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0
2023-06-20 03:21:29,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0
2023-06-20 03:21:53,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.300e+02 4.007e+02 5.041e+02 8.202e+02, threshold=8.013e+02, percent-clipped=6.0
2023-06-20 03:22:06,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617544.0, ans=0.125
2023-06-20 03:22:26,651 INFO [train.py:996] (3/4) Epoch 4, batch 11450, loss[loss=0.2795, simple_loss=0.3497, pruned_loss=0.1047, over 21637.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3288, pruned_loss=0.09669, over 4263961.37 frames. ], batch size: 389, lr: 8.11e-03, grad_scale: 16.0
2023-06-20 03:22:33,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=617604.0, ans=0.125
2023-06-20 03:22:34,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=617604.0, ans=0.125
2023-06-20 03:23:30,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=617784.0, ans=0.025
2023-06-20 03:23:41,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=617784.0, ans=0.2
2023-06-20 03:23:42,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=617844.0, ans=0.0
2023-06-20 03:24:00,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=617844.0, ans=0.125
2023-06-20 03:24:05,926 INFO [train.py:996] (3/4) Epoch 4, batch 11500, loss[loss=0.2885, simple_loss=0.3526, pruned_loss=0.1122, over 21470.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3321, pruned_loss=0.09788, over 4268948.04 frames. ], batch size: 211, lr: 8.11e-03, grad_scale: 16.0
2023-06-20 03:24:50,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=618024.0, ans=0.0
2023-06-20 03:25:19,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.106e+02 3.529e+02 4.777e+02 9.700e+02, threshold=7.057e+02, percent-clipped=3.0
2023-06-20 03:25:35,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=618144.0, ans=0.0
2023-06-20 03:25:36,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=618144.0, ans=0.07
2023-06-20 03:25:40,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=618144.0, ans=0.125
2023-06-20 03:25:47,089 INFO [train.py:996] (3/4) Epoch 4, batch 11550, loss[loss=0.2003, simple_loss=0.252, pruned_loss=0.0743, over 16725.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3378, pruned_loss=0.09812, over 4266246.93 frames. ], batch size: 60, lr: 8.11e-03, grad_scale: 16.0
2023-06-20 03:25:57,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618204.0, ans=0.1
2023-06-20 03:26:23,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=618264.0, ans=0.125
2023-06-20 03:26:25,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0
2023-06-20 03:27:06,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618384.0, ans=0.1
2023-06-20 03:27:34,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=618444.0, ans=0.125
2023-06-20 03:27:37,863 INFO [train.py:996] (3/4) Epoch 4, batch 11600, loss[loss=0.2838, simple_loss=0.3736, pruned_loss=0.09703, over 21657.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3495, pruned_loss=0.09949, over 4266989.59 frames. ], batch size: 263, lr: 8.11e-03, grad_scale: 32.0
2023-06-20 03:27:38,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=618504.0, ans=0.0
2023-06-20 03:28:55,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.978e+02 3.550e+02 4.282e+02 6.763e+02, threshold=7.099e+02, percent-clipped=1.0
2023-06-20 03:29:03,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-06-20 03:29:22,199 INFO [train.py:996] (3/4) Epoch 4, batch 11650, loss[loss=0.2559, simple_loss=0.3365, pruned_loss=0.08765, over 21193.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3554, pruned_loss=0.1003, over 4266274.98 frames. ], batch size: 159, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 03:29:30,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=618804.0, ans=0.0
2023-06-20 03:29:41,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=618804.0, ans=0.125
2023-06-20 03:30:53,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0
2023-06-20 03:30:59,644 INFO [train.py:996] (3/4) Epoch 4, batch 11700, loss[loss=0.2522, simple_loss=0.2993, pruned_loss=0.1026, over 21609.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.347, pruned_loss=0.1003, over 4268621.52 frames. ], batch size: 247, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 03:31:15,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0
2023-06-20 03:31:35,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=619164.0, ans=0.1
2023-06-20 03:32:02,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=619224.0, ans=0.035
2023-06-20 03:32:17,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.962e+02 3.618e+02 4.622e+02 7.851e+02, threshold=7.236e+02, percent-clipped=2.0
2023-06-20 03:32:49,956 INFO [train.py:996] (3/4) Epoch 4, batch 11750, loss[loss=0.2925, simple_loss=0.3942, pruned_loss=0.09541, over 19689.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3386, pruned_loss=0.09999, over 4249810.71 frames. ], batch size: 702, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 03:32:52,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=619404.0, ans=0.125
2023-06-20 03:32:54,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0
2023-06-20 03:33:11,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=619464.0, ans=0.125
2023-06-20 03:33:42,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=619524.0, ans=0.2
2023-06-20 03:33:56,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=619584.0, ans=0.125
2023-06-20 03:34:05,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=619584.0, ans=0.2
2023-06-20 03:34:09,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=619584.0, ans=0.0
2023-06-20 03:34:35,052 INFO [train.py:996] (3/4) Epoch 4, batch 11800, loss[loss=0.2548, simple_loss=0.3113, pruned_loss=0.09918, over 21852.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3399, pruned_loss=0.102, over 4260228.23 frames. ], batch size: 107, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 03:34:40,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=619704.0, ans=0.125
2023-06-20 03:34:53,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=619764.0, ans=0.125
2023-06-20 03:35:51,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.018e+02 3.704e+02 4.604e+02 8.251e+02, threshold=7.407e+02, percent-clipped=4.0
2023-06-20 03:36:10,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0
2023-06-20 03:36:15,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0
2023-06-20 03:36:18,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=620004.0, ans=0.125
2023-06-20 03:36:19,782 INFO [train.py:996] (3/4) Epoch 4, batch 11850, loss[loss=0.2231, simple_loss=0.3049, pruned_loss=0.07062, over 21158.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3408, pruned_loss=0.1014, over 4268621.35 frames. ], batch size: 143, lr: 8.10e-03, grad_scale: 32.0
2023-06-20 03:36:31,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=620004.0, ans=0.1
2023-06-20 03:37:14,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=620184.0, ans=0.0
2023-06-20 03:37:56,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=620244.0, ans=0.0
2023-06-20 03:37:59,343 INFO [train.py:996] (3/4) Epoch 4, batch 11900, loss[loss=0.2303, simple_loss=0.3152, pruned_loss=0.07264, over 21584.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3401, pruned_loss=0.09883, over 4267750.07 frames. ], batch size: 230, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:38:02,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=620304.0, ans=0.125
2023-06-20 03:38:05,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0
2023-06-20 03:38:13,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620304.0, ans=0.1
2023-06-20 03:38:39,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=620364.0, ans=0.95
2023-06-20 03:38:51,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=620424.0, ans=0.0
2023-06-20 03:39:18,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=620484.0, ans=0.125
2023-06-20 03:39:22,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.597e+02 3.084e+02 3.622e+02 5.304e+02, threshold=6.167e+02, percent-clipped=0.0
2023-06-20 03:39:31,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620544.0, ans=0.125
2023-06-20 03:39:44,827 INFO [train.py:996] (3/4) Epoch 4, batch 11950, loss[loss=0.2946, simple_loss=0.3877, pruned_loss=0.1007, over 21632.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3386, pruned_loss=0.09403, over 4268802.47 frames. ], batch size: 441, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:40:57,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=620784.0, ans=0.0
2023-06-20 03:41:07,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=620784.0, ans=0.95
2023-06-20 03:41:29,994 INFO [train.py:996] (3/4) Epoch 4, batch 12000, loss[loss=0.2225, simple_loss=0.2852, pruned_loss=0.07995, over 21418.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3355, pruned_loss=0.09227, over 4258266.99 frames. ], batch size: 211, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:41:29,994 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-20 03:41:44,410 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8197, 3.8771, 2.1777, 1.7688], device='cuda:3')
2023-06-20 03:41:51,444 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2681, simple_loss=0.3653, pruned_loss=0.08549, over 1796401.00 frames.
2023-06-20 03:41:51,445 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-20 03:42:04,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=620904.0, ans=0.0
2023-06-20 03:42:59,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=621084.0, ans=0.1
2023-06-20 03:43:04,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.836e+02 3.434e+02 3.961e+02 6.580e+02, threshold=6.867e+02, percent-clipped=2.0
2023-06-20 03:43:34,833 INFO [train.py:996] (3/4) Epoch 4, batch 12050, loss[loss=0.2731, simple_loss=0.3403, pruned_loss=0.103, over 21841.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3354, pruned_loss=0.09535, over 4264880.75 frames. ], batch size: 351, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:43:37,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5
2023-06-20 03:43:55,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=621264.0, ans=0.125
2023-06-20 03:44:21,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=621324.0, ans=0.2
2023-06-20 03:45:09,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=621444.0, ans=0.0
2023-06-20 03:45:21,127 INFO [train.py:996] (3/4) Epoch 4, batch 12100, loss[loss=0.2871, simple_loss=0.3509, pruned_loss=0.1116, over 21321.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3409, pruned_loss=0.0995, over 4267576.60 frames. ], batch size: 143, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:46:11,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=621624.0, ans=0.05
2023-06-20 03:46:14,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=621624.0, ans=0.0
2023-06-20 03:46:21,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0
2023-06-20 03:46:47,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.023e+02 3.734e+02 4.594e+02 9.342e+02, threshold=7.469e+02, percent-clipped=3.0
2023-06-20 03:46:58,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=621744.0, ans=0.125
2023-06-20 03:46:58,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=621744.0, ans=0.0
2023-06-20 03:46:59,910 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:47:14,421 INFO [train.py:996] (3/4) Epoch 4, batch 12150, loss[loss=0.2526, simple_loss=0.3236, pruned_loss=0.09083, over 21381.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3422, pruned_loss=0.09918, over 4268236.82 frames. ], batch size: 211, lr: 8.09e-03, grad_scale: 32.0
2023-06-20 03:47:17,920 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 03:47:37,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=621864.0, ans=0.0
2023-06-20 03:48:25,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0
2023-06-20 03:48:27,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=621984.0, ans=0.0
2023-06-20 03:48:39,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=622044.0, ans=0.125
2023-06-20 03:49:02,733 INFO [train.py:996] (3/4) Epoch 4, batch 12200, loss[loss=0.2429, simple_loss=0.2919, pruned_loss=0.09698, over 21284.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3382, pruned_loss=0.09849, over 4274386.77 frames. ], batch size: 160, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 03:49:26,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=622164.0, ans=0.125
2023-06-20 03:49:35,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=622224.0, ans=0.125
2023-06-20 03:49:37,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=622224.0, ans=0.125
2023-06-20 03:49:50,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=622224.0, ans=0.125
2023-06-20 03:50:17,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0
2023-06-20 03:50:18,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.874e+02 3.548e+02 4.515e+02 8.617e+02, threshold=7.096e+02, percent-clipped=2.0
2023-06-20 03:50:41,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622404.0, ans=0.1
2023-06-20 03:50:43,035 INFO [train.py:996] (3/4) Epoch 4, batch 12250, loss[loss=0.1611, simple_loss=0.2311, pruned_loss=0.04558, over 15936.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.331, pruned_loss=0.09562, over 4262476.68 frames. ], batch size: 60, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 03:50:53,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=622404.0, ans=0.0
2023-06-20 03:51:23,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=622524.0, ans=0.125
2023-06-20 03:51:26,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622524.0, ans=0.1
2023-06-20 03:51:42,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0
2023-06-20 03:52:14,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0
2023-06-20 03:52:15,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=622644.0, ans=0.125
2023-06-20 03:52:21,918 INFO [train.py:996] (3/4) Epoch 4, batch 12300, loss[loss=0.152, simple_loss=0.2234, pruned_loss=0.04028, over 21283.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3208, pruned_loss=0.08826, over 4259489.21 frames. ], batch size: 159, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 03:53:00,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0
2023-06-20 03:53:20,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0
2023-06-20 03:53:28,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=622824.0, ans=0.0
2023-06-20 03:53:32,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=622884.0, ans=0.125
2023-06-20 03:53:42,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.85 vs. limit=10.0
2023-06-20 03:53:45,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.451e+02 2.752e+02 3.675e+02 6.755e+02, threshold=5.504e+02, percent-clipped=0.0
2023-06-20 03:54:09,724 INFO [train.py:996] (3/4) Epoch 4, batch 12350, loss[loss=0.2581, simple_loss=0.323, pruned_loss=0.09654, over 21844.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3261, pruned_loss=0.08908, over 4263591.76 frames. ], batch size: 282, lr: 8.08e-03, grad_scale: 16.0
2023-06-20 03:54:59,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=623124.0, ans=0.125
2023-06-20 03:55:44,950 INFO [train.py:996] (3/4) Epoch 4, batch 12400, loss[loss=0.2824, simple_loss=0.333, pruned_loss=0.1159, over 21317.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3283, pruned_loss=0.09387, over 4271076.63 frames. ], batch size: 176, lr: 8.08e-03, grad_scale: 32.0
2023-06-20 03:55:55,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5
2023-06-20 03:56:08,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=623304.0, ans=0.125
2023-06-20 03:56:10,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=623364.0, ans=12.0
2023-06-20 03:56:41,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0
2023-06-20 03:56:49,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=623484.0, ans=0.125
2023-06-20 03:57:06,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.090e+02 3.958e+02 4.907e+02 8.874e+02, threshold=7.916e+02, percent-clipped=17.0
2023-06-20 03:57:33,446 INFO [train.py:996] (3/4) Epoch 4, batch 12450, loss[loss=0.2803, simple_loss=0.353, pruned_loss=0.1038, over 21947.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3327, pruned_loss=0.09704, over 4275400.88 frames. ], batch size: 316, lr: 8.07e-03, grad_scale: 16.0
2023-06-20 03:57:44,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=623604.0, ans=0.125
2023-06-20 03:58:04,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0
2023-06-20 03:58:15,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=623664.0, ans=0.125
2023-06-20 03:58:34,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=623784.0, ans=0.125
2023-06-20 03:58:51,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=623784.0, ans=0.125
2023-06-20 03:59:00,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=623784.0, ans=0.125
2023-06-20 03:59:19,789 INFO [train.py:996] (3/4) Epoch 4, batch 12500, loss[loss=0.3111, simple_loss=0.393, pruned_loss=0.1147, over 21434.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3434, pruned_loss=0.1016, over 4276145.01 frames. ], batch size: 194, lr: 8.07e-03, grad_scale: 16.0
2023-06-20 03:59:25,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0
2023-06-20 04:00:39,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=624084.0, ans=0.2
2023-06-20 04:00:53,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.989e+02 3.343e+02 3.829e+02 7.985e+02, threshold=6.687e+02, percent-clipped=1.0
2023-06-20 04:01:10,291 INFO [train.py:996] (3/4) Epoch 4, batch 12550, loss[loss=0.2622, simple_loss=0.3343, pruned_loss=0.095, over 21645.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.349, pruned_loss=0.1042, over 4279779.76 frames. ], batch size: 263, lr: 8.07e-03, grad_scale: 16.0
2023-06-20 04:01:10,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=624204.0, ans=0.125
2023-06-20 04:01:46,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0
2023-06-20 04:01:58,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0
2023-06-20 04:02:23,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=624384.0, ans=0.125
2023-06-20 04:02:36,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=624444.0, ans=0.0
2023-06-20 04:02:39,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=624444.0, ans=0.2
2023-06-20 04:03:04,244 INFO [train.py:996] (3/4) Epoch 4, batch 12600, loss[loss=0.2684, simple_loss=0.3472, pruned_loss=0.09476, over 21844.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3476, pruned_loss=0.1015, over 4275464.11 frames. ], batch size: 372, lr: 8.07e-03, grad_scale: 8.0
2023-06-20 04:03:55,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=624624.0, ans=0.125
2023-06-20 04:04:21,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.838e+02 3.340e+02 3.938e+02 7.249e+02, threshold=6.681e+02, percent-clipped=1.0
2023-06-20 04:04:40,903 INFO [train.py:996] (3/4) Epoch 4, batch 12650, loss[loss=0.2975, simple_loss=0.4032, pruned_loss=0.09589, over 20794.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3377, pruned_loss=0.09596, over 4270133.62 frames. ], batch size: 608, lr: 8.07e-03, grad_scale: 8.0
2023-06-20 04:04:49,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=624804.0, ans=0.125
2023-06-20 04:04:50,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624804.0, ans=0.0
2023-06-20 04:04:59,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=624864.0, ans=0.2
2023-06-20 04:06:16,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=625044.0, ans=6.0
2023-06-20 04:06:25,530 INFO [train.py:996] (3/4) Epoch 4, batch 12700, loss[loss=0.3366, simple_loss=0.3825, pruned_loss=0.1453, over 21425.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3375, pruned_loss=0.09854, over 4275737.76 frames. ], batch size: 471, lr: 8.06e-03, grad_scale: 8.0
2023-06-20 04:06:55,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0
2023-06-20 04:07:15,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=625224.0, ans=0.125
2023-06-20 04:07:18,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=625224.0, ans=0.2
2023-06-20 04:07:33,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=625284.0, ans=0.125
2023-06-20 04:07:34,993 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:07:37,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0
2023-06-20 04:07:53,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.213e+02 3.839e+02 4.784e+02 8.311e+02, threshold=7.678e+02, percent-clipped=6.0
2023-06-20 04:07:57,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=625344.0, ans=0.125
2023-06-20 04:08:07,938 INFO [train.py:996] (3/4) Epoch 4, batch 12750, loss[loss=0.289, simple_loss=0.3566, pruned_loss=0.1107, over 21811.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3375, pruned_loss=0.09886, over 4277976.94 frames. ], batch size: 415, lr: 8.06e-03, grad_scale: 8.0
2023-06-20 04:09:04,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=625584.0, ans=0.0
2023-06-20 04:09:09,668 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:09:34,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=625644.0, ans=0.0
2023-06-20 04:09:48,347 INFO [train.py:996] (3/4) Epoch 4, batch 12800, loss[loss=0.3041, simple_loss=0.3686, pruned_loss=0.1199, over 21301.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3369, pruned_loss=0.09958, over 4284938.11 frames. ], batch size: 144, lr: 8.06e-03, grad_scale: 16.0
2023-06-20 04:10:05,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=625704.0, ans=0.125
2023-06-20 04:10:24,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0
2023-06-20 04:10:50,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=625884.0, ans=0.0
2023-06-20 04:11:16,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.809e+02 3.205e+02 4.130e+02 6.634e+02, threshold=6.411e+02, percent-clipped=0.0
2023-06-20 04:11:23,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625944.0, ans=0.1
2023-06-20 04:11:37,267 INFO [train.py:996] (3/4) Epoch 4, batch 12850, loss[loss=0.2242, simple_loss=0.3158, pruned_loss=0.06627, over 21639.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3407, pruned_loss=0.1014, over 4288641.93 frames. ], batch size: 263, lr: 8.06e-03, grad_scale: 16.0
2023-06-20 04:12:01,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=626064.0, ans=0.125
2023-06-20 04:12:03,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=626064.0, ans=0.2
2023-06-20 04:12:14,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=626064.0, ans=0.05
2023-06-20 04:12:21,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=626124.0, ans=0.2
2023-06-20 04:12:40,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=626124.0, ans=0.1
2023-06-20 04:13:00,345 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:13:10,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=626244.0, ans=0.125
2023-06-20 04:13:13,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=626244.0, ans=0.125
2023-06-20 04:13:17,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5
2023-06-20 04:13:27,066 INFO [train.py:996] (3/4) Epoch 4, batch 12900, loss[loss=0.2217, simple_loss=0.2866, pruned_loss=0.07838, over 21793.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3385, pruned_loss=0.09792, over 4285482.69 frames. ], batch size: 118, lr: 8.06e-03, grad_scale: 16.0
2023-06-20 04:13:52,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=626364.0, ans=0.125
2023-06-20 04:14:50,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.539e+02 2.943e+02 3.440e+02 5.830e+02, threshold=5.886e+02, percent-clipped=0.0
2023-06-20 04:15:09,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=626544.0, ans=0.125
2023-06-20 04:15:12,137 INFO [train.py:996] (3/4) Epoch 4, batch 12950, loss[loss=0.2557, simple_loss=0.3268, pruned_loss=0.0923, over 21741.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3377, pruned_loss=0.09545, over 4275874.42 frames. ], batch size: 332, lr: 8.05e-03, grad_scale: 16.0
2023-06-20 04:16:27,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0
2023-06-20 04:16:50,301 INFO [train.py:996] (3/4) Epoch 4, batch 13000, loss[loss=0.3449, simple_loss=0.489, pruned_loss=0.1004, over 19709.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3392, pruned_loss=0.09536, over 4272307.37 frames. ], batch size: 702, lr: 8.05e-03, grad_scale: 16.0
2023-06-20 04:17:37,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=627024.0, ans=0.0
2023-06-20 04:17:48,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=627084.0, ans=0.07
2023-06-20 04:18:12,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0
2023-06-20 04:18:13,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.732e+02 3.360e+02 4.009e+02 6.698e+02, threshold=6.719e+02, percent-clipped=5.0
2023-06-20 04:18:33,504 INFO [train.py:996] (3/4) Epoch 4, batch 13050, loss[loss=0.2749, simple_loss=0.333, pruned_loss=0.1083, over 21530.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3346, pruned_loss=0.0932, over 4267717.87 frames. ], batch size: 131, lr: 8.05e-03, grad_scale: 16.0
2023-06-20 04:18:35,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=627204.0, ans=0.125
2023-06-20 04:18:35,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=627204.0, ans=0.1
2023-06-20 04:19:05,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=627264.0, ans=0.2
2023-06-20 04:19:17,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=627324.0, ans=0.125
2023-06-20 04:20:07,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=627444.0, ans=0.0
2023-06-20 04:20:18,868 INFO [train.py:996] (3/4) Epoch 4, batch 13100, loss[loss=0.2812, simple_loss=0.3568, pruned_loss=0.1028, over 21395.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3372, pruned_loss=0.09404, over 4276585.76 frames. ], batch size: 131, lr: 8.05e-03, grad_scale: 16.0
2023-06-20 04:20:27,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=627504.0, ans=0.05
2023-06-20 04:20:36,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=627504.0, ans=10.0
2023-06-20 04:20:39,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5
2023-06-20 04:20:47,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=627564.0, ans=0.04949747468305833
2023-06-20 04:20:49,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=627564.0, ans=0.5
2023-06-20 04:21:45,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=627744.0, ans=0.125
2023-06-20 04:21:47,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=627744.0, ans=0.125
2023-06-20 04:21:48,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.050e+02 3.631e+02 4.148e+02 7.105e+02, threshold=7.262e+02, percent-clipped=1.0
2023-06-20 04:22:03,669 INFO [train.py:996] (3/4) Epoch 4, batch 13150, loss[loss=0.2263, simple_loss=0.2527, pruned_loss=0.09997, over 20244.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3374, pruned_loss=0.0966, over 4280738.30 frames. ], batch size: 710, lr: 8.05e-03, grad_scale: 16.0
2023-06-20 04:22:24,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=627804.0, ans=0.125
2023-06-20 04:22:27,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=627804.0, ans=10.0
2023-06-20 04:24:02,001 INFO [train.py:996] (3/4) Epoch 4, batch 13200, loss[loss=0.2645, simple_loss=0.3245, pruned_loss=0.1023, over 21826.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3357, pruned_loss=0.09712, over 4280178.02 frames. ], batch size: 282, lr: 8.04e-03, grad_scale: 32.0
2023-06-20 04:25:27,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 2.669e+02 3.031e+02 3.643e+02 6.014e+02, threshold=6.063e+02, percent-clipped=0.0
2023-06-20 04:25:48,471 INFO [train.py:996] (3/4) Epoch 4, batch 13250, loss[loss=0.2543, simple_loss=0.3229, pruned_loss=0.09281, over 21784.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3361, pruned_loss=0.09882, over 4285102.05 frames. ], batch size: 112, lr: 8.04e-03, grad_scale: 32.0
2023-06-20 04:26:18,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0
2023-06-20 04:26:42,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.35 vs. limit=22.5
2023-06-20 04:26:50,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=628584.0, ans=0.125
2023-06-20 04:26:52,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=628584.0, ans=0.2
2023-06-20 04:27:13,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=628584.0, ans=0.125
2023-06-20 04:27:16,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=628644.0, ans=0.125
2023-06-20 04:27:23,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=628644.0, ans=0.125
2023-06-20 04:27:40,884 INFO [train.py:996] (3/4) Epoch 4, batch 13300, loss[loss=0.2947, simple_loss=0.3735, pruned_loss=0.108, over 21601.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3396, pruned_loss=0.0985, over 4286389.60 frames. ], batch size: 414, lr: 8.04e-03, grad_scale: 32.0
2023-06-20 04:28:11,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628764.0, ans=0.1
2023-06-20 04:29:02,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.810e+02 3.267e+02 3.707e+02 6.798e+02, threshold=6.534e+02, percent-clipped=1.0
2023-06-20 04:29:21,138 INFO [train.py:996] (3/4) Epoch 4, batch 13350, loss[loss=0.258, simple_loss=0.3348, pruned_loss=0.09058, over 20689.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3442, pruned_loss=0.1021, over 4289947.42 frames. ], batch size: 607, lr: 8.04e-03, grad_scale: 16.0
2023-06-20 04:29:27,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=629004.0, ans=0.125
2023-06-20 04:30:13,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=629124.0, ans=0.1
2023-06-20 04:30:35,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=629184.0, ans=0.1
2023-06-20 04:30:42,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=629244.0, ans=0.2
2023-06-20 04:31:05,722 INFO [train.py:996] (3/4) Epoch 4, batch 13400, loss[loss=0.2749, simple_loss=0.335, pruned_loss=0.1074, over 21428.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3459, pruned_loss=0.1042, over 4292195.53 frames. ], batch size: 211, lr: 8.04e-03, grad_scale: 16.0
2023-06-20 04:31:29,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=629364.0, ans=0.125
2023-06-20 04:32:14,969 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:32:38,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.043e+02 3.589e+02 4.349e+02 7.690e+02, threshold=7.178e+02, percent-clipped=6.0
2023-06-20 04:32:47,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5
2023-06-20 04:32:48,639 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:32:48,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=629544.0, ans=0.125
2023-06-20 04:32:51,536 INFO [train.py:996] (3/4) Epoch 4, batch 13450, loss[loss=0.2305, simple_loss=0.2942, pruned_loss=0.08342, over 21661.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3484, pruned_loss=0.1066, over 4284064.04 frames. ], batch size: 247, lr: 8.04e-03, grad_scale: 16.0
2023-06-20 04:33:10,144 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:33:44,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=629724.0, ans=0.125
2023-06-20 04:33:46,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=629724.0, ans=0.125
2023-06-20 04:33:49,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=629724.0, ans=0.07
2023-06-20 04:33:59,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=629784.0, ans=0.0
2023-06-20 04:34:21,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=629844.0, ans=0.05
2023-06-20 04:34:46,845 INFO [train.py:996] (3/4) Epoch 4, batch 13500, loss[loss=0.2515, simple_loss=0.3229, pruned_loss=0.09007, over 21561.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.337, pruned_loss=0.1023, over 4283208.34 frames. ], batch size: 441, lr: 8.03e-03, grad_scale: 16.0
2023-06-20 04:35:29,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=629964.0, ans=0.125
2023-06-20 04:35:41,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=630024.0, ans=0.04949747468305833
2023-06-20 04:35:41,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=630024.0, ans=0.125
2023-06-20 04:36:21,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.218e+02 3.754e+02 4.451e+02 7.704e+02, threshold=7.508e+02, percent-clipped=1.0
2023-06-20 04:36:34,722 INFO [train.py:996] (3/4) Epoch 4, batch 13550, loss[loss=0.3083, simple_loss=0.4036, pruned_loss=0.1065, over 21764.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3412, pruned_loss=0.1015, over 4284990.74 frames. ], batch size: 351, lr: 8.03e-03, grad_scale: 16.0
2023-06-20 04:37:37,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=630324.0, ans=0.0
2023-06-20 04:38:10,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0
2023-06-20 04:38:14,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=630444.0, ans=0.125
2023-06-20 04:38:18,734 INFO [train.py:996] (3/4) Epoch 4, batch 13600, loss[loss=0.2448, simple_loss=0.3067, pruned_loss=0.09145, over 21272.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.343, pruned_loss=0.1029, over 4290815.75 frames. ], batch size: 176, lr: 8.03e-03, grad_scale: 32.0
2023-06-20 04:38:20,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=630504.0, ans=0.125
2023-06-20 04:38:42,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=630504.0, ans=0.125
2023-06-20 04:39:15,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=630624.0, ans=0.0
2023-06-20 04:39:50,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.831e+02 3.255e+02 3.648e+02 6.704e+02, threshold=6.511e+02, percent-clipped=0.0
2023-06-20 04:39:55,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.11 vs. limit=12.0
2023-06-20 04:40:01,066 INFO [train.py:996] (3/4) Epoch 4, batch 13650, loss[loss=0.2199, simple_loss=0.2836, pruned_loss=0.07817, over 21719.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3379, pruned_loss=0.09976, over 4279015.68 frames. ], batch size: 316, lr: 8.03e-03, grad_scale: 16.0
2023-06-20 04:41:45,801 INFO [train.py:996] (3/4) Epoch 4, batch 13700, loss[loss=0.3639, simple_loss=0.4195, pruned_loss=0.1542, over 21494.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3332, pruned_loss=0.09943, over 4274765.67 frames. ], batch size: 508, lr: 8.03e-03, grad_scale: 16.0
2023-06-20 04:42:22,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0
2023-06-20 04:42:35,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=631224.0, ans=0.125
2023-06-20 04:43:04,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=631284.0, ans=0.0
2023-06-20 04:43:24,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.502e+02 4.364e+02 8.587e+02, threshold=7.005e+02, percent-clipped=5.0
2023-06-20 04:43:42,150 INFO [train.py:996] (3/4) Epoch 4, batch 13750, loss[loss=0.2126, simple_loss=0.3142, pruned_loss=0.0555, over 19802.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3322, pruned_loss=0.09866, over 4270900.40 frames. ], batch size: 703, lr: 8.02e-03, grad_scale: 16.0
2023-06-20 04:44:18,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=22.5
2023-06-20 04:44:44,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=631524.0, ans=0.125
2023-06-20 04:44:54,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=631584.0, ans=0.0
2023-06-20 04:45:08,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=631584.0, ans=0.0
2023-06-20 04:45:14,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=631644.0, ans=0.125
2023-06-20 04:45:32,622 INFO [train.py:996] (3/4) Epoch 4, batch 13800, loss[loss=0.2581, simple_loss=0.333, pruned_loss=0.09161, over 21220.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3365, pruned_loss=0.0969, over 4261273.40 frames. ], batch size: 159, lr: 8.02e-03, grad_scale: 16.0
2023-06-20 04:45:49,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=631704.0, ans=0.125
2023-06-20 04:45:49,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=631704.0, ans=0.125
2023-06-20 04:45:55,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0
2023-06-20 04:46:26,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=631824.0, ans=0.0
2023-06-20 04:47:06,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.073e+02 3.642e+02 4.560e+02 8.359e+02, threshold=7.284e+02, percent-clipped=3.0
2023-06-20 04:47:23,549 INFO [train.py:996] (3/4) Epoch 4, batch 13850, loss[loss=0.2676, simple_loss=0.3363, pruned_loss=0.0995, over 21882.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3426, pruned_loss=0.09765, over 4270210.85 frames. ], batch size: 118, lr: 8.02e-03, grad_scale: 16.0
2023-06-20 04:47:53,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=632064.0, ans=0.125
2023-06-20 04:48:27,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=632124.0, ans=6.0
2023-06-20 04:49:02,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=632244.0, ans=0.015
2023-06-20 04:49:09,015 INFO [train.py:996] (3/4) Epoch 4, batch 13900, loss[loss=0.2702, simple_loss=0.3364, pruned_loss=0.102, over 21739.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3464, pruned_loss=0.1019, over 4273699.98 frames. ], batch size: 351, lr: 8.02e-03, grad_scale: 16.0
2023-06-20 04:49:43,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.03 vs. limit=15.0
2023-06-20 04:49:59,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=632424.0, ans=0.0
2023-06-20 04:50:13,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5
2023-06-20 04:50:25,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=632484.0, ans=0.0
2023-06-20 04:50:40,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.556e+02 4.265e+02 5.269e+02 7.474e+02, threshold=8.529e+02, percent-clipped=1.0
2023-06-20 04:50:48,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=632544.0, ans=0.1
2023-06-20 04:50:52,491 INFO [train.py:996] (3/4) Epoch 4, batch 13950, loss[loss=0.2585, simple_loss=0.3186, pruned_loss=0.09922, over 21795.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3461, pruned_loss=0.1044, over 4279733.29 frames. ], batch size: 247, lr: 8.02e-03, grad_scale: 16.0
2023-06-20 04:51:06,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0
2023-06-20 04:51:11,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0
2023-06-20 04:51:56,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5
2023-06-20 04:52:01,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=12.0
2023-06-20 04:52:02,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=632784.0, ans=0.035
2023-06-20 04:52:27,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=632844.0, ans=0.125
2023-06-20 04:52:35,132 INFO [train.py:996] (3/4) Epoch 4, batch 14000, loss[loss=0.2178, simple_loss=0.3173, pruned_loss=0.05916, over 21366.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3392, pruned_loss=0.1002, over 4272535.28 frames. ], batch size: 548, lr: 8.01e-03, grad_scale: 32.0
2023-06-20 04:54:05,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.775e+02 3.339e+02 4.026e+02 8.444e+02, threshold=6.679e+02, percent-clipped=0.0
2023-06-20 04:54:17,311 INFO [train.py:996] (3/4) Epoch 4, batch 14050, loss[loss=0.2316, simple_loss=0.2892, pruned_loss=0.08698, over 21955.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3337, pruned_loss=0.09623, over 4272618.80 frames. ], batch size: 103, lr: 8.01e-03, grad_scale: 32.0
2023-06-20 04:54:40,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=633264.0, ans=0.0
2023-06-20 04:55:00,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=633324.0, ans=0.125
2023-06-20 04:55:02,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=633324.0, ans=0.2
2023-06-20 04:55:03,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0
2023-06-20 04:55:16,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=633324.0, ans=0.2
2023-06-20 04:55:33,199 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:55:33,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=633384.0, ans=0.125
2023-06-20 04:56:00,935 INFO [train.py:996] (3/4) Epoch 4, batch 14100, loss[loss=0.2299, simple_loss=0.28, pruned_loss=0.08985, over 21399.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3292, pruned_loss=0.09652, over 4263297.06 frames. ], batch size: 194, lr: 8.01e-03, grad_scale: 32.0
2023-06-20 04:56:11,845 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:56:13,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0
2023-06-20 04:56:58,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=633624.0, ans=0.0
2023-06-20 04:57:23,689 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-20 04:57:34,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.697e+02 3.241e+02 4.086e+02 6.665e+02, threshold=6.483e+02, percent-clipped=0.0
2023-06-20 04:57:43,876 INFO [train.py:996] (3/4) Epoch 4, batch 14150, loss[loss=0.26, simple_loss=0.3308, pruned_loss=0.09461, over 21304.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3334, pruned_loss=0.09782, over 4259801.85 frames. ], batch size: 159, lr: 8.01e-03, grad_scale: 16.0
2023-06-20 04:57:49,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=633804.0, ans=0.125
2023-06-20 04:57:59,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=633804.0, ans=0.125
2023-06-20 04:58:34,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=633924.0, ans=0.125
2023-06-20 04:59:24,537 INFO [train.py:996] (3/4) Epoch 4, batch 14200, loss[loss=0.2858, simple_loss=0.3199, pruned_loss=0.1258, over 21433.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3308, pruned_loss=0.09544, over 4266270.26 frames. ], batch size: 508, lr: 8.01e-03, grad_scale: 16.0
2023-06-20 04:59:36,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=634104.0, ans=0.125
2023-06-20 04:59:51,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=634164.0, ans=0.025
2023-06-20 05:00:01,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=634224.0, ans=0.125
2023-06-20 05:00:08,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=634224.0, ans=0.125
2023-06-20 05:00:19,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=634224.0, ans=0.0
2023-06-20 05:00:31,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=634284.0, ans=0.0
2023-06-20 05:00:52,664 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.482e+02 2.802e+02 3.379e+02 6.129e+02, threshold=5.605e+02, percent-clipped=0.0
2023-06-20 05:00:53,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=634344.0, ans=10.0
2023-06-20 05:00:56,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=634344.0, ans=0.1
2023-06-20 05:01:07,609 INFO [train.py:996] (3/4) Epoch 4, batch 14250, loss[loss=0.2206, simple_loss=0.2862, pruned_loss=0.07749, over 21626.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3254, pruned_loss=0.0952, over 4261661.05 frames. ], batch size: 247, lr: 8.00e-03, grad_scale: 16.0
2023-06-20 05:01:15,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0
2023-06-20 05:01:31,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=634464.0, ans=0.125
2023-06-20 05:01:53,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=634524.0, ans=0.1
2023-06-20 05:01:55,635 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.544e-03
2023-06-20 05:02:24,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=634584.0, ans=0.125
2023-06-20 05:02:29,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=634644.0, ans=0.0
2023-06-20 05:02:34,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=634644.0, ans=0.0
2023-06-20 05:02:46,937 INFO [train.py:996] (3/4) Epoch 4, batch 14300, loss[loss=0.3411, simple_loss=0.4192, pruned_loss=0.1316, over 21879.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3277, pruned_loss=0.09518, over 4263717.31 frames. ], batch size: 372, lr: 8.00e-03, grad_scale: 16.0
2023-06-20 05:03:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=634824.0, ans=0.0
2023-06-20 05:04:21,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.821e+02 3.422e+02 4.230e+02 9.010e+02, threshold=6.844e+02, percent-clipped=9.0
2023-06-20 05:04:31,912 INFO [train.py:996] (3/4) Epoch 4, batch 14350, loss[loss=0.2329, simple_loss=0.3037, pruned_loss=0.08104, over 21798.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.331, pruned_loss=0.09499, over 4247351.14 frames. ], batch size: 247, lr: 8.00e-03, grad_scale: 16.0
2023-06-20 05:04:46,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.48 vs. limit=15.0
2023-06-20 05:04:52,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.19 vs. limit=10.0
2023-06-20 05:04:53,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=635064.0, ans=0.1
2023-06-20 05:05:05,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=635064.0, ans=0.04949747468305833
2023-06-20 05:05:20,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=635124.0, ans=0.02
2023-06-20 05:06:18,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=635304.0, ans=0.0
2023-06-20 05:06:19,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=635304.0, ans=15.0
2023-06-20 05:06:19,429 INFO [train.py:996] (3/4) Epoch 4, batch 14400, loss[loss=0.275, simple_loss=0.3277, pruned_loss=0.1112, over 21855.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3296, pruned_loss=0.0966, over 4248866.36 frames. ], batch size: 373, lr: 8.00e-03, grad_scale: 32.0
2023-06-20 05:06:57,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=635424.0, ans=0.0
2023-06-20 05:07:32,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0
2023-06-20 05:07:42,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.821e+02 3.350e+02 4.136e+02 6.839e+02, threshold=6.700e+02, percent-clipped=0.0
2023-06-20 05:07:57,240 INFO [train.py:996] (3/4) Epoch 4, batch 14450, loss[loss=0.2725, simple_loss=0.3223, pruned_loss=0.1113, over 21186.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3252, pruned_loss=0.09695, over 4250794.02 frames. ], batch size: 143, lr: 8.00e-03, grad_scale: 32.0
2023-06-20 05:08:36,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=635724.0, ans=15.0
2023-06-20 05:09:08,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=635784.0, ans=0.2
2023-06-20 05:09:23,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=635844.0, ans=0.0
2023-06-20 05:09:39,316 INFO [train.py:996] (3/4) Epoch 4, batch 14500, loss[loss=0.261, simple_loss=0.3414, pruned_loss=0.09027, over 21775.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3233, pruned_loss=0.09667, over 4258787.63 frames. ], batch size: 371, lr: 8.00e-03, grad_scale: 32.0
2023-06-20 05:09:40,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2023-06-20 05:10:03,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=635964.0, ans=0.125
2023-06-20 05:10:10,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=635964.0, ans=0.0
2023-06-20 05:10:12,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=635964.0, ans=0.2
2023-06-20 05:10:50,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0
2023-06-20 05:11:03,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0
2023-06-20 05:11:13,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.861e+02 3.336e+02 4.612e+02 7.217e+02, threshold=6.672e+02, percent-clipped=2.0
2023-06-20 05:11:14,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=636144.0, ans=0.125
2023-06-20 05:11:23,551 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-20 05:11:24,707 INFO [train.py:996] (3/4) Epoch 4, batch 14550, loss[loss=0.328, simple_loss=0.3865, pruned_loss=0.1348, over 21515.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3293, pruned_loss=0.09866, over 4264834.66 frames. ], batch size: 414, lr: 7.99e-03, grad_scale: 32.0
2023-06-20 05:11:51,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636264.0, ans=0.125
2023-06-20 05:12:09,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=636324.0, ans=0.125
limit=6.0 2023-06-20 05:12:57,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=636444.0, ans=0.015 2023-06-20 05:13:03,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=636444.0, ans=0.125 2023-06-20 05:13:16,327 INFO [train.py:996] (3/4) Epoch 4, batch 14600, loss[loss=0.2923, simple_loss=0.3755, pruned_loss=0.1045, over 21880.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3349, pruned_loss=0.1025, over 4267641.82 frames. ], batch size: 316, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:14:27,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-20 05:14:37,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-20 05:14:43,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.039e+02 3.637e+02 4.481e+02 9.662e+02, threshold=7.275e+02, percent-clipped=5.0 2023-06-20 05:14:53,020 INFO [train.py:996] (3/4) Epoch 4, batch 14650, loss[loss=0.2241, simple_loss=0.2925, pruned_loss=0.07785, over 21823.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3368, pruned_loss=0.1012, over 4270200.77 frames. ], batch size: 102, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:14:53,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=636804.0, ans=0.125 2023-06-20 05:15:20,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=636864.0, ans=0.125 2023-06-20 05:15:30,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=636864.0, ans=0.125 2023-06-20 05:16:18,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=637044.0, ans=0.2 2023-06-20 05:16:28,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=637044.0, ans=0.125 2023-06-20 05:16:40,840 INFO [train.py:996] (3/4) Epoch 4, batch 14700, loss[loss=0.214, simple_loss=0.2826, pruned_loss=0.0727, over 21313.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3289, pruned_loss=0.09399, over 4260443.13 frames. ], batch size: 131, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:16:44,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=637104.0, ans=0.0 2023-06-20 05:17:12,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=637164.0, ans=0.125 2023-06-20 05:17:23,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=637224.0, ans=0.0 2023-06-20 05:17:29,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=637224.0, ans=0.0 2023-06-20 05:17:35,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. 
limit=15.0 2023-06-20 05:17:41,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=637284.0, ans=0.2 2023-06-20 05:17:51,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=637284.0, ans=0.125 2023-06-20 05:18:12,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.351e+02 2.788e+02 3.519e+02 6.135e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-20 05:18:22,374 INFO [train.py:996] (3/4) Epoch 4, batch 14750, loss[loss=0.344, simple_loss=0.4085, pruned_loss=0.1397, over 21609.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3355, pruned_loss=0.09792, over 4262847.96 frames. ], batch size: 389, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:19:04,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=637524.0, ans=0.0 2023-06-20 05:19:36,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=637584.0, ans=0.125 2023-06-20 05:19:52,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-20 05:20:03,218 INFO [train.py:996] (3/4) Epoch 4, batch 14800, loss[loss=0.2772, simple_loss=0.334, pruned_loss=0.1102, over 21181.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3489, pruned_loss=0.1044, over 4267875.85 frames. ], batch size: 159, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 05:20:26,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=637704.0, ans=0.125 2023-06-20 05:20:59,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:21:04,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-20 05:21:19,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-20 05:21:30,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=637944.0, ans=0.125 2023-06-20 05:21:38,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=637944.0, ans=0.125 2023-06-20 05:21:42,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.164e+02 3.889e+02 4.731e+02 8.129e+02, threshold=7.778e+02, percent-clipped=15.0 2023-06-20 05:22:00,573 INFO [train.py:996] (3/4) Epoch 4, batch 14850, loss[loss=0.2708, simple_loss=0.316, pruned_loss=0.1127, over 14631.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.342, pruned_loss=0.1033, over 4266324.07 frames. ], batch size: 60, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:23:20,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.00 vs. 
limit=22.5 2023-06-20 05:23:34,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=638244.0, ans=0.0 2023-06-20 05:23:41,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-20 05:23:46,622 INFO [train.py:996] (3/4) Epoch 4, batch 14900, loss[loss=0.2774, simple_loss=0.3423, pruned_loss=0.1063, over 21729.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3479, pruned_loss=0.1062, over 4265527.95 frames. ], batch size: 298, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:23:52,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=638304.0, ans=0.0 2023-06-20 05:24:29,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=638424.0, ans=0.5 2023-06-20 05:24:38,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=638424.0, ans=10.0 2023-06-20 05:25:25,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.971e+02 3.790e+02 5.715e+02 1.373e+03, threshold=7.580e+02, percent-clipped=7.0 2023-06-20 05:25:32,236 INFO [train.py:996] (3/4) Epoch 4, batch 14950, loss[loss=0.2616, simple_loss=0.3389, pruned_loss=0.09215, over 21889.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3473, pruned_loss=0.1047, over 4268411.84 frames. ], batch size: 317, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:25:32,718 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:25:36,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=638604.0, ans=0.04949747468305833 2023-06-20 05:26:02,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=638664.0, ans=0.0 2023-06-20 05:26:06,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=638664.0, ans=0.125 2023-06-20 05:26:06,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=638664.0, ans=0.0 2023-06-20 05:27:16,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=638904.0, ans=12.0 2023-06-20 05:27:16,851 INFO [train.py:996] (3/4) Epoch 4, batch 15000, loss[loss=0.2866, simple_loss=0.3373, pruned_loss=0.1179, over 21316.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3484, pruned_loss=0.1063, over 4272651.17 frames. ], batch size: 143, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:27:16,851 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 05:27:30,602 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8031, 4.1630, 4.2521, 4.5016], device='cuda:3') 2023-06-20 05:27:33,901 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2743, simple_loss=0.3665, pruned_loss=0.09108, over 1796401.00 frames. 2023-06-20 05:27:33,902 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 05:27:45,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. 
limit=12.0 2023-06-20 05:28:50,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639084.0, ans=0.1 2023-06-20 05:29:07,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=639144.0, ans=0.2 2023-06-20 05:29:12,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.360e+02 3.927e+02 4.837e+02 8.029e+02, threshold=7.853e+02, percent-clipped=2.0 2023-06-20 05:29:14,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=639144.0, ans=0.125 2023-06-20 05:29:24,461 INFO [train.py:996] (3/4) Epoch 4, batch 15050, loss[loss=0.2528, simple_loss=0.3096, pruned_loss=0.09802, over 21218.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3482, pruned_loss=0.1071, over 4271163.45 frames. ], batch size: 159, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:29:28,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=639204.0, ans=0.2 2023-06-20 05:29:31,336 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:29:32,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=639204.0, ans=0.125 2023-06-20 05:30:08,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-20 05:31:01,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=639444.0, ans=0.05 2023-06-20 05:31:09,497 INFO [train.py:996] (3/4) Epoch 4, batch 15100, loss[loss=0.2801, simple_loss=0.3539, pruned_loss=0.1032, over 21782.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3508, pruned_loss=0.1066, over 4271870.24 frames. ], batch size: 124, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:31:11,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=639504.0, ans=0.2 2023-06-20 05:31:21,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=639504.0, ans=0.5 2023-06-20 05:32:19,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=639684.0, ans=0.125 2023-06-20 05:32:35,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-20 05:32:45,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.063e+02 3.378e+02 3.992e+02 7.623e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-20 05:32:52,574 INFO [train.py:996] (3/4) Epoch 4, batch 15150, loss[loss=0.2566, simple_loss=0.3065, pruned_loss=0.1033, over 21356.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3466, pruned_loss=0.1068, over 4268595.72 frames. 
], batch size: 177, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:32:53,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=639804.0, ans=0.125 2023-06-20 05:33:16,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=639804.0, ans=0.1 2023-06-20 05:33:48,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=639924.0, ans=0.125 2023-06-20 05:34:11,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=639984.0, ans=0.125 2023-06-20 05:34:23,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=640044.0, ans=0.07 2023-06-20 05:34:41,827 INFO [train.py:996] (3/4) Epoch 4, batch 15200, loss[loss=0.2349, simple_loss=0.2999, pruned_loss=0.08491, over 21271.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.338, pruned_loss=0.1016, over 4273013.43 frames. ], batch size: 176, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:34:44,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-20 05:36:14,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.035e+02 3.960e+02 4.645e+02 7.650e+02, threshold=7.920e+02, percent-clipped=3.0 2023-06-20 05:36:25,891 INFO [train.py:996] (3/4) Epoch 4, batch 15250, loss[loss=0.2216, simple_loss=0.2845, pruned_loss=0.07937, over 21313.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3338, pruned_loss=0.09997, over 4256355.86 frames. ], batch size: 549, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:36:26,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=640404.0, ans=0.125 2023-06-20 05:36:47,035 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:37:03,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640464.0, ans=0.125 2023-06-20 05:37:29,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=640584.0, ans=0.0 2023-06-20 05:37:37,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640584.0, ans=0.1 2023-06-20 05:38:17,083 INFO [train.py:996] (3/4) Epoch 4, batch 15300, loss[loss=0.3159, simple_loss=0.3819, pruned_loss=0.1249, over 21811.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.335, pruned_loss=0.1033, over 4260468.51 frames. ], batch size: 124, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:38:20,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640704.0, ans=0.125 2023-06-20 05:39:08,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=640824.0, ans=0.0 2023-06-20 05:39:10,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. 
limit=15.0 2023-06-20 05:39:28,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=640884.0, ans=10.0 2023-06-20 05:39:35,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=640944.0, ans=0.125 2023-06-20 05:39:47,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=640944.0, ans=0.0 2023-06-20 05:39:54,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 2.881e+02 3.296e+02 3.984e+02 9.139e+02, threshold=6.591e+02, percent-clipped=2.0 2023-06-20 05:40:01,916 INFO [train.py:996] (3/4) Epoch 4, batch 15350, loss[loss=0.3139, simple_loss=0.3726, pruned_loss=0.1276, over 21758.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3397, pruned_loss=0.106, over 4268070.57 frames. ], batch size: 441, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:40:02,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=641004.0, ans=0.125 2023-06-20 05:40:38,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=641124.0, ans=0.125 2023-06-20 05:40:52,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=641124.0, ans=0.125 2023-06-20 05:40:55,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=641184.0, ans=0.125 2023-06-20 05:41:06,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=641184.0, ans=0.125 2023-06-20 05:41:36,549 INFO [train.py:996] (3/4) Epoch 4, batch 15400, loss[loss=0.2552, simple_loss=0.3301, pruned_loss=0.09014, over 21866.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3403, pruned_loss=0.1047, over 4279927.79 frames. ], batch size: 124, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:41:51,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=641304.0, ans=0.0 2023-06-20 05:41:53,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=641304.0, ans=0.125 2023-06-20 05:41:54,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=641304.0, ans=0.125 2023-06-20 05:42:24,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=641424.0, ans=0.0 2023-06-20 05:42:34,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=641484.0, ans=0.125 2023-06-20 05:43:07,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.755e+02 3.304e+02 3.947e+02 7.271e+02, threshold=6.607e+02, percent-clipped=2.0 2023-06-20 05:43:19,990 INFO [train.py:996] (3/4) Epoch 4, batch 15450, loss[loss=0.2476, simple_loss=0.3231, pruned_loss=0.08602, over 21817.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3382, pruned_loss=0.1039, over 4286757.35 frames. 
], batch size: 282, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:44:01,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=641724.0, ans=0.0 2023-06-20 05:44:42,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=641844.0, ans=0.1 2023-06-20 05:44:51,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=12.0 2023-06-20 05:45:10,514 INFO [train.py:996] (3/4) Epoch 4, batch 15500, loss[loss=0.2696, simple_loss=0.3315, pruned_loss=0.1038, over 21843.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3418, pruned_loss=0.1035, over 4272833.15 frames. ], batch size: 102, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:45:11,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=641904.0, ans=0.2 2023-06-20 05:45:41,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=641964.0, ans=0.125 2023-06-20 05:45:57,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=642024.0, ans=0.05 2023-06-20 05:46:25,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=642084.0, ans=0.04949747468305833 2023-06-20 05:46:25,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=642084.0, ans=0.07 2023-06-20 05:46:56,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.827e+02 3.458e+02 4.345e+02 6.798e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-20 05:47:00,855 INFO [train.py:996] (3/4) Epoch 4, batch 15550, loss[loss=0.2671, simple_loss=0.3479, pruned_loss=0.09312, over 21632.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.34, pruned_loss=0.1002, over 4273902.86 frames. ], batch size: 414, lr: 7.96e-03, grad_scale: 16.0 2023-06-20 05:47:06,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=642204.0, ans=0.0 2023-06-20 05:47:39,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=642324.0, ans=0.125 2023-06-20 05:47:42,562 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.022e-03 2023-06-20 05:48:22,500 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:48:26,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=642444.0, ans=0.0 2023-06-20 05:48:28,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642444.0, ans=0.1 2023-06-20 05:48:36,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642444.0, ans=0.1 2023-06-20 05:48:39,090 INFO [train.py:996] (3/4) Epoch 4, batch 15600, loss[loss=0.2488, simple_loss=0.3096, pruned_loss=0.09401, over 21172.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3336, pruned_loss=0.09851, over 4268964.24 frames. 
], batch size: 143, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 05:49:00,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=642564.0, ans=0.125 2023-06-20 05:49:02,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=642564.0, ans=0.125 2023-06-20 05:50:15,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-20 05:50:17,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.694e+02 3.221e+02 3.969e+02 6.566e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 05:50:21,190 INFO [train.py:996] (3/4) Epoch 4, batch 15650, loss[loss=0.2412, simple_loss=0.3049, pruned_loss=0.08873, over 21714.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3315, pruned_loss=0.09712, over 4271451.41 frames. ], batch size: 316, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:50:25,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=642804.0, ans=0.2 2023-06-20 05:50:38,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=642864.0, ans=0.125 2023-06-20 05:50:44,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-20 05:50:47,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=12.0 2023-06-20 05:50:52,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-20 05:51:41,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=642984.0, ans=0.0 2023-06-20 05:51:56,499 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:52:01,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643044.0, ans=0.1 2023-06-20 05:52:06,088 INFO [train.py:996] (3/4) Epoch 4, batch 15700, loss[loss=0.2753, simple_loss=0.3416, pruned_loss=0.1045, over 21508.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3282, pruned_loss=0.09629, over 4267710.09 frames. ], batch size: 389, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:52:11,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-20 05:52:34,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=643164.0, ans=0.2 2023-06-20 05:53:20,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=643284.0, ans=0.125 2023-06-20 05:53:46,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.685e+02 3.204e+02 3.710e+02 6.179e+02, threshold=6.407e+02, percent-clipped=0.0 2023-06-20 05:53:49,710 INFO [train.py:996] (3/4) Epoch 4, batch 15750, loss[loss=0.2433, simple_loss=0.3096, pruned_loss=0.08846, over 21400.00 frames. 
], tot_loss[loss=0.2574, simple_loss=0.3233, pruned_loss=0.09572, over 4266489.95 frames. ], batch size: 194, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:54:07,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=643464.0, ans=0.025 2023-06-20 05:54:27,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-20 05:55:14,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=643644.0, ans=0.2 2023-06-20 05:55:31,421 INFO [train.py:996] (3/4) Epoch 4, batch 15800, loss[loss=0.285, simple_loss=0.3497, pruned_loss=0.1102, over 20097.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3178, pruned_loss=0.09488, over 4250382.89 frames. ], batch size: 703, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:55:39,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=643704.0, ans=0.125 2023-06-20 05:55:45,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=643704.0, ans=10.0 2023-06-20 05:55:47,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=643764.0, ans=0.125 2023-06-20 05:55:52,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=643764.0, ans=0.125 2023-06-20 05:56:32,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=643884.0, ans=0.125 2023-06-20 05:57:11,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.033e+02 3.511e+02 4.104e+02 6.063e+02, threshold=7.023e+02, percent-clipped=0.0 2023-06-20 05:57:14,442 INFO [train.py:996] (3/4) Epoch 4, batch 15850, loss[loss=0.2235, simple_loss=0.2783, pruned_loss=0.08435, over 21444.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3192, pruned_loss=0.09709, over 4250796.51 frames. ], batch size: 212, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:57:21,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=644004.0, ans=0.125 2023-06-20 05:58:06,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=644124.0, ans=0.0 2023-06-20 05:58:19,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=644184.0, ans=0.125 2023-06-20 05:58:36,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=644244.0, ans=0.125 2023-06-20 05:58:47,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644304.0, ans=0.1 2023-06-20 05:58:48,829 INFO [train.py:996] (3/4) Epoch 4, batch 15900, loss[loss=0.2388, simple_loss=0.3237, pruned_loss=0.07693, over 21688.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3172, pruned_loss=0.09721, over 4240654.18 frames. 
], batch size: 332, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 05:58:49,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=644304.0, ans=0.125 2023-06-20 05:59:03,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=644304.0, ans=0.2 2023-06-20 05:59:22,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=644424.0, ans=0.125 2023-06-20 05:59:56,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=644484.0, ans=0.125 2023-06-20 06:00:15,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=644544.0, ans=0.125 2023-06-20 06:00:19,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=644544.0, ans=0.125 2023-06-20 06:00:28,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.691e+02 3.018e+02 3.866e+02 6.282e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-20 06:00:32,129 INFO [train.py:996] (3/4) Epoch 4, batch 15950, loss[loss=0.2414, simple_loss=0.3213, pruned_loss=0.08081, over 21673.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.318, pruned_loss=0.0947, over 4242663.73 frames. ], batch size: 389, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 06:00:38,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=644604.0, ans=0.0 2023-06-20 06:00:39,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-20 06:00:45,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=644604.0, ans=0.0 2023-06-20 06:01:08,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=644724.0, ans=0.125 2023-06-20 06:01:38,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=644784.0, ans=0.125 2023-06-20 06:01:54,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=644844.0, ans=0.125 2023-06-20 06:02:14,370 INFO [train.py:996] (3/4) Epoch 4, batch 16000, loss[loss=0.2141, simple_loss=0.3033, pruned_loss=0.06246, over 21583.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3196, pruned_loss=0.09121, over 4248651.46 frames. ], batch size: 230, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:02:15,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-20 06:03:48,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=645144.0, ans=0.2 2023-06-20 06:03:51,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=645144.0, ans=0.0 2023-06-20 06:03:53,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=22.5 2023-06-20 06:03:53,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.704e+02 3.132e+02 3.952e+02 7.192e+02, threshold=6.264e+02, percent-clipped=3.0 2023-06-20 06:03:57,352 INFO [train.py:996] (3/4) Epoch 4, batch 16050, loss[loss=0.258, simple_loss=0.3409, pruned_loss=0.08751, over 21329.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3215, pruned_loss=0.08884, over 4259993.05 frames. ], batch size: 194, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:05:35,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=645444.0, ans=12.0 2023-06-20 06:05:40,815 INFO [train.py:996] (3/4) Epoch 4, batch 16100, loss[loss=0.2362, simple_loss=0.3032, pruned_loss=0.08464, over 21929.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3256, pruned_loss=0.09083, over 4267852.49 frames. ], batch size: 316, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:05:43,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=645504.0, ans=0.0 2023-06-20 06:05:48,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-20 06:06:02,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=645564.0, ans=0.0 2023-06-20 06:06:07,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=645564.0, ans=0.0 2023-06-20 06:07:20,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.023e+02 3.501e+02 4.351e+02 8.172e+02, threshold=7.003e+02, percent-clipped=2.0 2023-06-20 06:07:23,540 INFO [train.py:996] (3/4) Epoch 4, batch 16150, loss[loss=0.2483, simple_loss=0.3096, pruned_loss=0.09349, over 21668.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3279, pruned_loss=0.09389, over 4279010.79 frames. ], batch size: 263, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:08:12,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=645984.0, ans=0.125 2023-06-20 06:08:25,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=645984.0, ans=0.125 2023-06-20 06:08:38,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-20 06:09:02,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=646044.0, ans=0.125 2023-06-20 06:09:06,386 INFO [train.py:996] (3/4) Epoch 4, batch 16200, loss[loss=0.3121, simple_loss=0.3692, pruned_loss=0.1274, over 21653.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3338, pruned_loss=0.09607, over 4279448.05 frames. ], batch size: 263, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:10:08,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=22.5 2023-06-20 06:10:09,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=646284.0, ans=0.2 2023-06-20 06:10:40,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.687e+02 3.085e+02 4.076e+02 6.886e+02, threshold=6.170e+02, percent-clipped=1.0 2023-06-20 06:10:44,028 INFO [train.py:996] (3/4) Epoch 4, batch 16250, loss[loss=0.2155, simple_loss=0.2931, pruned_loss=0.06899, over 21805.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3327, pruned_loss=0.09591, over 4276409.21 frames. ], batch size: 372, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:10:52,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=646404.0, ans=0.2 2023-06-20 06:10:59,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=646464.0, ans=0.125 2023-06-20 06:11:48,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=646584.0, ans=0.0 2023-06-20 06:12:01,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=646584.0, ans=0.0 2023-06-20 06:12:17,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=646644.0, ans=0.2 2023-06-20 06:12:25,355 INFO [train.py:996] (3/4) Epoch 4, batch 16300, loss[loss=0.2765, simple_loss=0.3453, pruned_loss=0.1039, over 21361.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3275, pruned_loss=0.09244, over 4278059.94 frames. ], batch size: 507, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:12:57,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-20 06:14:05,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.535e+02 3.018e+02 3.417e+02 5.954e+02, threshold=6.036e+02, percent-clipped=0.0 2023-06-20 06:14:07,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-20 06:14:08,700 INFO [train.py:996] (3/4) Epoch 4, batch 16350, loss[loss=0.2741, simple_loss=0.3526, pruned_loss=0.09784, over 19923.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3253, pruned_loss=0.09259, over 4268292.21 frames. ], batch size: 703, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:14:14,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647004.0, ans=0.1 2023-06-20 06:15:25,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=647184.0, ans=0.2 2023-06-20 06:15:52,734 INFO [train.py:996] (3/4) Epoch 4, batch 16400, loss[loss=0.2528, simple_loss=0.3126, pruned_loss=0.09645, over 21848.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3295, pruned_loss=0.09538, over 4270746.70 frames. 
], batch size: 107, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:17:20,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=647544.0, ans=0.125 2023-06-20 06:17:31,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.888e+02 3.304e+02 3.936e+02 7.106e+02, threshold=6.607e+02, percent-clipped=3.0 2023-06-20 06:17:35,178 INFO [train.py:996] (3/4) Epoch 4, batch 16450, loss[loss=0.2541, simple_loss=0.3157, pruned_loss=0.09628, over 21250.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3308, pruned_loss=0.0974, over 4271353.76 frames. ], batch size: 143, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:17:36,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-20 06:18:12,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=647724.0, ans=0.0 2023-06-20 06:18:28,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=647724.0, ans=0.0 2023-06-20 06:18:36,720 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:18:47,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=647784.0, ans=0.125 2023-06-20 06:18:53,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=647784.0, ans=0.125 2023-06-20 06:18:53,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=647784.0, ans=0.125 2023-06-20 06:19:07,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=647844.0, ans=0.125 2023-06-20 06:19:19,905 INFO [train.py:996] (3/4) Epoch 4, batch 16500, loss[loss=0.2966, simple_loss=0.3709, pruned_loss=0.1112, over 21661.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3288, pruned_loss=0.09722, over 4279093.49 frames. ], batch size: 441, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:19:48,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=647964.0, ans=0.125 2023-06-20 06:20:03,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=648024.0, ans=0.0 2023-06-20 06:20:23,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=648024.0, ans=0.0 2023-06-20 06:20:30,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=648084.0, ans=0.125 2023-06-20 06:21:03,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.948e+02 3.472e+02 4.236e+02 9.691e+02, threshold=6.943e+02, percent-clipped=9.0 2023-06-20 06:21:05,575 INFO [train.py:996] (3/4) Epoch 4, batch 16550, loss[loss=0.28, simple_loss=0.3548, pruned_loss=0.1026, over 21699.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3286, pruned_loss=0.09378, over 4273396.50 frames. 
], batch size: 351, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:21:12,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=648204.0, ans=0.125 2023-06-20 06:21:29,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=648204.0, ans=0.125 2023-06-20 06:21:43,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-06-20 06:22:35,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=648444.0, ans=0.0 2023-06-20 06:22:59,759 INFO [train.py:996] (3/4) Epoch 4, batch 16600, loss[loss=0.2786, simple_loss=0.3659, pruned_loss=0.09566, over 21383.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3381, pruned_loss=0.09852, over 4273430.00 frames. ], batch size: 131, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:23:23,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648504.0, ans=0.1 2023-06-20 06:23:31,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-20 06:23:37,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=648564.0, ans=0.0 2023-06-20 06:23:47,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=648624.0, ans=0.0 2023-06-20 06:24:03,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-20 06:24:09,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648684.0, ans=0.1 2023-06-20 06:24:48,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.275e+02 4.120e+02 5.174e+02 8.172e+02, threshold=8.240e+02, percent-clipped=2.0 2023-06-20 06:24:49,932 INFO [train.py:996] (3/4) Epoch 4, batch 16650, loss[loss=0.279, simple_loss=0.3471, pruned_loss=0.1055, over 21707.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3496, pruned_loss=0.1025, over 4272806.13 frames. ], batch size: 298, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:24:50,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=648804.0, ans=0.0 2023-06-20 06:25:09,146 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:25:09,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=15.0 2023-06-20 06:25:29,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=648924.0, ans=0.0 2023-06-20 06:25:30,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=15.0 2023-06-20 06:25:33,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=648924.0, ans=0.125 2023-06-20 06:26:21,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=649044.0, ans=0.0 2023-06-20 06:26:23,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=649044.0, ans=0.0 2023-06-20 06:26:30,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-20 06:26:37,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=649044.0, ans=0.125 2023-06-20 06:26:41,418 INFO [train.py:996] (3/4) Epoch 4, batch 16700, loss[loss=0.2498, simple_loss=0.3551, pruned_loss=0.0723, over 21200.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3482, pruned_loss=0.1022, over 4272586.50 frames. ], batch size: 549, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:26:52,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-20 06:27:22,668 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:27:48,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=649284.0, ans=0.0 2023-06-20 06:27:58,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=649284.0, ans=0.0 2023-06-20 06:28:26,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.053e+02 3.657e+02 4.357e+02 8.504e+02, threshold=7.314e+02, percent-clipped=1.0 2023-06-20 06:28:28,505 INFO [train.py:996] (3/4) Epoch 4, batch 16750, loss[loss=0.3144, simple_loss=0.3948, pruned_loss=0.117, over 21895.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3525, pruned_loss=0.1055, over 4261543.15 frames. ], batch size: 372, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:28:40,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=649404.0, ans=0.125 2023-06-20 06:28:41,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=649404.0, ans=0.125 2023-06-20 06:28:52,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=649464.0, ans=0.125 2023-06-20 06:29:33,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=649524.0, ans=0.125 2023-06-20 06:30:12,925 INFO [train.py:996] (3/4) Epoch 4, batch 16800, loss[loss=0.2677, simple_loss=0.323, pruned_loss=0.1062, over 21442.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3574, pruned_loss=0.1056, over 4261671.86 frames. 
], batch size: 211, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:31:24,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=649884.0, ans=0.125 2023-06-20 06:31:58,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.215e+02 3.899e+02 5.075e+02 9.613e+02, threshold=7.798e+02, percent-clipped=9.0 2023-06-20 06:32:00,526 INFO [train.py:996] (3/4) Epoch 4, batch 16850, loss[loss=0.2709, simple_loss=0.3312, pruned_loss=0.1054, over 21643.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.354, pruned_loss=0.1062, over 4270681.93 frames. ], batch size: 263, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:32:03,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2023-06-20 06:33:37,196 INFO [train.py:996] (3/4) Epoch 4, batch 16900, loss[loss=0.2297, simple_loss=0.2956, pruned_loss=0.08191, over 20780.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3467, pruned_loss=0.1043, over 4279838.26 frames. ], batch size: 608, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:34:17,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=650364.0, ans=0.1 2023-06-20 06:34:32,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=650424.0, ans=0.125 2023-06-20 06:34:43,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=650424.0, ans=0.1 2023-06-20 06:35:17,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.495e+02 2.882e+02 3.347e+02 4.730e+02, threshold=5.764e+02, percent-clipped=0.0 2023-06-20 06:35:18,581 INFO [train.py:996] (3/4) Epoch 4, batch 16950, loss[loss=0.2729, simple_loss=0.333, pruned_loss=0.1064, over 21737.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3384, pruned_loss=0.1026, over 4281370.73 frames. ], batch size: 389, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:36:06,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=650664.0, ans=0.0 2023-06-20 06:36:06,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=650664.0, ans=0.2 2023-06-20 06:36:08,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=650664.0, ans=0.2 2023-06-20 06:36:26,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=650784.0, ans=0.0 2023-06-20 06:36:58,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=650844.0, ans=0.5 2023-06-20 06:37:01,047 INFO [train.py:996] (3/4) Epoch 4, batch 17000, loss[loss=0.2545, simple_loss=0.3193, pruned_loss=0.09478, over 21891.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3343, pruned_loss=0.1025, over 4288246.25 frames. 
], batch size: 124, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:37:35,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=650964.0, ans=0.125 2023-06-20 06:38:05,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=651024.0, ans=0.125 2023-06-20 06:38:18,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=651084.0, ans=0.125 2023-06-20 06:38:20,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=651084.0, ans=0.125 2023-06-20 06:38:57,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.930e+02 3.368e+02 4.146e+02 7.848e+02, threshold=6.737e+02, percent-clipped=5.0 2023-06-20 06:38:57,192 INFO [train.py:996] (3/4) Epoch 4, batch 17050, loss[loss=0.2852, simple_loss=0.3754, pruned_loss=0.09745, over 21784.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3416, pruned_loss=0.1051, over 4291163.52 frames. ], batch size: 414, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:39:15,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=651204.0, ans=10.0 2023-06-20 06:40:37,620 INFO [train.py:996] (3/4) Epoch 4, batch 17100, loss[loss=0.2578, simple_loss=0.3141, pruned_loss=0.1008, over 21477.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3413, pruned_loss=0.1054, over 4289031.39 frames. ], batch size: 211, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:40:59,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. limit=10.0 2023-06-20 06:41:11,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=651564.0, ans=0.125 2023-06-20 06:41:53,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=651744.0, ans=0.0 2023-06-20 06:41:54,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=651744.0, ans=0.125 2023-06-20 06:42:18,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651804.0, ans=0.1 2023-06-20 06:42:19,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.907e+02 3.321e+02 3.692e+02 6.035e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 06:42:19,623 INFO [train.py:996] (3/4) Epoch 4, batch 17150, loss[loss=0.2239, simple_loss=0.2794, pruned_loss=0.08414, over 21249.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3364, pruned_loss=0.1042, over 4289216.09 frames. 
], batch size: 608, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:42:26,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=651804.0, ans=0.125 2023-06-20 06:42:49,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=651864.0, ans=0.125 2023-06-20 06:42:56,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=651864.0, ans=0.125 2023-06-20 06:43:19,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=651984.0, ans=0.0 2023-06-20 06:43:27,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=651984.0, ans=0.125 2023-06-20 06:43:55,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=652044.0, ans=0.125 2023-06-20 06:44:07,964 INFO [train.py:996] (3/4) Epoch 4, batch 17200, loss[loss=0.2685, simple_loss=0.3416, pruned_loss=0.09772, over 21508.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3341, pruned_loss=0.1028, over 4286818.39 frames. ], batch size: 112, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:44:16,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=652104.0, ans=0.2 2023-06-20 06:44:28,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=652164.0, ans=0.125 2023-06-20 06:44:43,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=652164.0, ans=0.1 2023-06-20 06:44:54,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=652224.0, ans=0.2 2023-06-20 06:44:58,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=652224.0, ans=0.2 2023-06-20 06:45:57,639 INFO [train.py:996] (3/4) Epoch 4, batch 17250, loss[loss=0.3038, simple_loss=0.3783, pruned_loss=0.1147, over 21752.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3376, pruned_loss=0.1041, over 4293344.07 frames. 
], batch size: 298, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:45:59,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.975e+02 3.249e+02 4.201e+02 6.802e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-20 06:45:59,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=652404.0, ans=0.125 2023-06-20 06:46:09,668 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:46:31,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=652524.0, ans=0.09899494936611666 2023-06-20 06:46:49,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=652524.0, ans=0.125 2023-06-20 06:47:04,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=652584.0, ans=10.0 2023-06-20 06:47:32,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-20 06:47:36,634 INFO [train.py:996] (3/4) Epoch 4, batch 17300, loss[loss=0.2907, simple_loss=0.3495, pruned_loss=0.1159, over 21492.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3479, pruned_loss=0.1087, over 4287060.56 frames. ], batch size: 194, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:47:40,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-20 06:47:43,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=652704.0, ans=0.0 2023-06-20 06:47:43,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-20 06:47:48,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=652704.0, ans=0.0 2023-06-20 06:48:52,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=652884.0, ans=0.125 2023-06-20 06:49:17,042 INFO [train.py:996] (3/4) Epoch 4, batch 17350, loss[loss=0.2971, simple_loss=0.3691, pruned_loss=0.1126, over 19867.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3493, pruned_loss=0.1085, over 4279162.17 frames. ], batch size: 702, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:49:18,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.182e+02 3.780e+02 4.285e+02 5.975e+02, threshold=7.560e+02, percent-clipped=0.0 2023-06-20 06:50:40,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=653184.0, ans=0.125 2023-06-20 06:50:53,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=653244.0, ans=0.125 2023-06-20 06:50:58,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-20 06:50:59,144 INFO [train.py:996] (3/4) Epoch 4, batch 17400, loss[loss=0.2373, simple_loss=0.306, pruned_loss=0.08429, over 21640.00 frames. 
], tot_loss[loss=0.2762, simple_loss=0.345, pruned_loss=0.1037, over 4281518.02 frames. ], batch size: 247, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:52:11,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-20 06:52:14,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=653484.0, ans=0.125 2023-06-20 06:52:42,145 INFO [train.py:996] (3/4) Epoch 4, batch 17450, loss[loss=0.2472, simple_loss=0.3404, pruned_loss=0.07702, over 21698.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3404, pruned_loss=0.1002, over 4284288.46 frames. ], batch size: 414, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:52:43,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.974e+02 3.600e+02 4.231e+02 7.262e+02, threshold=7.200e+02, percent-clipped=0.0 2023-06-20 06:53:02,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653604.0, ans=0.1 2023-06-20 06:53:18,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=653664.0, ans=0.125 2023-06-20 06:54:00,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=653784.0, ans=0.125 2023-06-20 06:54:28,261 INFO [train.py:996] (3/4) Epoch 4, batch 17500, loss[loss=0.2719, simple_loss=0.3267, pruned_loss=0.1086, over 21777.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.335, pruned_loss=0.09726, over 4278951.87 frames. ], batch size: 247, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:54:29,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-20 06:54:36,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=653904.0, ans=0.04949747468305833 2023-06-20 06:54:52,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 06:55:06,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=653964.0, ans=0.0 2023-06-20 06:56:01,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=654204.0, ans=0.1 2023-06-20 06:56:02,395 INFO [train.py:996] (3/4) Epoch 4, batch 17550, loss[loss=0.2466, simple_loss=0.3313, pruned_loss=0.08092, over 21797.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3343, pruned_loss=0.09575, over 4283663.50 frames. 
], batch size: 316, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:56:02,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=654204.0, ans=0.125 2023-06-20 06:56:04,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 2.427e+02 2.904e+02 3.611e+02 6.733e+02, threshold=5.808e+02, percent-clipped=0.0 2023-06-20 06:56:38,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=654264.0, ans=0.2 2023-06-20 06:56:49,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=654264.0, ans=0.04949747468305833 2023-06-20 06:57:04,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-06-20 06:57:20,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-20 06:57:36,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. limit=10.0 2023-06-20 06:57:38,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-20 06:57:43,282 INFO [train.py:996] (3/4) Epoch 4, batch 17600, loss[loss=0.2899, simple_loss=0.3519, pruned_loss=0.1139, over 21828.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3365, pruned_loss=0.09607, over 4281328.09 frames. ], batch size: 282, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 06:58:02,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=22.5 2023-06-20 06:58:40,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=654624.0, ans=0.1 2023-06-20 06:58:56,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-20 06:59:32,097 INFO [train.py:996] (3/4) Epoch 4, batch 17650, loss[loss=0.183, simple_loss=0.2332, pruned_loss=0.06645, over 21233.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3348, pruned_loss=0.09673, over 4275974.44 frames. ], batch size: 159, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:59:40,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.037e+02 3.780e+02 4.490e+02 8.251e+02, threshold=7.559e+02, percent-clipped=12.0 2023-06-20 06:59:53,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=654804.0, ans=0.0 2023-06-20 06:59:54,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=654804.0, ans=0.0 2023-06-20 07:00:36,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=654984.0, ans=0.1 2023-06-20 07:01:18,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-06-20 07:01:20,959 INFO [train.py:996] (3/4) Epoch 4, batch 17700, loss[loss=0.2799, simple_loss=0.356, pruned_loss=0.1019, over 21945.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3294, pruned_loss=0.09385, over 4274947.65 frames. ], batch size: 317, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:02:58,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=655344.0, ans=0.0 2023-06-20 07:03:06,198 INFO [train.py:996] (3/4) Epoch 4, batch 17750, loss[loss=0.2831, simple_loss=0.3576, pruned_loss=0.1043, over 20648.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3385, pruned_loss=0.09923, over 4273135.39 frames. ], batch size: 607, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:03:09,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.874e+02 3.454e+02 4.122e+02 5.655e+02, threshold=6.909e+02, percent-clipped=0.0 2023-06-20 07:04:02,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=655584.0, ans=0.0 2023-06-20 07:04:50,169 INFO [train.py:996] (3/4) Epoch 4, batch 17800, loss[loss=0.2478, simple_loss=0.3174, pruned_loss=0.08909, over 21824.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3365, pruned_loss=0.09759, over 4271116.19 frames. ], batch size: 282, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:05:01,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=655704.0, ans=0.125 2023-06-20 07:05:17,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=655764.0, ans=0.125 2023-06-20 07:05:19,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.63 vs. limit=10.0 2023-06-20 07:05:24,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=655824.0, ans=0.07 2023-06-20 07:05:24,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=655824.0, ans=0.125 2023-06-20 07:05:46,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-20 07:06:33,108 INFO [train.py:996] (3/4) Epoch 4, batch 17850, loss[loss=0.2872, simple_loss=0.3503, pruned_loss=0.112, over 21473.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3363, pruned_loss=0.09752, over 4275488.29 frames. ], batch size: 131, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:06:36,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.792e+02 3.277e+02 4.116e+02 6.981e+02, threshold=6.554e+02, percent-clipped=1.0 2023-06-20 07:07:26,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=656124.0, ans=0.0 2023-06-20 07:07:49,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=656184.0, ans=0.05 2023-06-20 07:08:17,951 INFO [train.py:996] (3/4) Epoch 4, batch 17900, loss[loss=0.2562, simple_loss=0.3293, pruned_loss=0.09155, over 21290.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3415, pruned_loss=0.1002, over 4273974.84 frames. 
], batch size: 159, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:08:57,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=656364.0, ans=0.2 2023-06-20 07:08:58,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-20 07:09:40,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-20 07:09:45,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=656544.0, ans=0.125 2023-06-20 07:10:01,318 INFO [train.py:996] (3/4) Epoch 4, batch 17950, loss[loss=0.2723, simple_loss=0.36, pruned_loss=0.09233, over 21572.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3415, pruned_loss=0.09654, over 4275201.96 frames. ], batch size: 441, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:10:04,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.785e+02 3.219e+02 3.835e+02 8.514e+02, threshold=6.438e+02, percent-clipped=3.0 2023-06-20 07:10:04,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=656604.0, ans=0.025 2023-06-20 07:10:31,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656664.0, ans=0.1 2023-06-20 07:10:31,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=656664.0, ans=0.125 2023-06-20 07:10:52,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=656664.0, ans=0.035 2023-06-20 07:11:29,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656844.0, ans=0.1 2023-06-20 07:11:36,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=656844.0, ans=0.125 2023-06-20 07:11:44,087 INFO [train.py:996] (3/4) Epoch 4, batch 18000, loss[loss=0.2335, simple_loss=0.2954, pruned_loss=0.08585, over 16091.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3337, pruned_loss=0.09518, over 4275947.40 frames. ], batch size: 67, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:11:44,088 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 07:12:05,673 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2767, simple_loss=0.3741, pruned_loss=0.08966, over 1796401.00 frames. 
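[Note on the two recurring kinds of entries in this log: the "ScheduledFloat: name=..., batch_count=..., ans=..." lines report module hyperparameters (balancer probabilities, skip rates, dropout p) whose values follow a schedule over the training batch count, and the "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." lines report adaptive gradient clipping against the recent distribution of gradient norms. The Python sketch below illustrates both mechanisms so the log lines are easier to read. It is a simplified reconstruction under stated assumptions, not the actual scaling.py/optim.py code from this run; every identifier in it (PiecewiseLinear, log_scheduled_float, QuartileClipper) is illustrative.]

import bisect
import logging

import torch

logging.basicConfig(level=logging.INFO, format="%(asctime)s INFO %(message)s")


class PiecewiseLinear:
    """A float interpolated linearly between (batch_count, value) breakpoints."""

    def __init__(self, *points):
        self.xs = [float(p[0]) for p in points]
        self.ys = [float(p[1]) for p in points]

    def __call__(self, x):
        if x <= self.xs[0]:
            return self.ys[0]
        if x >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, x) - 1
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def log_scheduled_float(name, schedule, batch_count):
    # Mirrors the shape of the "ScheduledFloat: name=..., ans=..." entries above.
    ans = schedule(batch_count)
    logging.info("ScheduledFloat: name=%s, batch_count=%.1f, ans=%s",
                 name, batch_count, ans)
    return ans


class QuartileClipper:
    """Clip the global grad norm at clipping_scale times the median of recent norms."""

    def __init__(self, params, clipping_scale=2.0, history=128):
        self.params = list(params)
        self.clipping_scale = clipping_scale
        self.history = history   # how many recent grad norms to keep
        self.norms = []
        self.clipped = 0
        self.steps = 0

    def step(self):
        # Global L2 norm of all gradients for this batch.
        grads = [p.grad.detach().flatten() for p in self.params if p.grad is not None]
        total = torch.cat(grads).norm().item() if grads else 0.0
        self.norms = (self.norms + [total])[-self.history:]
        # min / 25% / median / 75% / max of the recent norms, as in the log lines.
        q = sorted(self.norms)
        quartiles = [q[int(r * (len(q) - 1))] for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]  # e.g. 2x the median norm
        self.steps += 1
        if total > threshold and total > 0.0:
            self.clipped += 1
            for p in self.params:
                if p.grad is not None:
                    p.grad.mul_(threshold / total)
        logging.info(
            "Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, "
            "percent-clipped=%.1f",
            self.clipping_scale,
            " ".join("%.3e" % v for v in quartiles),
            threshold,
            100.0 * self.clipped / self.steps,
        )


# Example: a probability scheduled from 0.5 down to 0.125 over early training.
# By batch_count=651084 the schedule has long since reached its floor, matching
# the many "ans=0.125" entries above (breakpoints here are assumed, not taken
# from the run's config):
prob = log_scheduled_float(
    "encoder.example.balancer.prob",
    PiecewiseLinear((0.0, 0.5), (4000.0, 0.125)),
    651084.0,
)

[With Clipping_scale=2.0 as in this run, the logged threshold tracks roughly twice the median of recent gradient norms, and percent-clipped reports how often batches exceeded it; the exact statistic and window the real optimizer uses are not recoverable from the log alone.]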
2023-06-20 07:12:05,674 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 07:12:06,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=656904.0, ans=0.0 2023-06-20 07:12:30,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=656904.0, ans=0.125 2023-06-20 07:12:58,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=657024.0, ans=0.2 2023-06-20 07:12:58,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=657024.0, ans=0.02 2023-06-20 07:12:59,963 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:13:12,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=657084.0, ans=0.125 2023-06-20 07:13:22,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657084.0, ans=0.1 2023-06-20 07:13:55,469 INFO [train.py:996] (3/4) Epoch 4, batch 18050, loss[loss=0.2522, simple_loss=0.308, pruned_loss=0.09817, over 21422.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3287, pruned_loss=0.09471, over 4274848.27 frames. ], batch size: 211, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:14:03,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.056e+02 3.833e+02 4.791e+02 8.139e+02, threshold=7.666e+02, percent-clipped=8.0 2023-06-20 07:14:18,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657204.0, ans=0.125 2023-06-20 07:14:24,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-20 07:15:07,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.68 vs. limit=6.0 2023-06-20 07:15:41,209 INFO [train.py:996] (3/4) Epoch 4, batch 18100, loss[loss=0.2844, simple_loss=0.377, pruned_loss=0.09593, over 21621.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3329, pruned_loss=0.09661, over 4274386.49 frames. ], batch size: 414, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:15:56,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657504.0, ans=0.125 2023-06-20 07:16:10,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=657564.0, ans=0.125 2023-06-20 07:16:36,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=657684.0, ans=0.2 2023-06-20 07:16:58,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657744.0, ans=0.125 2023-06-20 07:17:05,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657744.0, ans=0.125 2023-06-20 07:17:26,416 INFO [train.py:996] (3/4) Epoch 4, batch 18150, loss[loss=0.2551, simple_loss=0.3211, pruned_loss=0.0945, over 21713.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3354, pruned_loss=0.09663, over 4272490.56 frames. 
], batch size: 282, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:17:31,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.784e+02 3.397e+02 4.397e+02 8.554e+02, threshold=6.794e+02, percent-clipped=1.0 2023-06-20 07:17:36,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=657804.0, ans=0.125 2023-06-20 07:17:47,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=657864.0, ans=0.125 2023-06-20 07:19:02,649 INFO [train.py:996] (3/4) Epoch 4, batch 18200, loss[loss=0.2659, simple_loss=0.3145, pruned_loss=0.1086, over 21901.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3298, pruned_loss=0.09652, over 4274909.45 frames. ], batch size: 107, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:19:08,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=658104.0, ans=0.0 2023-06-20 07:20:11,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-20 07:20:31,241 INFO [train.py:996] (3/4) Epoch 4, batch 18250, loss[loss=0.2873, simple_loss=0.3923, pruned_loss=0.09111, over 19960.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3223, pruned_loss=0.09396, over 4263178.78 frames. ], batch size: 702, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:20:39,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-20 07:20:41,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.684e+02 3.179e+02 3.917e+02 6.277e+02, threshold=6.359e+02, percent-clipped=0.0 2023-06-20 07:21:30,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=658524.0, ans=0.125 2023-06-20 07:22:08,069 INFO [train.py:996] (3/4) Epoch 4, batch 18300, loss[loss=0.2681, simple_loss=0.3251, pruned_loss=0.1055, over 21800.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3208, pruned_loss=0.09351, over 4260638.31 frames. ], batch size: 112, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:23:10,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=658824.0, ans=0.025 2023-06-20 07:23:49,843 INFO [train.py:996] (3/4) Epoch 4, batch 18350, loss[loss=0.2444, simple_loss=0.3039, pruned_loss=0.09246, over 21269.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3273, pruned_loss=0.09395, over 4268095.92 frames. 
], batch size: 144, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 07:23:51,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=659004.0, ans=0.1 2023-06-20 07:24:00,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.835e+02 3.516e+02 4.714e+02 7.993e+02, threshold=7.032e+02, percent-clipped=4.0 2023-06-20 07:24:00,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=659004.0, ans=0.0 2023-06-20 07:24:20,890 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:24:45,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=659124.0, ans=0.125 2023-06-20 07:24:51,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=659124.0, ans=0.125 2023-06-20 07:25:38,224 INFO [train.py:996] (3/4) Epoch 4, batch 18400, loss[loss=0.2231, simple_loss=0.2994, pruned_loss=0.07337, over 21541.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3234, pruned_loss=0.09291, over 4270339.96 frames. ], batch size: 230, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:26:08,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=659304.0, ans=0.2 2023-06-20 07:27:27,665 INFO [train.py:996] (3/4) Epoch 4, batch 18450, loss[loss=0.2289, simple_loss=0.297, pruned_loss=0.08046, over 21690.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3188, pruned_loss=0.0887, over 4266843.22 frames. ], batch size: 298, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:27:37,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.607e+02 3.111e+02 4.080e+02 7.142e+02, threshold=6.222e+02, percent-clipped=1.0 2023-06-20 07:28:18,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=659724.0, ans=0.2 2023-06-20 07:28:29,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-20 07:29:09,376 INFO [train.py:996] (3/4) Epoch 4, batch 18500, loss[loss=0.2063, simple_loss=0.2679, pruned_loss=0.07231, over 21763.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3143, pruned_loss=0.08755, over 4258014.35 frames. ], batch size: 118, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:29:14,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=659904.0, ans=0.2 2023-06-20 07:29:37,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=659964.0, ans=0.1 2023-06-20 07:30:41,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=660144.0, ans=0.2 2023-06-20 07:30:51,276 INFO [train.py:996] (3/4) Epoch 4, batch 18550, loss[loss=0.2235, simple_loss=0.2895, pruned_loss=0.07876, over 21191.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3142, pruned_loss=0.08654, over 4253634.29 frames. 
], batch size: 176, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:31:01,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.666e+02 3.227e+02 3.857e+02 6.093e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-20 07:31:01,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=660204.0, ans=0.0 2023-06-20 07:31:49,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=660384.0, ans=0.0 2023-06-20 07:32:19,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=660444.0, ans=0.0 2023-06-20 07:32:39,748 INFO [train.py:996] (3/4) Epoch 4, batch 18600, loss[loss=0.2742, simple_loss=0.3406, pruned_loss=0.1039, over 21292.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3126, pruned_loss=0.08754, over 4259350.95 frames. ], batch size: 471, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:33:06,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=660564.0, ans=0.125 2023-06-20 07:33:33,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2023-06-20 07:34:16,709 INFO [train.py:996] (3/4) Epoch 4, batch 18650, loss[loss=0.2402, simple_loss=0.3005, pruned_loss=0.08999, over 20233.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3125, pruned_loss=0.08855, over 4260586.76 frames. ], batch size: 703, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:34:26,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.903e+02 3.307e+02 4.145e+02 6.218e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-20 07:34:48,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=660864.0, ans=0.125 2023-06-20 07:35:09,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=660984.0, ans=0.09899494936611666 2023-06-20 07:35:19,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=660984.0, ans=0.5 2023-06-20 07:35:22,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 07:35:28,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=661044.0, ans=0.125 2023-06-20 07:35:37,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=661044.0, ans=0.125 2023-06-20 07:35:47,870 INFO [train.py:996] (3/4) Epoch 4, batch 18700, loss[loss=0.2148, simple_loss=0.2684, pruned_loss=0.08059, over 20308.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.311, pruned_loss=0.09059, over 4267133.01 frames. ], batch size: 703, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:35:49,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. 
limit=22.5 2023-06-20 07:36:08,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=661104.0, ans=0.125 2023-06-20 07:36:19,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-20 07:36:23,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=661164.0, ans=0.0 2023-06-20 07:36:25,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-20 07:36:47,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=661224.0, ans=0.125 2023-06-20 07:36:48,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.49 vs. limit=15.0 2023-06-20 07:36:52,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=661284.0, ans=0.2 2023-06-20 07:37:17,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=661344.0, ans=0.0 2023-06-20 07:37:30,154 INFO [train.py:996] (3/4) Epoch 4, batch 18750, loss[loss=0.2776, simple_loss=0.3469, pruned_loss=0.1041, over 21762.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3135, pruned_loss=0.09329, over 4261593.80 frames. ], batch size: 298, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:37:45,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.659e+02 3.125e+02 3.916e+02 7.035e+02, threshold=6.249e+02, percent-clipped=1.0 2023-06-20 07:37:45,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=661404.0, ans=0.125 2023-06-20 07:37:49,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=661404.0, ans=0.09899494936611666 2023-06-20 07:38:01,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.16 vs. limit=15.0 2023-06-20 07:38:19,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=661524.0, ans=0.1 2023-06-20 07:38:25,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=661524.0, ans=15.0 2023-06-20 07:38:46,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=661644.0, ans=0.125 2023-06-20 07:38:47,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-20 07:38:48,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=661644.0, ans=0.05 2023-06-20 07:39:06,557 INFO [train.py:996] (3/4) Epoch 4, batch 18800, loss[loss=0.2358, simple_loss=0.3249, pruned_loss=0.07334, over 21640.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3193, pruned_loss=0.09399, over 4260783.51 frames. 
], batch size: 414, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:40:55,485 INFO [train.py:996] (3/4) Epoch 4, batch 18850, loss[loss=0.2088, simple_loss=0.2631, pruned_loss=0.07718, over 21314.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3147, pruned_loss=0.08819, over 4247441.90 frames. ], batch size: 144, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:41:00,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.538e+02 3.009e+02 3.652e+02 6.341e+02, threshold=6.019e+02, percent-clipped=1.0 2023-06-20 07:41:06,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=662004.0, ans=0.125 2023-06-20 07:41:07,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=662004.0, ans=0.125 2023-06-20 07:42:32,243 INFO [train.py:996] (3/4) Epoch 4, batch 18900, loss[loss=0.2347, simple_loss=0.2749, pruned_loss=0.09726, over 20312.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3106, pruned_loss=0.08811, over 4255114.15 frames. ], batch size: 703, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:42:58,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=662364.0, ans=0.0 2023-06-20 07:43:08,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=662364.0, ans=0.125 2023-06-20 07:43:18,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=662424.0, ans=0.0 2023-06-20 07:43:24,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=662424.0, ans=0.0 2023-06-20 07:43:45,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=662484.0, ans=0.07 2023-06-20 07:43:46,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=662544.0, ans=0.125 2023-06-20 07:43:49,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662544.0, ans=0.1 2023-06-20 07:44:05,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=662544.0, ans=0.125 2023-06-20 07:44:10,278 INFO [train.py:996] (3/4) Epoch 4, batch 18950, loss[loss=0.201, simple_loss=0.2787, pruned_loss=0.0617, over 20807.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3125, pruned_loss=0.09049, over 4264115.85 frames. ], batch size: 608, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 07:44:25,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.815e+02 3.147e+02 3.726e+02 6.285e+02, threshold=6.294e+02, percent-clipped=0.0 2023-06-20 07:44:37,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=662664.0, ans=0.0 2023-06-20 07:45:16,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-20 07:45:22,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:45:31,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
limit=15.0 2023-06-20 07:45:32,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=662844.0, ans=0.125 2023-06-20 07:45:32,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=662844.0, ans=0.2 2023-06-20 07:46:03,979 INFO [train.py:996] (3/4) Epoch 4, batch 19000, loss[loss=0.2847, simple_loss=0.3606, pruned_loss=0.1044, over 21718.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3228, pruned_loss=0.09275, over 4267530.98 frames. ], batch size: 351, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:46:04,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=662904.0, ans=0.125 2023-06-20 07:46:12,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=662904.0, ans=0.0 2023-06-20 07:46:17,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=662904.0, ans=0.0 2023-06-20 07:46:28,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=662964.0, ans=0.125 2023-06-20 07:46:31,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=662964.0, ans=0.125 2023-06-20 07:46:43,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=663024.0, ans=0.0 2023-06-20 07:46:52,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=663084.0, ans=0.125 2023-06-20 07:47:28,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663144.0, ans=0.1 2023-06-20 07:47:47,413 INFO [train.py:996] (3/4) Epoch 4, batch 19050, loss[loss=0.2836, simple_loss=0.3435, pruned_loss=0.1118, over 21884.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3278, pruned_loss=0.09731, over 4275365.25 frames. ], batch size: 118, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:47:53,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.224e+02 3.773e+02 4.391e+02 1.056e+03, threshold=7.547e+02, percent-clipped=6.0 2023-06-20 07:47:57,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-20 07:47:59,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=663204.0, ans=0.125 2023-06-20 07:48:05,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-20 07:48:11,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=663264.0, ans=0.05 2023-06-20 07:48:44,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-20 07:48:46,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. 
limit=15.0 2023-06-20 07:49:02,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=663444.0, ans=0.125 2023-06-20 07:49:28,668 INFO [train.py:996] (3/4) Epoch 4, batch 19100, loss[loss=0.2495, simple_loss=0.2983, pruned_loss=0.1004, over 21143.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3268, pruned_loss=0.09794, over 4269562.37 frames. ], batch size: 143, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:49:30,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=663504.0, ans=0.125 2023-06-20 07:49:53,361 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:49:53,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=663564.0, ans=0.0 2023-06-20 07:50:24,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663684.0, ans=0.1 2023-06-20 07:50:32,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=663684.0, ans=10.0 2023-06-20 07:51:10,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=663744.0, ans=0.125 2023-06-20 07:51:14,820 INFO [train.py:996] (3/4) Epoch 4, batch 19150, loss[loss=0.3275, simple_loss=0.427, pruned_loss=0.114, over 21157.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3313, pruned_loss=0.09963, over 4267105.72 frames. ], batch size: 548, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:51:21,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.031e+02 3.382e+02 4.089e+02 6.377e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-20 07:51:22,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=663804.0, ans=0.125 2023-06-20 07:51:30,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=663864.0, ans=0.0 2023-06-20 07:52:04,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=663924.0, ans=0.0 2023-06-20 07:52:39,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=663984.0, ans=0.1 2023-06-20 07:52:54,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-20 07:52:57,817 INFO [train.py:996] (3/4) Epoch 4, batch 19200, loss[loss=0.2913, simple_loss=0.3713, pruned_loss=0.1057, over 21231.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3415, pruned_loss=0.1011, over 4257879.18 frames. 
], batch size: 143, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 07:53:11,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664104.0, ans=0.1 2023-06-20 07:53:15,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664164.0, ans=0.1 2023-06-20 07:53:18,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=664164.0, ans=0.125 2023-06-20 07:53:28,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=664164.0, ans=0.125 2023-06-20 07:54:29,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664344.0, ans=0.1 2023-06-20 07:54:37,003 INFO [train.py:996] (3/4) Epoch 4, batch 19250, loss[loss=0.2079, simple_loss=0.2952, pruned_loss=0.0603, over 21700.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3392, pruned_loss=0.09441, over 4255424.50 frames. ], batch size: 230, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:54:44,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.530e+02 3.089e+02 4.105e+02 6.871e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-20 07:55:33,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=664524.0, ans=0.0 2023-06-20 07:56:19,716 INFO [train.py:996] (3/4) Epoch 4, batch 19300, loss[loss=0.2621, simple_loss=0.338, pruned_loss=0.09308, over 21513.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3366, pruned_loss=0.09425, over 4260976.40 frames. ], batch size: 471, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:56:34,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-20 07:57:04,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-20 07:57:22,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=664824.0, ans=0.125 2023-06-20 07:57:33,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=664884.0, ans=0.125 2023-06-20 07:58:03,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-20 07:58:04,211 INFO [train.py:996] (3/4) Epoch 4, batch 19350, loss[loss=0.29, simple_loss=0.3689, pruned_loss=0.1055, over 21549.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3332, pruned_loss=0.09155, over 4269451.45 frames. 
], batch size: 473, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:58:12,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.563e+02 3.117e+02 3.921e+02 9.439e+02, threshold=6.235e+02, percent-clipped=6.0 2023-06-20 07:58:41,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=665124.0, ans=0.0 2023-06-20 07:58:52,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=665124.0, ans=0.125 2023-06-20 07:58:55,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=665124.0, ans=0.125 2023-06-20 07:58:57,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=665124.0, ans=0.125 2023-06-20 07:59:25,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=665184.0, ans=0.2 2023-06-20 07:59:46,536 INFO [train.py:996] (3/4) Epoch 4, batch 19400, loss[loss=0.2714, simple_loss=0.3348, pruned_loss=0.104, over 21873.00 frames. ], tot_loss[loss=0.253, simple_loss=0.328, pruned_loss=0.08901, over 4273970.66 frames. ], batch size: 414, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:59:54,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665304.0, ans=0.1 2023-06-20 08:01:02,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=665484.0, ans=10.0 2023-06-20 08:01:09,366 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:01:12,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=665544.0, ans=0.0 2023-06-20 08:01:23,613 INFO [train.py:996] (3/4) Epoch 4, batch 19450, loss[loss=0.2561, simple_loss=0.3041, pruned_loss=0.1041, over 21322.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3264, pruned_loss=0.09129, over 4282806.43 frames. ], batch size: 548, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:01:31,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.618e+02 3.250e+02 3.891e+02 5.569e+02, threshold=6.499e+02, percent-clipped=0.0 2023-06-20 08:03:06,364 INFO [train.py:996] (3/4) Epoch 4, batch 19500, loss[loss=0.2284, simple_loss=0.3028, pruned_loss=0.07705, over 21663.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3229, pruned_loss=0.09305, over 4276957.39 frames. ], batch size: 332, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:03:18,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665904.0, ans=0.1 2023-06-20 08:03:40,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=665964.0, ans=0.125 2023-06-20 08:04:45,098 INFO [train.py:996] (3/4) Epoch 4, batch 19550, loss[loss=0.2215, simple_loss=0.3178, pruned_loss=0.06258, over 21700.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3183, pruned_loss=0.09108, over 4278621.56 frames. 
], batch size: 298, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:04:53,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 2.884e+02 3.306e+02 4.120e+02 1.024e+03, threshold=6.612e+02, percent-clipped=7.0 2023-06-20 08:05:00,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=666264.0, ans=0.2 2023-06-20 08:05:07,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=666264.0, ans=0.0 2023-06-20 08:05:55,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-20 08:06:02,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-20 08:06:16,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=666444.0, ans=0.0 2023-06-20 08:06:27,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=666504.0, ans=0.125 2023-06-20 08:06:29,084 INFO [train.py:996] (3/4) Epoch 4, batch 19600, loss[loss=0.1662, simple_loss=0.2317, pruned_loss=0.05038, over 17614.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.32, pruned_loss=0.09178, over 4278269.62 frames. ], batch size: 60, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:06:54,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=666564.0, ans=0.0 2023-06-20 08:07:30,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=666624.0, ans=0.125 2023-06-20 08:07:35,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=666684.0, ans=0.125 2023-06-20 08:07:57,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=666744.0, ans=0.025 2023-06-20 08:07:57,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=666744.0, ans=0.125 2023-06-20 08:08:02,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=666744.0, ans=0.0 2023-06-20 08:08:07,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=666804.0, ans=0.125 2023-06-20 08:08:08,673 INFO [train.py:996] (3/4) Epoch 4, batch 19650, loss[loss=0.3022, simple_loss=0.3544, pruned_loss=0.125, over 21660.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3255, pruned_loss=0.0966, over 4277073.64 frames. 
], batch size: 389, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:08:17,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.087e+02 3.531e+02 4.155e+02 7.951e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-20 08:08:40,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=666864.0, ans=0.1 2023-06-20 08:08:41,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=666864.0, ans=0.125 2023-06-20 08:09:09,517 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:09:12,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=666924.0, ans=0.125 2023-06-20 08:09:26,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=666984.0, ans=0.0 2023-06-20 08:09:42,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=667044.0, ans=0.0 2023-06-20 08:09:44,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=667044.0, ans=0.0 2023-06-20 08:09:59,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=667104.0, ans=0.125 2023-06-20 08:10:00,463 INFO [train.py:996] (3/4) Epoch 4, batch 19700, loss[loss=0.2197, simple_loss=0.2978, pruned_loss=0.07073, over 21587.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3299, pruned_loss=0.09806, over 4274811.03 frames. ], batch size: 230, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:10:01,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=667104.0, ans=0.0 2023-06-20 08:10:24,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=667104.0, ans=0.0 2023-06-20 08:10:43,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=667224.0, ans=0.125 2023-06-20 08:10:49,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=667224.0, ans=0.125 2023-06-20 08:11:19,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=667344.0, ans=0.125 2023-06-20 08:11:45,283 INFO [train.py:996] (3/4) Epoch 4, batch 19750, loss[loss=0.3684, simple_loss=0.4267, pruned_loss=0.155, over 21546.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3357, pruned_loss=0.09831, over 4260284.32 frames. ], batch size: 507, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:12:04,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.193e+02 3.734e+02 5.091e+02 8.572e+02, threshold=7.467e+02, percent-clipped=4.0 2023-06-20 08:12:31,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-20 08:12:50,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-20 08:13:08,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=667644.0, ans=0.125 2023-06-20 08:13:12,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=667644.0, ans=0.0 2023-06-20 08:13:33,727 INFO [train.py:996] (3/4) Epoch 4, batch 19800, loss[loss=0.2129, simple_loss=0.2927, pruned_loss=0.06658, over 21828.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3364, pruned_loss=0.09939, over 4261766.03 frames. ], batch size: 316, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:13:47,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=667704.0, ans=0.95 2023-06-20 08:14:10,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-20 08:14:49,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=667944.0, ans=0.125 2023-06-20 08:15:12,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=667944.0, ans=0.125 2023-06-20 08:15:22,728 INFO [train.py:996] (3/4) Epoch 4, batch 19850, loss[loss=0.2964, simple_loss=0.3519, pruned_loss=0.1204, over 19878.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3294, pruned_loss=0.0944, over 4251725.45 frames. ], batch size: 702, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:15:30,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.674e+02 3.222e+02 4.278e+02 7.795e+02, threshold=6.444e+02, percent-clipped=1.0 2023-06-20 08:16:25,563 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:16:59,862 INFO [train.py:996] (3/4) Epoch 4, batch 19900, loss[loss=0.2219, simple_loss=0.2823, pruned_loss=0.08073, over 21244.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.33, pruned_loss=0.09161, over 4248325.49 frames. ], batch size: 176, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:17:28,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=668364.0, ans=0.035 2023-06-20 08:17:57,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=668484.0, ans=0.125 2023-06-20 08:18:47,715 INFO [train.py:996] (3/4) Epoch 4, batch 19950, loss[loss=0.2498, simple_loss=0.2996, pruned_loss=0.1, over 21803.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3242, pruned_loss=0.09209, over 4255522.57 frames. 
], batch size: 102, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:18:55,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=668604.0, ans=0.125 2023-06-20 08:18:56,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.748e+02 3.104e+02 3.905e+02 6.692e+02, threshold=6.208e+02, percent-clipped=1.0 2023-06-20 08:18:58,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=668604.0, ans=0.125 2023-06-20 08:19:26,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-20 08:19:41,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-20 08:19:43,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=668784.0, ans=0.125 2023-06-20 08:19:57,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=668784.0, ans=0.0 2023-06-20 08:20:26,094 INFO [train.py:996] (3/4) Epoch 4, batch 20000, loss[loss=0.253, simple_loss=0.3165, pruned_loss=0.09475, over 21410.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3266, pruned_loss=0.09296, over 4262238.41 frames. ], batch size: 143, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:20:52,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=668964.0, ans=0.2 2023-06-20 08:21:22,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=669024.0, ans=0.0 2023-06-20 08:21:36,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=669084.0, ans=0.125 2023-06-20 08:21:49,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=669084.0, ans=10.0 2023-06-20 08:22:13,369 INFO [train.py:996] (3/4) Epoch 4, batch 20050, loss[loss=0.2512, simple_loss=0.317, pruned_loss=0.09267, over 21803.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3282, pruned_loss=0.09588, over 4262539.39 frames. ], batch size: 298, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:22:21,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.833e+02 3.313e+02 4.048e+02 6.603e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-20 08:23:34,084 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:23:57,176 INFO [train.py:996] (3/4) Epoch 4, batch 20100, loss[loss=0.2638, simple_loss=0.33, pruned_loss=0.09877, over 21906.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.331, pruned_loss=0.09844, over 4269685.89 frames. ], batch size: 316, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:24:04,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=669504.0, ans=0.1 2023-06-20 08:24:05,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. 
limit=6.0 2023-06-20 08:24:45,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-20 08:25:13,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=669684.0, ans=0.125 2023-06-20 08:25:21,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-20 08:25:27,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=669744.0, ans=0.125 2023-06-20 08:25:38,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=669744.0, ans=0.1 2023-06-20 08:25:42,875 INFO [train.py:996] (3/4) Epoch 4, batch 20150, loss[loss=0.3096, simple_loss=0.3719, pruned_loss=0.1237, over 21478.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3406, pruned_loss=0.1016, over 4269242.84 frames. ], batch size: 131, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:25:50,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=669804.0, ans=0.125 2023-06-20 08:25:53,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.395e+02 3.883e+02 5.021e+02 8.143e+02, threshold=7.766e+02, percent-clipped=4.0 2023-06-20 08:26:03,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=8.0 2023-06-20 08:27:29,933 INFO [train.py:996] (3/4) Epoch 4, batch 20200, loss[loss=0.2699, simple_loss=0.3656, pruned_loss=0.08713, over 21676.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3474, pruned_loss=0.105, over 4272357.60 frames. ], batch size: 247, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:27:40,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=670104.0, ans=0.0 2023-06-20 08:27:43,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-20 08:28:35,371 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:29:01,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=670344.0, ans=0.125 2023-06-20 08:29:18,097 INFO [train.py:996] (3/4) Epoch 4, batch 20250, loss[loss=0.2515, simple_loss=0.3363, pruned_loss=0.08329, over 21662.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3461, pruned_loss=0.1031, over 4270750.86 frames. ], batch size: 389, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:29:33,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.002e+02 3.510e+02 4.411e+02 6.052e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-20 08:30:37,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-20 08:30:56,361 INFO [train.py:996] (3/4) Epoch 4, batch 20300, loss[loss=0.214, simple_loss=0.2615, pruned_loss=0.08326, over 16176.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3428, pruned_loss=0.09958, over 4264352.22 frames. 
], batch size: 61, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:31:00,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-20 08:32:35,110 INFO [train.py:996] (3/4) Epoch 4, batch 20350, loss[loss=0.2974, simple_loss=0.3464, pruned_loss=0.1242, over 21317.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.342, pruned_loss=0.09987, over 4261243.79 frames. ], batch size: 159, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:32:49,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.818e+02 3.132e+02 3.924e+02 7.054e+02, threshold=6.264e+02, percent-clipped=1.0 2023-06-20 08:33:03,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671064.0, ans=0.125 2023-06-20 08:33:04,977 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:33:27,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=671124.0, ans=0.2 2023-06-20 08:33:57,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671244.0, ans=0.1 2023-06-20 08:34:04,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=671244.0, ans=0.0 2023-06-20 08:34:05,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=671244.0, ans=0.2 2023-06-20 08:34:13,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671244.0, ans=0.1 2023-06-20 08:34:21,794 INFO [train.py:996] (3/4) Epoch 4, batch 20400, loss[loss=0.3766, simple_loss=0.4161, pruned_loss=0.1685, over 21408.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3453, pruned_loss=0.1035, over 4263223.49 frames. ], batch size: 508, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 08:34:50,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-20 08:35:00,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=671424.0, ans=0.125 2023-06-20 08:35:20,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=671484.0, ans=0.125 2023-06-20 08:35:25,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=671484.0, ans=0.0 2023-06-20 08:35:33,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=671484.0, ans=0.125 2023-06-20 08:35:35,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. limit=6.0 2023-06-20 08:35:42,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=671544.0, ans=0.0 2023-06-20 08:36:04,133 INFO [train.py:996] (3/4) Epoch 4, batch 20450, loss[loss=0.2071, simple_loss=0.2775, pruned_loss=0.06838, over 16305.00 frames. 
], tot_loss[loss=0.2798, simple_loss=0.347, pruned_loss=0.1063, over 4252202.91 frames. ], batch size: 61, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:36:20,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.947e+02 3.456e+02 4.255e+02 7.158e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-20 08:36:35,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=671664.0, ans=0.0 2023-06-20 08:36:55,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=671724.0, ans=0.0 2023-06-20 08:37:24,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=671844.0, ans=0.125 2023-06-20 08:37:25,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=671844.0, ans=0.2 2023-06-20 08:37:44,291 INFO [train.py:996] (3/4) Epoch 4, batch 20500, loss[loss=0.2574, simple_loss=0.3112, pruned_loss=0.1018, over 21316.00 frames. ], tot_loss[loss=0.277, simple_loss=0.342, pruned_loss=0.106, over 4259682.86 frames. ], batch size: 159, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:37:58,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671904.0, ans=0.125 2023-06-20 08:37:58,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=671904.0, ans=0.2 2023-06-20 08:38:32,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=672024.0, ans=0.125 2023-06-20 08:38:32,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=672024.0, ans=0.125 2023-06-20 08:38:37,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=672024.0, ans=0.0 2023-06-20 08:39:02,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=672144.0, ans=0.0 2023-06-20 08:39:07,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=672144.0, ans=0.125 2023-06-20 08:39:28,500 INFO [train.py:996] (3/4) Epoch 4, batch 20550, loss[loss=0.2634, simple_loss=0.3117, pruned_loss=0.1075, over 21769.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3339, pruned_loss=0.1038, over 4258703.67 frames. ], batch size: 351, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:39:45,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.765e+02 3.154e+02 3.666e+02 5.388e+02, threshold=6.309e+02, percent-clipped=0.0 2023-06-20 08:39:58,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=672264.0, ans=0.125 2023-06-20 08:41:08,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-20 08:41:12,470 INFO [train.py:996] (3/4) Epoch 4, batch 20600, loss[loss=0.3076, simple_loss=0.3613, pruned_loss=0.1269, over 21745.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3349, pruned_loss=0.1005, over 4259287.73 frames. 
], batch size: 441, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:41:30,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=672504.0, ans=0.125 2023-06-20 08:41:32,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=672504.0, ans=0.125 2023-06-20 08:41:59,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-20 08:42:00,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=672624.0, ans=0.125 2023-06-20 08:42:00,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-20 08:42:11,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=672684.0, ans=0.2 2023-06-20 08:42:26,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-20 08:42:55,704 INFO [train.py:996] (3/4) Epoch 4, batch 20650, loss[loss=0.2363, simple_loss=0.2889, pruned_loss=0.09185, over 21153.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.332, pruned_loss=0.1017, over 4262701.53 frames. ], batch size: 159, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:43:09,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=672804.0, ans=0.07 2023-06-20 08:43:12,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.784e+02 3.361e+02 3.771e+02 8.301e+02, threshold=6.721e+02, percent-clipped=1.0 2023-06-20 08:43:29,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=672864.0, ans=0.125 2023-06-20 08:43:38,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-20 08:43:43,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672924.0, ans=0.1 2023-06-20 08:43:45,299 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:44:01,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-20 08:44:45,610 INFO [train.py:996] (3/4) Epoch 4, batch 20700, loss[loss=0.2269, simple_loss=0.3031, pruned_loss=0.07531, over 21573.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3226, pruned_loss=0.09688, over 4266841.46 frames. ], batch size: 441, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:44:47,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=673104.0, ans=0.0 2023-06-20 08:44:48,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=22.5 2023-06-20 08:45:19,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=673164.0, ans=0.125 2023-06-20 08:45:31,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-20 08:45:47,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=673284.0, ans=0.125 2023-06-20 08:46:29,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-20 08:46:31,823 INFO [train.py:996] (3/4) Epoch 4, batch 20750, loss[loss=0.2108, simple_loss=0.2675, pruned_loss=0.077, over 21787.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3281, pruned_loss=0.09674, over 4268977.93 frames. ], batch size: 118, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:46:48,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.639e+02 3.184e+02 4.013e+02 6.063e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-20 08:46:48,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=673404.0, ans=0.125 2023-06-20 08:47:37,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=673584.0, ans=0.125 2023-06-20 08:48:15,625 INFO [train.py:996] (3/4) Epoch 4, batch 20800, loss[loss=0.2388, simple_loss=0.2954, pruned_loss=0.09111, over 21619.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3314, pruned_loss=0.09796, over 4271006.09 frames. ], batch size: 282, lr: 7.77e-03, grad_scale: 32.0 2023-06-20 08:48:49,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673764.0, ans=0.1 2023-06-20 08:49:21,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=673884.0, ans=0.0 2023-06-20 08:49:57,065 INFO [train.py:996] (3/4) Epoch 4, batch 20850, loss[loss=0.2372, simple_loss=0.2942, pruned_loss=0.09014, over 21187.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.322, pruned_loss=0.0952, over 4260465.74 frames. ], batch size: 607, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:49:57,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=674004.0, ans=0.0 2023-06-20 08:50:15,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.828e+02 3.362e+02 3.995e+02 7.673e+02, threshold=6.724e+02, percent-clipped=2.0 2023-06-20 08:50:32,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=674064.0, ans=0.125 2023-06-20 08:51:38,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=674304.0, ans=0.0 2023-06-20 08:51:39,450 INFO [train.py:996] (3/4) Epoch 4, batch 20900, loss[loss=0.2592, simple_loss=0.3238, pruned_loss=0.0973, over 21569.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3236, pruned_loss=0.09665, over 4271474.34 frames. 
], batch size: 230, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:51:46,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674304.0, ans=0.1 2023-06-20 08:51:59,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=674364.0, ans=0.125 2023-06-20 08:52:11,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=674364.0, ans=0.125 2023-06-20 08:53:08,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=674544.0, ans=0.125 2023-06-20 08:53:21,012 INFO [train.py:996] (3/4) Epoch 4, batch 20950, loss[loss=0.2867, simple_loss=0.3453, pruned_loss=0.114, over 21510.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3195, pruned_loss=0.09247, over 4279226.56 frames. ], batch size: 471, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:53:29,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=674604.0, ans=0.125 2023-06-20 08:53:29,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-20 08:53:33,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.815e+02 3.248e+02 3.950e+02 6.519e+02, threshold=6.496e+02, percent-clipped=0.0 2023-06-20 08:54:42,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=674844.0, ans=0.1 2023-06-20 08:54:56,996 INFO [train.py:996] (3/4) Epoch 4, batch 21000, loss[loss=0.2669, simple_loss=0.3245, pruned_loss=0.1046, over 21966.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3174, pruned_loss=0.09251, over 4267810.42 frames. ], batch size: 316, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:54:56,996 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 08:55:14,688 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2759, simple_loss=0.3744, pruned_loss=0.08874, over 1796401.00 frames. 2023-06-20 08:55:14,689 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 08:55:37,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=674964.0, ans=0.125 2023-06-20 08:56:02,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-20 08:56:06,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=675024.0, ans=0.0 2023-06-20 08:56:24,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-20 08:56:50,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=675204.0, ans=0.125 2023-06-20 08:56:51,309 INFO [train.py:996] (3/4) Epoch 4, batch 21050, loss[loss=0.239, simple_loss=0.2974, pruned_loss=0.09027, over 21576.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3149, pruned_loss=0.09256, over 4252636.09 frames. 
], batch size: 414, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:57:04,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.634e+02 3.115e+02 4.220e+02 7.961e+02, threshold=6.229e+02, percent-clipped=3.0 2023-06-20 08:57:14,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=675264.0, ans=0.125 2023-06-20 08:57:21,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=675264.0, ans=0.125 2023-06-20 08:57:49,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=675384.0, ans=0.125 2023-06-20 08:58:04,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=675384.0, ans=0.125 2023-06-20 08:58:16,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-20 08:58:33,728 INFO [train.py:996] (3/4) Epoch 4, batch 21100, loss[loss=0.2282, simple_loss=0.2858, pruned_loss=0.08535, over 21386.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3119, pruned_loss=0.09208, over 4251911.67 frames. ], batch size: 194, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:58:44,141 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:58:51,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=675564.0, ans=15.0 2023-06-20 08:59:01,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=675564.0, ans=0.1 2023-06-20 08:59:09,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=675564.0, ans=0.125 2023-06-20 08:59:15,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=675624.0, ans=0.125 2023-06-20 09:00:06,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-20 09:00:12,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=675744.0, ans=0.2 2023-06-20 09:00:16,562 INFO [train.py:996] (3/4) Epoch 4, batch 21150, loss[loss=0.2225, simple_loss=0.2853, pruned_loss=0.07986, over 15427.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3083, pruned_loss=0.09236, over 4233398.29 frames. 
], batch size: 60, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 09:00:29,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.738e+02 3.201e+02 4.018e+02 7.456e+02, threshold=6.402e+02, percent-clipped=2.0 2023-06-20 09:00:29,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=675804.0, ans=0.0 2023-06-20 09:01:16,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=675924.0, ans=0.125 2023-06-20 09:01:46,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=676044.0, ans=0.2 2023-06-20 09:01:59,514 INFO [train.py:996] (3/4) Epoch 4, batch 21200, loss[loss=0.2162, simple_loss=0.2854, pruned_loss=0.07353, over 21748.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3049, pruned_loss=0.09208, over 4235261.24 frames. ], batch size: 351, lr: 7.76e-03, grad_scale: 32.0 2023-06-20 09:02:36,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-20 09:02:37,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=676224.0, ans=0.125 2023-06-20 09:02:57,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=676224.0, ans=0.05 2023-06-20 09:02:59,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-20 09:03:10,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=676284.0, ans=0.125 2023-06-20 09:03:27,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-20 09:03:44,319 INFO [train.py:996] (3/4) Epoch 4, batch 21250, loss[loss=0.2393, simple_loss=0.3059, pruned_loss=0.08637, over 21602.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3056, pruned_loss=0.09215, over 4238087.75 frames. ], batch size: 263, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:03:44,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=676404.0, ans=0.125 2023-06-20 09:04:02,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.827e+02 3.407e+02 4.227e+02 7.586e+02, threshold=6.813e+02, percent-clipped=3.0 2023-06-20 09:04:42,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=676524.0, ans=0.125 2023-06-20 09:05:27,285 INFO [train.py:996] (3/4) Epoch 4, batch 21300, loss[loss=0.254, simple_loss=0.3262, pruned_loss=0.09094, over 21592.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3113, pruned_loss=0.09419, over 4251267.56 frames. 
], batch size: 230, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:05:56,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=676764.0, ans=0.0 2023-06-20 09:06:18,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=676824.0, ans=0.015 2023-06-20 09:06:26,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=676824.0, ans=0.125 2023-06-20 09:07:11,221 INFO [train.py:996] (3/4) Epoch 4, batch 21350, loss[loss=0.3081, simple_loss=0.3773, pruned_loss=0.1194, over 21468.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.316, pruned_loss=0.0949, over 4265560.19 frames. ], batch size: 507, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:07:11,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=677004.0, ans=0.0 2023-06-20 09:07:14,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=677004.0, ans=0.125 2023-06-20 09:07:29,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.939e+02 3.451e+02 4.084e+02 6.160e+02, threshold=6.901e+02, percent-clipped=0.0 2023-06-20 09:08:02,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:08:04,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.63 vs. limit=22.5 2023-06-20 09:08:33,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=677184.0, ans=0.0 2023-06-20 09:08:54,420 INFO [train.py:996] (3/4) Epoch 4, batch 21400, loss[loss=0.3167, simple_loss=0.3804, pruned_loss=0.1265, over 21807.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3215, pruned_loss=0.0964, over 4273700.80 frames. ], batch size: 441, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:09:01,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=677304.0, ans=0.125 2023-06-20 09:09:07,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=22.5 2023-06-20 09:09:10,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=15.0 2023-06-20 09:09:12,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=677304.0, ans=0.0 2023-06-20 09:09:16,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=677364.0, ans=0.125 2023-06-20 09:09:46,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=677424.0, ans=0.125 2023-06-20 09:10:09,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=677484.0, ans=0.5 2023-06-20 09:10:23,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=677544.0, ans=0.0 2023-06-20 09:10:31,803 INFO [train.py:996] (3/4) Epoch 4, batch 21450, loss[loss=0.2841, simple_loss=0.3375, pruned_loss=0.1154, over 21726.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3234, pruned_loss=0.09671, over 4276272.03 frames. ], batch size: 473, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:10:49,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.725e+02 3.459e+02 4.369e+02 7.075e+02, threshold=6.919e+02, percent-clipped=1.0 2023-06-20 09:10:53,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=677664.0, ans=0.04949747468305833 2023-06-20 09:11:04,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=677664.0, ans=0.125 2023-06-20 09:11:14,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=677664.0, ans=0.125 2023-06-20 09:11:41,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=677784.0, ans=0.125 2023-06-20 09:11:56,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677844.0, ans=0.1 2023-06-20 09:12:13,304 INFO [train.py:996] (3/4) Epoch 4, batch 21500, loss[loss=0.263, simple_loss=0.3313, pruned_loss=0.09739, over 20905.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3214, pruned_loss=0.09761, over 4280534.69 frames. ], batch size: 607, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:12:18,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=677904.0, ans=0.2 2023-06-20 09:12:35,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=677964.0, ans=0.04949747468305833 2023-06-20 09:13:23,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-20 09:13:24,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=678084.0, ans=0.125 2023-06-20 09:13:40,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=15.0 2023-06-20 09:13:41,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=678144.0, ans=0.125 2023-06-20 09:13:44,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=678144.0, ans=0.125 2023-06-20 09:13:48,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=678144.0, ans=0.125 2023-06-20 09:13:56,017 INFO [train.py:996] (3/4) Epoch 4, batch 21550, loss[loss=0.2047, simple_loss=0.2734, pruned_loss=0.06804, over 21606.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.314, pruned_loss=0.0941, over 4276480.51 frames. ], batch size: 391, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:14:14,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.699e+02 3.223e+02 3.892e+02 7.035e+02, threshold=6.447e+02, percent-clipped=1.0 2023-06-20 09:14:38,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=678264.0, ans=0.1 2023-06-20 09:14:49,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-20 09:15:17,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=678384.0, ans=0.035 2023-06-20 09:15:45,547 INFO [train.py:996] (3/4) Epoch 4, batch 21600, loss[loss=0.2376, simple_loss=0.2889, pruned_loss=0.09312, over 21547.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3109, pruned_loss=0.0931, over 4278705.43 frames. ], batch size: 442, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:15:47,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=678504.0, ans=0.125 2023-06-20 09:15:47,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=678504.0, ans=0.2 2023-06-20 09:15:48,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-20 09:16:11,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=678564.0, ans=0.2 2023-06-20 09:16:18,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=678564.0, ans=0.125 2023-06-20 09:17:25,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=678744.0, ans=0.05 2023-06-20 09:17:29,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=678804.0, ans=0.125 2023-06-20 09:17:29,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-20 09:17:30,278 INFO [train.py:996] (3/4) Epoch 4, batch 21650, loss[loss=0.2742, simple_loss=0.3665, pruned_loss=0.09095, over 21627.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3154, pruned_loss=0.09113, over 4281306.23 frames. 
], batch size: 441, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:17:53,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.809e+02 3.238e+02 3.645e+02 6.702e+02, threshold=6.475e+02, percent-clipped=1.0 2023-06-20 09:18:35,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=678984.0, ans=0.125 2023-06-20 09:18:38,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=678984.0, ans=0.125 2023-06-20 09:19:05,604 INFO [train.py:996] (3/4) Epoch 4, batch 21700, loss[loss=0.2151, simple_loss=0.2769, pruned_loss=0.07666, over 21682.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3166, pruned_loss=0.08957, over 4276396.18 frames. ], batch size: 282, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:19:52,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=679224.0, ans=0.2 2023-06-20 09:20:00,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=679224.0, ans=0.0 2023-06-20 09:20:47,256 INFO [train.py:996] (3/4) Epoch 4, batch 21750, loss[loss=0.2342, simple_loss=0.291, pruned_loss=0.08873, over 21802.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3128, pruned_loss=0.09089, over 4274531.16 frames. ], batch size: 317, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:21:12,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.575e+02 3.289e+02 4.229e+02 7.703e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 09:21:19,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=679464.0, ans=0.2 2023-06-20 09:21:25,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=679464.0, ans=0.125 2023-06-20 09:21:50,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=679584.0, ans=0.1 2023-06-20 09:22:16,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=679644.0, ans=0.125 2023-06-20 09:22:31,482 INFO [train.py:996] (3/4) Epoch 4, batch 21800, loss[loss=0.2164, simple_loss=0.2853, pruned_loss=0.07378, over 21692.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3108, pruned_loss=0.09154, over 4274964.33 frames. ], batch size: 282, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:22:39,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-20 09:22:41,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=679704.0, ans=0.1 2023-06-20 09:23:29,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=679824.0, ans=0.0 2023-06-20 09:23:54,966 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:24:00,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=679944.0, ans=0.2 2023-06-20 09:24:08,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-20 09:24:19,744 INFO [train.py:996] (3/4) Epoch 4, batch 21850, loss[loss=0.2656, simple_loss=0.3306, pruned_loss=0.1003, over 21257.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3158, pruned_loss=0.0917, over 4277092.62 frames. ], batch size: 176, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:24:39,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.758e+02 3.560e+02 4.592e+02 6.859e+02, threshold=7.120e+02, percent-clipped=3.0 2023-06-20 09:25:29,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-20 09:25:49,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=680244.0, ans=0.0 2023-06-20 09:26:00,369 INFO [train.py:996] (3/4) Epoch 4, batch 21900, loss[loss=0.2768, simple_loss=0.3116, pruned_loss=0.121, over 21462.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3169, pruned_loss=0.09285, over 4279922.56 frames. ], batch size: 508, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:26:12,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=680304.0, ans=0.0 2023-06-20 09:26:29,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=680364.0, ans=0.125 2023-06-20 09:27:12,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=680484.0, ans=0.1 2023-06-20 09:27:42,093 INFO [train.py:996] (3/4) Epoch 4, batch 21950, loss[loss=0.2114, simple_loss=0.2953, pruned_loss=0.06372, over 21512.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3113, pruned_loss=0.09158, over 4278430.80 frames. 
], batch size: 441, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:28:06,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.677e+02 3.027e+02 3.757e+02 5.142e+02, threshold=6.054e+02, percent-clipped=0.0 2023-06-20 09:28:08,194 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:28:53,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=680784.0, ans=0.2 2023-06-20 09:29:05,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=680844.0, ans=0.125 2023-06-20 09:29:12,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=680844.0, ans=0.125 2023-06-20 09:29:24,061 INFO [train.py:996] (3/4) Epoch 4, batch 22000, loss[loss=0.3224, simple_loss=0.3638, pruned_loss=0.1405, over 21352.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3038, pruned_loss=0.08764, over 4280704.67 frames. ], batch size: 507, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:30:21,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=681024.0, ans=0.0 2023-06-20 09:30:36,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=681084.0, ans=0.1 2023-06-20 09:31:12,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=681204.0, ans=0.125 2023-06-20 09:31:13,106 INFO [train.py:996] (3/4) Epoch 4, batch 22050, loss[loss=0.3741, simple_loss=0.4367, pruned_loss=0.1558, over 21428.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3071, pruned_loss=0.08861, over 4276289.65 frames. ], batch size: 471, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:31:33,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.650e+02 3.249e+02 4.022e+02 7.710e+02, threshold=6.498e+02, percent-clipped=6.0 2023-06-20 09:32:36,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=681444.0, ans=0.125 2023-06-20 09:32:57,307 INFO [train.py:996] (3/4) Epoch 4, batch 22100, loss[loss=0.2685, simple_loss=0.3315, pruned_loss=0.1027, over 21834.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3221, pruned_loss=0.09533, over 4272935.43 frames. ], batch size: 351, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:34:38,826 INFO [train.py:996] (3/4) Epoch 4, batch 22150, loss[loss=0.2592, simple_loss=0.3267, pruned_loss=0.09583, over 21889.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.325, pruned_loss=0.09735, over 4277503.04 frames. ], batch size: 316, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:34:57,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.000e+02 3.535e+02 4.245e+02 7.467e+02, threshold=7.071e+02, percent-clipped=4.0 2023-06-20 09:35:21,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-20 09:35:30,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=681924.0, ans=0.0 2023-06-20 09:36:21,212 INFO [train.py:996] (3/4) Epoch 4, batch 22200, loss[loss=0.2728, simple_loss=0.3409, pruned_loss=0.1024, over 21780.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3261, pruned_loss=0.09843, over 4286929.79 frames. ], batch size: 441, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:37:02,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682164.0, ans=0.0 2023-06-20 09:37:17,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-20 09:37:25,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=682284.0, ans=0.125 2023-06-20 09:38:02,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=682344.0, ans=0.015 2023-06-20 09:38:02,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=682344.0, ans=0.0 2023-06-20 09:38:08,606 INFO [train.py:996] (3/4) Epoch 4, batch 22250, loss[loss=0.3915, simple_loss=0.4307, pruned_loss=0.1761, over 21453.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3347, pruned_loss=0.1006, over 4281563.29 frames. ], batch size: 471, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:38:15,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-20 09:38:17,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=682404.0, ans=0.125 2023-06-20 09:38:17,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2023-06-20 09:38:23,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.823e+02 3.649e+02 4.510e+02 8.047e+02, threshold=7.298e+02, percent-clipped=1.0 2023-06-20 09:39:09,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682584.0, ans=0.0 2023-06-20 09:39:27,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=682644.0, ans=0.2 2023-06-20 09:39:49,747 INFO [train.py:996] (3/4) Epoch 4, batch 22300, loss[loss=0.2321, simple_loss=0.2904, pruned_loss=0.08691, over 21341.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3365, pruned_loss=0.1028, over 4286503.94 frames. 
], batch size: 159, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:39:56,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682704.0, ans=0.1 2023-06-20 09:40:08,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=682764.0, ans=0.5 2023-06-20 09:40:50,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=682884.0, ans=0.2 2023-06-20 09:41:13,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682944.0, ans=0.1 2023-06-20 09:41:31,557 INFO [train.py:996] (3/4) Epoch 4, batch 22350, loss[loss=0.2527, simple_loss=0.3173, pruned_loss=0.09406, over 21875.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.334, pruned_loss=0.1024, over 4289700.10 frames. ], batch size: 371, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:41:38,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683004.0, ans=0.1 2023-06-20 09:41:46,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.048e+02 3.517e+02 4.689e+02 8.292e+02, threshold=7.034e+02, percent-clipped=3.0 2023-06-20 09:41:54,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=683064.0, ans=0.125 2023-06-20 09:42:46,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-20 09:43:16,278 INFO [train.py:996] (3/4) Epoch 4, batch 22400, loss[loss=0.2589, simple_loss=0.3192, pruned_loss=0.09929, over 21890.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3306, pruned_loss=0.09887, over 4286703.37 frames. ], batch size: 107, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:43:46,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=683364.0, ans=0.0 2023-06-20 09:44:45,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=683544.0, ans=0.0 2023-06-20 09:44:58,367 INFO [train.py:996] (3/4) Epoch 4, batch 22450, loss[loss=0.2491, simple_loss=0.3085, pruned_loss=0.09482, over 21825.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3235, pruned_loss=0.09741, over 4292678.83 frames. ], batch size: 107, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:45:18,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.635e+02 3.000e+02 3.519e+02 5.856e+02, threshold=6.001e+02, percent-clipped=0.0 2023-06-20 09:45:50,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-20 09:46:34,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=683844.0, ans=0.05 2023-06-20 09:46:48,143 INFO [train.py:996] (3/4) Epoch 4, batch 22500, loss[loss=0.2544, simple_loss=0.3019, pruned_loss=0.1035, over 20716.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3194, pruned_loss=0.09737, over 4279331.71 frames. 
], batch size: 607, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:46:59,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=683904.0, ans=0.125 2023-06-20 09:47:06,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683964.0, ans=0.1 2023-06-20 09:47:44,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=684024.0, ans=0.2 2023-06-20 09:48:30,843 INFO [train.py:996] (3/4) Epoch 4, batch 22550, loss[loss=0.248, simple_loss=0.3145, pruned_loss=0.09077, over 21553.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3228, pruned_loss=0.09737, over 4280681.18 frames. ], batch size: 548, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:48:45,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.905e+02 3.340e+02 4.222e+02 9.344e+02, threshold=6.680e+02, percent-clipped=7.0 2023-06-20 09:49:01,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=684264.0, ans=0.04949747468305833 2023-06-20 09:49:16,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=684324.0, ans=0.025 2023-06-20 09:49:19,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=684324.0, ans=0.0 2023-06-20 09:49:39,989 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:50:14,893 INFO [train.py:996] (3/4) Epoch 4, batch 22600, loss[loss=0.208, simple_loss=0.271, pruned_loss=0.07253, over 21613.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3266, pruned_loss=0.09868, over 4286545.25 frames. ], batch size: 230, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:50:15,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=684504.0, ans=0.0 2023-06-20 09:50:18,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=684504.0, ans=0.1 2023-06-20 09:50:26,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=684504.0, ans=0.125 2023-06-20 09:51:27,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=684684.0, ans=0.125 2023-06-20 09:51:57,234 INFO [train.py:996] (3/4) Epoch 4, batch 22650, loss[loss=0.2272, simple_loss=0.2825, pruned_loss=0.08594, over 21823.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3242, pruned_loss=0.09779, over 4285188.18 frames. 
], batch size: 107, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:52:10,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=684804.0, ans=0.0 2023-06-20 09:52:12,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.311e+02 3.814e+02 4.832e+02 8.626e+02, threshold=7.628e+02, percent-clipped=4.0 2023-06-20 09:52:12,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=684864.0, ans=15.0 2023-06-20 09:52:26,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-20 09:52:35,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-20 09:53:24,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.56 vs. limit=8.0 2023-06-20 09:53:40,927 INFO [train.py:996] (3/4) Epoch 4, batch 22700, loss[loss=0.2215, simple_loss=0.2714, pruned_loss=0.08582, over 20650.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3168, pruned_loss=0.09647, over 4275073.00 frames. ], batch size: 607, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:53:55,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=685164.0, ans=0.1 2023-06-20 09:54:46,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685284.0, ans=0.1 2023-06-20 09:55:12,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685344.0, ans=0.125 2023-06-20 09:55:23,774 INFO [train.py:996] (3/4) Epoch 4, batch 22750, loss[loss=0.3076, simple_loss=0.3648, pruned_loss=0.1251, over 21759.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3203, pruned_loss=0.09989, over 4273152.31 frames. ], batch size: 441, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:55:43,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.761e+02 3.069e+02 3.774e+02 7.547e+02, threshold=6.137e+02, percent-clipped=0.0 2023-06-20 09:56:28,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=685584.0, ans=0.2 2023-06-20 09:56:47,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=8.0 2023-06-20 09:57:05,730 INFO [train.py:996] (3/4) Epoch 4, batch 22800, loss[loss=0.2487, simple_loss=0.3071, pruned_loss=0.09516, over 21660.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3247, pruned_loss=0.1029, over 4283181.01 frames. ], batch size: 263, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:18,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=685884.0, ans=0.125 2023-06-20 09:58:47,340 INFO [train.py:996] (3/4) Epoch 4, batch 22850, loss[loss=0.2135, simple_loss=0.2747, pruned_loss=0.07621, over 21640.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3225, pruned_loss=0.1018, over 4284953.31 frames. 
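The slow decay of the lr column is consistent with icefall's Eden schedule, lr = base_lr * ((step^2 + lr_batches^2) / lr_batches^2)^(-1/4) * ((epoch^2 + lr_epochs^2) / lr_epochs^2)^(-1/4), using this run's base_lr=0.045, lr_batches=7500, lr_epochs=1.5. The global step count is not logged here, so the sketch below estimates it from roughly 22.4k optimizer steps per epoch; under that assumption the formula reproduces the logged 7.70-7.71e-03 to within about 1%:

    # Eden learning-rate schedule (assumed formula), evaluated near epoch 4,
    # batch 22850. The steps-per-epoch figure is an estimate from this log.
    def eden_lr(step, epoch, base_lr=0.045, lr_batches=7500.0, lr_epochs=1.5):
        batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    step = 3 * 22400 + 22850      # ~3 full epochs plus the current batch index
    print(f"{eden_lr(step, epoch=4):.2e}")   # ~7.7e-03, as logged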
], batch size: 247, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:57,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=686004.0, ans=0.125 2023-06-20 09:59:05,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-20 09:59:07,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.174e+02 3.905e+02 4.697e+02 7.560e+02, threshold=7.810e+02, percent-clipped=8.0 2023-06-20 09:59:15,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.97 vs. limit=10.0 2023-06-20 09:59:58,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=686184.0, ans=0.0 2023-06-20 10:00:10,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=686184.0, ans=0.125 2023-06-20 10:00:26,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=686244.0, ans=0.1 2023-06-20 10:00:31,651 INFO [train.py:996] (3/4) Epoch 4, batch 22900, loss[loss=0.2939, simple_loss=0.3907, pruned_loss=0.09851, over 21624.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3242, pruned_loss=0.1004, over 4285283.34 frames. ], batch size: 441, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:00:36,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=686304.0, ans=0.125 2023-06-20 10:00:45,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=686304.0, ans=0.125 2023-06-20 10:01:51,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2023-06-20 10:02:22,292 INFO [train.py:996] (3/4) Epoch 4, batch 22950, loss[loss=0.2266, simple_loss=0.2899, pruned_loss=0.08167, over 16472.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3377, pruned_loss=0.09802, over 4277657.61 frames. ], batch size: 60, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:02:41,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.062e+02 3.428e+02 4.438e+02 8.217e+02, threshold=6.855e+02, percent-clipped=1.0 2023-06-20 10:02:55,487 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:04:09,132 INFO [train.py:996] (3/4) Epoch 4, batch 23000, loss[loss=0.2476, simple_loss=0.3118, pruned_loss=0.09169, over 21578.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3371, pruned_loss=0.09579, over 4283696.17 frames. ], batch size: 548, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:04:19,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=686904.0, ans=0.2 2023-06-20 10:04:37,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=686964.0, ans=0.125 2023-06-20 10:05:52,787 INFO [train.py:996] (3/4) Epoch 4, batch 23050, loss[loss=0.2686, simple_loss=0.3235, pruned_loss=0.1068, over 21385.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3376, pruned_loss=0.09827, over 4281487.59 frames. 
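Most lines in this log are scaling.py ScheduledFloat records: scalar hyper-parameters (dropout_p, *_skip_rate, balancer probabilities, whitening limits, bypass scale_min, ...) interpolated piecewise-linearly against a batch counter, with `ans` being the value in effect at the logged batch_count. A minimal re-implementation of that idea; the breakpoints below are made-up illustrations, not this run's actual schedules:

    # A piecewise-linear scheduled scalar keyed on a batch counter; `value` is
    # what these records log as `ans`. Breakpoints here are hypothetical.
    from bisect import bisect_right

    class ScheduledFloat:
        def __init__(self, *points):
            self.xs = [p[0] for p in points]   # batch counts, ascending
            self.ys = [p[1] for p in points]   # values at those counts

        def value(self, batch_count):
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect_right(self.xs, batch_count)
            x0, x1, y0, y1 = self.xs[i-1], self.xs[i], self.ys[i-1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    assert dropout_p.value(686004.0) == 0.1   # far past the last breakpoint

By this point in training (batch_count ~686k) nearly all of the schedules above have flattened at their final values, which is why the same ans=0.0 / 0.1 / 0.125 / 0.2 constants repeat from record to record.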
], batch size: 176, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:05:58,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=687204.0, ans=0.035 2023-06-20 10:06:09,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=687204.0, ans=0.125 2023-06-20 10:06:12,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.843e+02 3.308e+02 4.395e+02 9.677e+02, threshold=6.616e+02, percent-clipped=9.0 2023-06-20 10:07:11,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=687384.0, ans=0.125 2023-06-20 10:07:16,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=687444.0, ans=0.125 2023-06-20 10:07:31,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=687444.0, ans=0.0 2023-06-20 10:07:33,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-20 10:07:35,953 INFO [train.py:996] (3/4) Epoch 4, batch 23100, loss[loss=0.2562, simple_loss=0.3081, pruned_loss=0.1022, over 21557.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3328, pruned_loss=0.09865, over 4274398.50 frames. ], batch size: 391, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:08:45,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=687684.0, ans=0.125 2023-06-20 10:08:57,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=687744.0, ans=0.0 2023-06-20 10:09:17,207 INFO [train.py:996] (3/4) Epoch 4, batch 23150, loss[loss=0.2715, simple_loss=0.3228, pruned_loss=0.1101, over 21768.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3263, pruned_loss=0.09786, over 4275715.72 frames. ], batch size: 441, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:09:30,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=687804.0, ans=0.04949747468305833 2023-06-20 10:09:32,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=687804.0, ans=0.125 2023-06-20 10:09:36,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.820e+02 3.256e+02 3.906e+02 5.764e+02, threshold=6.513e+02, percent-clipped=0.0 2023-06-20 10:09:38,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=687864.0, ans=0.1 2023-06-20 10:09:38,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=687864.0, ans=0.04949747468305833 2023-06-20 10:09:46,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=687864.0, ans=0.02 2023-06-20 10:10:00,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=687924.0, ans=0.0 2023-06-20 10:10:20,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.77 vs. 
limit=15.0 2023-06-20 10:10:21,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=687984.0, ans=0.125 2023-06-20 10:10:53,455 INFO [train.py:996] (3/4) Epoch 4, batch 23200, loss[loss=0.2705, simple_loss=0.3341, pruned_loss=0.1035, over 21823.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.324, pruned_loss=0.09809, over 4281601.50 frames. ], batch size: 124, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:11:00,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=688104.0, ans=0.05 2023-06-20 10:11:09,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=688104.0, ans=0.125 2023-06-20 10:11:09,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=688104.0, ans=0.125 2023-06-20 10:11:18,497 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:11:24,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=688164.0, ans=0.125 2023-06-20 10:12:13,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=688344.0, ans=0.125 2023-06-20 10:12:36,031 INFO [train.py:996] (3/4) Epoch 4, batch 23250, loss[loss=0.352, simple_loss=0.4403, pruned_loss=0.1319, over 19766.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3246, pruned_loss=0.09947, over 4282735.11 frames. ], batch size: 704, lr: 7.69e-03, grad_scale: 16.0 2023-06-20 10:12:38,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-20 10:12:49,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=688404.0, ans=0.0 2023-06-20 10:12:57,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.915e+02 3.512e+02 4.464e+02 9.491e+02, threshold=7.024e+02, percent-clipped=1.0 2023-06-20 10:13:14,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=688464.0, ans=0.125 2023-06-20 10:13:44,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=688584.0, ans=0.2 2023-06-20 10:13:54,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=688584.0, ans=0.0 2023-06-20 10:14:18,780 INFO [train.py:996] (3/4) Epoch 4, batch 23300, loss[loss=0.2484, simple_loss=0.3372, pruned_loss=0.07984, over 21425.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3344, pruned_loss=0.1022, over 4284296.85 frames. 
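The Whitening records compare a statistic of a module's output covariance against a scheduled limit; while metric stays under limit the module is inert, and when it crosses (as in the feed_forward3.out_whiten record above, metric=16.77 vs. limit=15.0) a small corrective gradient pushes the covariance back toward white. The exact statistic is internal to scaling.py; the proxy below is an illustrative stand-in with the same character, equal to 1.0 for perfectly white features and growing with anisotropy:

    # Anisotropy proxy: for covariance C (d x d), d * tr(C @ C) / tr(C)**2 equals
    # mean(eig^2) / mean(eig)**2 >= 1, with equality iff C is a multiple of I.
    import numpy as np

    def whitening_metric(x):
        """x: (num_frames, num_channels) activations."""
        x = x - x.mean(axis=0)
        cov = x.T @ x / len(x)
        d = cov.shape[0]
        return d * float(np.trace(cov @ cov)) / float(np.trace(cov)) ** 2

    rng = np.random.default_rng(0)
    white = rng.standard_normal((4000, 256))
    low_rank = white[:, :8] @ rng.standard_normal((8, 256))   # rank-8 features
    print(whitening_metric(white))      # ~1.0: comfortably under a limit of 15
    print(whitening_metric(low_rank))   # ~32: would trigger the correction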
], batch size: 211, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:14:34,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=688764.0, ans=0.015 2023-06-20 10:14:51,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=688764.0, ans=0.0 2023-06-20 10:15:12,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=688824.0, ans=0.125 2023-06-20 10:15:31,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=688884.0, ans=0.0 2023-06-20 10:15:41,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=688884.0, ans=0.125 2023-06-20 10:16:03,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=688944.0, ans=0.125 2023-06-20 10:16:05,778 INFO [train.py:996] (3/4) Epoch 4, batch 23350, loss[loss=0.2158, simple_loss=0.2964, pruned_loss=0.06762, over 21702.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3384, pruned_loss=0.1011, over 4289339.67 frames. ], batch size: 298, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:16:28,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.739e+02 3.338e+02 4.222e+02 6.703e+02, threshold=6.676e+02, percent-clipped=0.0 2023-06-20 10:16:40,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=689064.0, ans=0.0 2023-06-20 10:16:47,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=689124.0, ans=0.0 2023-06-20 10:17:03,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=689184.0, ans=0.125 2023-06-20 10:17:29,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=689244.0, ans=0.125 2023-06-20 10:17:36,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=689244.0, ans=0.1 2023-06-20 10:17:38,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-20 10:17:42,556 INFO [train.py:996] (3/4) Epoch 4, batch 23400, loss[loss=0.2682, simple_loss=0.3276, pruned_loss=0.1043, over 21824.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3312, pruned_loss=0.09725, over 4284275.12 frames. 
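grad_scale is the dynamic fp16 loss scale. Its path through this section (32.0 until ~batch 23200, 16.0 at 23250, 8.0 at 23300, then back to 16.0 by 23600 and 32.0 by 24000) is the standard dynamic-scaling control loop: halve immediately when a scaled gradient overflows, double again after a run of overflow-free steps. A toy version; the ~350-step growth interval is read off the recovery cadence here and is an assumption, not a documented setting:

    # Dynamic loss scaling: shrink fast on overflow, grow back slowly.
    class ToyGradScaler:
        def __init__(self, init_scale=32.0, growth_interval=350):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self._clean_steps = 0

        def update(self, found_inf):
            if found_inf:                 # overflowed step: halve and skip
                self.scale /= 2.0
                self._clean_steps = 0
            else:                         # clean step: occasionally double
                self._clean_steps += 1
                if self._clean_steps == self.growth_interval:
                    self.scale *= 2.0
                    self._clean_steps = 0

    scaler = ToyGradScaler()
    for found_inf in [True, True] + [False] * 700:   # mimics 32 -> 8 -> 32
        scaler.update(found_inf)
    assert scaler.scale == 32.0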
], batch size: 124, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:18:30,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=689424.0, ans=0.125 2023-06-20 10:18:55,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=689484.0, ans=0.0 2023-06-20 10:19:01,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=689484.0, ans=0.0 2023-06-20 10:19:04,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=689484.0, ans=0.125 2023-06-20 10:19:17,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=689544.0, ans=0.2 2023-06-20 10:19:30,571 INFO [train.py:996] (3/4) Epoch 4, batch 23450, loss[loss=0.2368, simple_loss=0.2787, pruned_loss=0.09747, over 20280.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3329, pruned_loss=0.1005, over 4290812.39 frames. ], batch size: 702, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:19:48,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.746e+02 3.178e+02 4.019e+02 6.793e+02, threshold=6.356e+02, percent-clipped=1.0 2023-06-20 10:20:40,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=689784.0, ans=0.2 2023-06-20 10:21:08,120 INFO [train.py:996] (3/4) Epoch 4, batch 23500, loss[loss=0.2392, simple_loss=0.3009, pruned_loss=0.08872, over 21855.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3333, pruned_loss=0.1026, over 4296053.13 frames. ], batch size: 298, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:21:21,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=689904.0, ans=0.0 2023-06-20 10:22:52,360 INFO [train.py:996] (3/4) Epoch 4, batch 23550, loss[loss=0.2417, simple_loss=0.2921, pruned_loss=0.09563, over 21671.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3268, pruned_loss=0.1017, over 4289485.90 frames. ], batch size: 416, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:23:00,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-20 10:23:02,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=690204.0, ans=0.125 2023-06-20 10:23:10,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.904e+02 3.219e+02 3.862e+02 7.198e+02, threshold=6.438e+02, percent-clipped=1.0 2023-06-20 10:23:58,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=690384.0, ans=0.0 2023-06-20 10:24:03,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=690384.0, ans=0.125 2023-06-20 10:24:30,306 INFO [train.py:996] (3/4) Epoch 4, batch 23600, loss[loss=0.2699, simple_loss=0.3338, pruned_loss=0.103, over 21819.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3285, pruned_loss=0.1017, over 4283717.87 frames. 
], batch size: 247, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:24:51,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=690564.0, ans=0.1 2023-06-20 10:25:05,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=690564.0, ans=0.125 2023-06-20 10:26:00,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=690744.0, ans=0.2 2023-06-20 10:26:10,330 INFO [train.py:996] (3/4) Epoch 4, batch 23650, loss[loss=0.2815, simple_loss=0.3463, pruned_loss=0.1084, over 21205.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3274, pruned_loss=0.09919, over 4282697.42 frames. ], batch size: 143, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:26:21,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-20 10:26:38,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 3.037e+02 3.480e+02 4.305e+02 8.157e+02, threshold=6.960e+02, percent-clipped=7.0 2023-06-20 10:26:52,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-20 10:27:03,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=15.0 2023-06-20 10:27:33,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=690984.0, ans=0.2 2023-06-20 10:27:53,236 INFO [train.py:996] (3/4) Epoch 4, batch 23700, loss[loss=0.3122, simple_loss=0.3677, pruned_loss=0.1283, over 21798.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3312, pruned_loss=0.0994, over 4277070.26 frames. ], batch size: 124, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:28:08,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=691104.0, ans=0.0 2023-06-20 10:28:17,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=691104.0, ans=0.0 2023-06-20 10:28:28,288 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:29:48,025 INFO [train.py:996] (3/4) Epoch 4, batch 23750, loss[loss=0.2459, simple_loss=0.333, pruned_loss=0.07941, over 21805.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3323, pruned_loss=0.09865, over 4275173.88 frames. ], batch size: 282, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:30:11,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.867e+02 3.217e+02 4.313e+02 7.122e+02, threshold=6.434e+02, percent-clipped=1.0 2023-06-20 10:31:37,962 INFO [train.py:996] (3/4) Epoch 4, batch 23800, loss[loss=0.2731, simple_loss=0.3566, pruned_loss=0.0948, over 21638.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3296, pruned_loss=0.09582, over 4273770.76 frames. 
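The balancer* entries above belong to Balancer modules, which keep per-channel activation statistics inside a target range, e.g. a minimum fraction of positive values (min_positive) or bounds on mean absolute value (min_abs / max_abs), by injecting a small corrective gradient; the scheduled prob (typically 0.125 here) is the probability that the correction is applied on a given batch. The gradient injection itself lives inside autograd; this toy only shows the statistics being checked, with illustrative bounds:

    # Balancer-style constraint check: which channels are out of range, applied
    # only with probability `prob` per batch. Bounds here are illustrative.
    import numpy as np

    def balancer_violations(x, min_positive=0.05, min_abs=0.02, prob=0.125,
                            rng=np.random.default_rng(0)):
        """x: (frames, channels). None means the balancer was skipped."""
        if rng.random() >= prob:
            return None
        frac_positive = (x > 0).mean(axis=0)   # fraction of positive values
        mean_abs = np.abs(x).mean(axis=0)      # mean |activation| per channel
        return (frac_positive < min_positive) | (mean_abs < min_abs)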
], batch size: 263, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:31:41,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=691704.0, ans=0.025 2023-06-20 10:32:13,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2023-06-20 10:32:30,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-20 10:32:45,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=691884.0, ans=0.0 2023-06-20 10:33:13,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=691944.0, ans=0.2 2023-06-20 10:33:23,364 INFO [train.py:996] (3/4) Epoch 4, batch 23850, loss[loss=0.3067, simple_loss=0.4216, pruned_loss=0.09592, over 19782.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3383, pruned_loss=0.09766, over 4274723.82 frames. ], batch size: 702, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:33:48,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.013e+02 3.868e+02 5.325e+02 1.077e+03, threshold=7.737e+02, percent-clipped=14.0 2023-06-20 10:33:52,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=692064.0, ans=0.125 2023-06-20 10:33:56,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=692064.0, ans=0.125 2023-06-20 10:34:55,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=692244.0, ans=0.125 2023-06-20 10:35:00,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-20 10:35:13,542 INFO [train.py:996] (3/4) Epoch 4, batch 23900, loss[loss=0.2584, simple_loss=0.3248, pruned_loss=0.09597, over 21595.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3449, pruned_loss=0.09992, over 4283640.04 frames. ], batch size: 263, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:32,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=692544.0, ans=0.07 2023-06-20 10:36:52,305 INFO [train.py:996] (3/4) Epoch 4, batch 23950, loss[loss=0.2701, simple_loss=0.3267, pruned_loss=0.1067, over 21901.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3385, pruned_loss=0.1001, over 4283139.45 frames. 
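Each train.py:996 record carries two aggregates: loss[...] for the current batch alone and tot_loss[...] for a frame-weighted average over a recent window, which is why tot_loss drifts smoothly around 0.26-0.27 across millions of frames while the per-batch loss jumps between roughly 0.21 and 0.35. A sketch of frame-weighted aggregation (the real code also ages old batches out of the window; that policy is an assumption left out here):

    # Frame-weighted running loss, the kind of aggregate behind tot_loss[...].
    class RunningLoss:
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss, num_frames):
            self.loss_sum += loss * num_frames
            self.frames += num_frames

        @property
        def avg(self):
            return self.loss_sum / self.frames

    running = RunningLoss()
    running.update(0.2731, 21638.0)   # batch 23800, from the records above
    running.update(0.3067, 19782.0)   # batch 23850
    print(f"tot_loss ~ {running.avg:.4f} over {running.frames:.0f} frames")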
], batch size: 372, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:54,397 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:37:10,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.929e+02 3.410e+02 4.340e+02 7.845e+02, threshold=6.819e+02, percent-clipped=1.0 2023-06-20 10:37:23,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=692664.0, ans=0.0 2023-06-20 10:37:59,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=692784.0, ans=0.125 2023-06-20 10:38:06,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=692784.0, ans=0.125 2023-06-20 10:38:30,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=692844.0, ans=0.2 2023-06-20 10:38:31,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=692844.0, ans=0.04949747468305833 2023-06-20 10:38:36,279 INFO [train.py:996] (3/4) Epoch 4, batch 24000, loss[loss=0.2568, simple_loss=0.331, pruned_loss=0.09128, over 21713.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3384, pruned_loss=0.102, over 4281462.15 frames. ], batch size: 298, lr: 7.66e-03, grad_scale: 32.0 2023-06-20 10:38:36,279 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 10:38:53,760 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2722, simple_loss=0.3716, pruned_loss=0.08645, over 1796401.00 frames. 2023-06-20 10:38:53,760 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 10:38:56,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=692904.0, ans=0.125 2023-06-20 10:39:22,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=692964.0, ans=0.0 2023-06-20 10:39:29,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=692964.0, ans=0.125 2023-06-20 10:39:46,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=693024.0, ans=0.125 2023-06-20 10:39:50,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-20 10:40:13,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=693084.0, ans=0.0 2023-06-20 10:40:37,764 INFO [train.py:996] (3/4) Epoch 4, batch 24050, loss[loss=0.2598, simple_loss=0.3435, pruned_loss=0.08811, over 21686.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3414, pruned_loss=0.1032, over 4281409.56 frames. 
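The validation block above fits the same loss identity as the training records (0.5 * 0.3716 + 0.08645 = 0.27225, logged as 0.2722), evaluated once over the full ~1.8M-frame dev set, and the memory line is the CUDA allocator's high-water mark on this rank. A sketch of that pattern; model, dev_loader and compute_loss are placeholders, not this recipe's actual names:

    # Periodic validation pass plus allocator high-water mark, as logged above.
    import torch

    def validate(model, dev_loader, compute_loss, device="cuda:3"):
        model.eval()
        total, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch)
                total += float(loss) * num_frames    # frame-weighted sum
                frames += num_frames
        model.train()
        if torch.cuda.is_available():
            peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
            print(f"Maximum memory allocated so far is {peak_mb}MB")
        return total / frames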
], batch size: 389, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:40:38,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=693204.0, ans=0.125 2023-06-20 10:41:03,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.885e+02 3.459e+02 4.129e+02 6.625e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-20 10:41:08,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=693264.0, ans=0.125 2023-06-20 10:41:41,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=693384.0, ans=0.125 2023-06-20 10:41:41,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=693384.0, ans=0.0 2023-06-20 10:42:08,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=693444.0, ans=0.125 2023-06-20 10:42:08,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=693444.0, ans=0.125 2023-06-20 10:42:11,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=693444.0, ans=0.125 2023-06-20 10:42:21,496 INFO [train.py:996] (3/4) Epoch 4, batch 24100, loss[loss=0.2787, simple_loss=0.348, pruned_loss=0.1047, over 21300.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3405, pruned_loss=0.1009, over 4264496.18 frames. ], batch size: 159, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:42:24,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-20 10:42:40,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693504.0, ans=0.125 2023-06-20 10:43:09,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=693624.0, ans=0.125 2023-06-20 10:43:45,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=693744.0, ans=0.125 2023-06-20 10:44:03,688 INFO [train.py:996] (3/4) Epoch 4, batch 24150, loss[loss=0.2681, simple_loss=0.3289, pruned_loss=0.1037, over 21483.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3418, pruned_loss=0.1041, over 4276680.17 frames. ], batch size: 131, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:44:38,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.099e+02 3.647e+02 4.955e+02 8.844e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-20 10:44:46,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=693864.0, ans=0.125 2023-06-20 10:44:47,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=693864.0, ans=0.0 2023-06-20 10:44:49,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. 
limit=15.0 2023-06-20 10:45:06,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=693924.0, ans=0.125 2023-06-20 10:45:19,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=693984.0, ans=0.125 2023-06-20 10:45:27,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=693984.0, ans=0.125 2023-06-20 10:45:28,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-20 10:45:34,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=694044.0, ans=0.5 2023-06-20 10:45:44,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=694044.0, ans=0.1 2023-06-20 10:45:52,579 INFO [train.py:996] (3/4) Epoch 4, batch 24200, loss[loss=0.2535, simple_loss=0.3373, pruned_loss=0.08488, over 21780.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.343, pruned_loss=0.1053, over 4270227.04 frames. ], batch size: 282, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:46:28,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=694164.0, ans=0.07 2023-06-20 10:47:29,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=694344.0, ans=0.125 2023-06-20 10:47:46,455 INFO [train.py:996] (3/4) Epoch 4, batch 24250, loss[loss=0.233, simple_loss=0.3427, pruned_loss=0.06164, over 21196.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3383, pruned_loss=0.09797, over 4275527.27 frames. ], batch size: 548, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:47:50,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694404.0, ans=0.1 2023-06-20 10:47:58,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694404.0, ans=0.1 2023-06-20 10:48:11,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.800e+02 3.363e+02 4.220e+02 7.304e+02, threshold=6.726e+02, percent-clipped=1.0 2023-06-20 10:48:54,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=694584.0, ans=0.125 2023-06-20 10:49:27,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:49:28,789 INFO [train.py:996] (3/4) Epoch 4, batch 24300, loss[loss=0.1742, simple_loss=0.2579, pruned_loss=0.04528, over 21773.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.329, pruned_loss=0.09017, over 4282019.17 frames. 
], batch size: 282, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:49:29,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:49:39,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694704.0, ans=0.1 2023-06-20 10:49:55,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=694764.0, ans=0.1 2023-06-20 10:50:38,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=694944.0, ans=0.0 2023-06-20 10:50:38,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-20 10:51:03,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=694944.0, ans=0.0 2023-06-20 10:51:12,271 INFO [train.py:996] (3/4) Epoch 4, batch 24350, loss[loss=0.214, simple_loss=0.2914, pruned_loss=0.06827, over 21635.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3279, pruned_loss=0.0917, over 4286860.57 frames. ], batch size: 230, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:51:12,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=695004.0, ans=0.125 2023-06-20 10:51:24,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-20 10:51:33,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=695064.0, ans=0.0 2023-06-20 10:51:37,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.964e+02 3.769e+02 4.911e+02 1.046e+03, threshold=7.538e+02, percent-clipped=11.0 2023-06-20 10:51:51,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-20 10:52:56,598 INFO [train.py:996] (3/4) Epoch 4, batch 24400, loss[loss=0.2715, simple_loss=0.3381, pruned_loss=0.1025, over 21718.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3331, pruned_loss=0.09546, over 4289933.00 frames. ], batch size: 124, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:53:07,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695304.0, ans=0.1 2023-06-20 10:53:13,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695364.0, ans=0.1 2023-06-20 10:54:20,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.32 vs. limit=15.0 2023-06-20 10:54:39,018 INFO [train.py:996] (3/4) Epoch 4, batch 24450, loss[loss=0.2791, simple_loss=0.3557, pruned_loss=0.1013, over 21609.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3365, pruned_loss=0.09649, over 4285993.70 frames. 
], batch size: 263, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:54:59,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.828e+02 3.266e+02 3.769e+02 5.234e+02, threshold=6.531e+02, percent-clipped=0.0 2023-06-20 10:55:11,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-20 10:55:12,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=695724.0, ans=0.04949747468305833 2023-06-20 10:55:38,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=695784.0, ans=0.0 2023-06-20 10:56:00,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=695844.0, ans=0.0 2023-06-20 10:56:21,819 INFO [train.py:996] (3/4) Epoch 4, batch 24500, loss[loss=0.2573, simple_loss=0.3158, pruned_loss=0.0994, over 21553.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3362, pruned_loss=0.09607, over 4288362.15 frames. ], batch size: 211, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:56:27,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695904.0, ans=0.1 2023-06-20 10:56:30,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=695904.0, ans=0.125 2023-06-20 10:56:30,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=695904.0, ans=0.125 2023-06-20 10:56:50,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=695964.0, ans=0.0 2023-06-20 10:57:00,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=696024.0, ans=0.125 2023-06-20 10:57:10,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-20 10:57:22,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=696084.0, ans=0.125 2023-06-20 10:57:25,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696084.0, ans=0.1 2023-06-20 10:58:02,718 INFO [train.py:996] (3/4) Epoch 4, batch 24550, loss[loss=0.3221, simple_loss=0.3798, pruned_loss=0.1322, over 21527.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.34, pruned_loss=0.09984, over 4293880.35 frames. ], batch size: 131, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:58:08,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-20 10:58:21,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.855e+02 3.314e+02 4.015e+02 6.051e+02, threshold=6.629e+02, percent-clipped=0.0 2023-06-20 10:59:15,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-20 10:59:40,437 INFO [train.py:996] (3/4) Epoch 4, batch 24600, loss[loss=0.2454, simple_loss=0.2946, pruned_loss=0.09816, over 21216.00 frames. 
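The batch size field swings from 60 to 704 cuts across this section because batches are packed by total audio duration rather than by a fixed utterance count: many short cuts or few long ones per batch, keeping frames-per-batch roughly level. A toy duration-based packer showing the effect; the 900-second budget is an assumption about this run's sampler settings:

    # Duration-based batch packing: batch size varies, total duration does not.
    def pack_by_duration(durations_sec, max_duration_sec=900.0):
        batches, cur, cur_dur = [], [], 0.0
        for d in durations_sec:
            if cur and cur_dur + d > max_duration_sec:
                batches.append(cur)
                cur, cur_dur = [], 0.0
            cur.append(d)
            cur_dur += d
        if cur:
            batches.append(cur)
        return batches

    print(len(pack_by_duration([1.3] * 1000)[0]))   # ~692 short cuts per batch
    print(len(pack_by_duration([8.5] * 1000)[0]))   # ~105 long cuts per batch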
], tot_loss[loss=0.2678, simple_loss=0.3353, pruned_loss=0.1001, over 4285565.68 frames. ], batch size: 143, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:59:50,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-20 10:59:56,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-20 11:00:05,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=696564.0, ans=0.125 2023-06-20 11:00:05,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=696564.0, ans=0.0 2023-06-20 11:00:25,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=696624.0, ans=0.125 2023-06-20 11:00:26,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-20 11:01:18,235 INFO [train.py:996] (3/4) Epoch 4, batch 24650, loss[loss=0.2607, simple_loss=0.3193, pruned_loss=0.1011, over 21418.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3285, pruned_loss=0.09896, over 4286052.02 frames. ], batch size: 131, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 11:01:25,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=696804.0, ans=0.0 2023-06-20 11:01:33,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=696864.0, ans=0.0 2023-06-20 11:01:38,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=696864.0, ans=0.125 2023-06-20 11:01:39,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.151e+02 3.751e+02 4.902e+02 9.106e+02, threshold=7.501e+02, percent-clipped=6.0 2023-06-20 11:02:39,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=696984.0, ans=15.0 2023-06-20 11:02:46,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=697044.0, ans=0.125 2023-06-20 11:03:01,110 INFO [train.py:996] (3/4) Epoch 4, batch 24700, loss[loss=0.2369, simple_loss=0.2976, pruned_loss=0.08809, over 21798.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3259, pruned_loss=0.09761, over 4271737.26 frames. ], batch size: 107, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:03:17,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=697164.0, ans=0.125 2023-06-20 11:03:34,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.46 vs. limit=15.0 2023-06-20 11:03:35,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=697164.0, ans=0.0 2023-06-20 11:03:47,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. 
limit=10.0 2023-06-20 11:04:16,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=697284.0, ans=0.125 2023-06-20 11:04:21,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=697344.0, ans=0.125 2023-06-20 11:04:38,953 INFO [train.py:996] (3/4) Epoch 4, batch 24750, loss[loss=0.2317, simple_loss=0.2822, pruned_loss=0.09064, over 21500.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3187, pruned_loss=0.09476, over 4266402.31 frames. ], batch size: 195, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:04:58,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=697464.0, ans=0.125 2023-06-20 11:05:00,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.625e+02 3.082e+02 3.571e+02 6.291e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-20 11:06:13,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697644.0, ans=0.1 2023-06-20 11:06:17,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=697644.0, ans=0.125 2023-06-20 11:06:20,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=697704.0, ans=0.125 2023-06-20 11:06:22,081 INFO [train.py:996] (3/4) Epoch 4, batch 24800, loss[loss=0.2365, simple_loss=0.2946, pruned_loss=0.08914, over 21744.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3136, pruned_loss=0.09427, over 4274900.33 frames. ], batch size: 247, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:07:34,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:54,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=697944.0, ans=0.125 2023-06-20 11:08:05,494 INFO [train.py:996] (3/4) Epoch 4, batch 24850, loss[loss=0.2398, simple_loss=0.3143, pruned_loss=0.0826, over 21845.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3153, pruned_loss=0.09582, over 4279089.00 frames. ], batch size: 332, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:08:27,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.980e+02 3.585e+02 4.162e+02 8.983e+02, threshold=7.171e+02, percent-clipped=3.0 2023-06-20 11:08:48,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=698124.0, ans=0.025 2023-06-20 11:09:01,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:17,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=698184.0, ans=0.125 2023-06-20 11:09:18,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-06-20 11:09:26,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=15.0 2023-06-20 11:09:49,548 INFO [train.py:996] (3/4) Epoch 4, batch 24900, loss[loss=0.2642, simple_loss=0.333, pruned_loss=0.0977, over 21295.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3175, pruned_loss=0.09605, over 4276898.12 frames. ], batch size: 143, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:10:28,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=698364.0, ans=0.07 2023-06-20 11:10:36,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-20 11:11:06,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=698484.0, ans=0.125 2023-06-20 11:11:10,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=698544.0, ans=0.0 2023-06-20 11:11:21,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=698544.0, ans=0.125 2023-06-20 11:11:29,322 INFO [train.py:996] (3/4) Epoch 4, batch 24950, loss[loss=0.3013, simple_loss=0.37, pruned_loss=0.1163, over 21449.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3261, pruned_loss=0.101, over 4282008.15 frames. ], batch size: 131, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:12:12,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.311e+02 3.992e+02 4.985e+02 7.150e+02, threshold=7.983e+02, percent-clipped=0.0 2023-06-20 11:12:21,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=698664.0, ans=0.125 2023-06-20 11:12:26,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=698724.0, ans=0.125 2023-06-20 11:12:57,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=22.5 2023-06-20 11:13:08,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=698844.0, ans=0.2 2023-06-20 11:13:09,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.06 vs. limit=22.5 2023-06-20 11:13:19,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-20 11:13:20,088 INFO [train.py:996] (3/4) Epoch 4, batch 25000, loss[loss=0.2547, simple_loss=0.3142, pruned_loss=0.09761, over 21085.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3327, pruned_loss=0.1028, over 4282253.10 frames. ], batch size: 143, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:13:41,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=698904.0, ans=0.0 2023-06-20 11:14:05,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=699024.0, ans=0.125 2023-06-20 11:14:35,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-20 11:14:49,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-20 11:14:51,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=699144.0, ans=0.125 2023-06-20 11:15:02,926 INFO [train.py:996] (3/4) Epoch 4, batch 25050, loss[loss=0.2639, simple_loss=0.3118, pruned_loss=0.108, over 21442.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3263, pruned_loss=0.1014, over 4285546.82 frames. ], batch size: 441, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:15:17,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-20 11:15:40,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.775e+02 3.166e+02 3.769e+02 6.146e+02, threshold=6.333e+02, percent-clipped=0.0 2023-06-20 11:15:56,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=699324.0, ans=0.5 2023-06-20 11:16:28,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=699444.0, ans=0.125 2023-06-20 11:16:33,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=699444.0, ans=0.0 2023-06-20 11:16:47,487 INFO [train.py:996] (3/4) Epoch 4, batch 25100, loss[loss=0.2404, simple_loss=0.3186, pruned_loss=0.08107, over 21802.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3221, pruned_loss=0.09985, over 4280249.21 frames. ], batch size: 371, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:17:12,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=699564.0, ans=0.0 2023-06-20 11:17:23,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=699564.0, ans=0.125 2023-06-20 11:17:53,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=699684.0, ans=0.125 2023-06-20 11:18:09,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699744.0, ans=0.1 2023-06-20 11:18:29,729 INFO [train.py:996] (3/4) Epoch 4, batch 25150, loss[loss=0.2561, simple_loss=0.36, pruned_loss=0.0761, over 20791.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3255, pruned_loss=0.09728, over 4272594.51 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:19:00,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.678e+02 3.105e+02 3.619e+02 6.270e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 11:19:53,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=700044.0, ans=0.0 2023-06-20 11:20:03,701 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:20:06,409 INFO [train.py:996] (3/4) Epoch 4, batch 25200, loss[loss=0.2399, simple_loss=0.3135, pruned_loss=0.08315, over 21235.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3239, pruned_loss=0.09472, over 4274473.27 frames. 
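The scaling.py:1052 WithLoss records attach a named auxiliary loss to a tensor, here the self-attention weights, and report its accumulated value; the constant loss-sum=0.000e+00 throughout this section means the auxiliary term is contributing nothing at this stage. A minimal version of the pattern of carrying a named side loss alongside the main objective (just the bookkeeping, not icefall's autograd-based implementation):

    # Track named auxiliary losses next to the main objective.
    import torch

    aux_losses = {}

    def with_loss(x, aux, name):
        """Record `aux` under `name`; pass `x` through unchanged."""
        aux_losses[name] = aux_losses.get(name, 0.0) + aux
        return x

    attn = torch.softmax(torch.randn(4, 8, 8), dim=-1)
    attn = with_loss(attn, attn.new_zeros(()), "self_attn_weights")  # inactive
    print(f"loss-sum={float(aux_losses['self_attn_weights']):.3e}")  # 0.000e+00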
], batch size: 176, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:20:19,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=700104.0, ans=0.0 2023-06-20 11:20:33,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=700164.0, ans=0.2 2023-06-20 11:20:57,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=700224.0, ans=0.125 2023-06-20 11:21:10,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-20 11:21:43,472 INFO [train.py:996] (3/4) Epoch 4, batch 25250, loss[loss=0.2526, simple_loss=0.3041, pruned_loss=0.1005, over 21212.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3218, pruned_loss=0.09362, over 4264216.55 frames. ], batch size: 548, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:22:20,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.744e+02 3.112e+02 3.810e+02 6.947e+02, threshold=6.224e+02, percent-clipped=3.0 2023-06-20 11:23:06,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=700584.0, ans=0.125 2023-06-20 11:23:32,828 INFO [train.py:996] (3/4) Epoch 4, batch 25300, loss[loss=0.2843, simple_loss=0.3339, pruned_loss=0.1173, over 21746.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3188, pruned_loss=0.09289, over 4255478.00 frames. ], batch size: 351, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:23:59,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=700764.0, ans=0.125 2023-06-20 11:24:00,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-20 11:24:20,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=700764.0, ans=15.0 2023-06-20 11:24:36,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=700824.0, ans=0.0 2023-06-20 11:24:44,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=700884.0, ans=0.0 2023-06-20 11:25:09,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=700944.0, ans=0.2 2023-06-20 11:25:27,710 INFO [train.py:996] (3/4) Epoch 4, batch 25350, loss[loss=0.2101, simple_loss=0.2856, pruned_loss=0.06731, over 21174.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3211, pruned_loss=0.09206, over 4252376.51 frames. 
], batch size: 548, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:25:55,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.721e+02 3.100e+02 3.889e+02 7.002e+02, threshold=6.200e+02, percent-clipped=1.0 2023-06-20 11:26:06,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=701124.0, ans=0.125 2023-06-20 11:26:19,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=701124.0, ans=0.125 2023-06-20 11:26:21,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=701184.0, ans=0.125 2023-06-20 11:27:05,567 INFO [train.py:996] (3/4) Epoch 4, batch 25400, loss[loss=0.2165, simple_loss=0.2745, pruned_loss=0.07925, over 21456.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3169, pruned_loss=0.09077, over 4251684.87 frames. ], batch size: 212, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:27:33,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=701364.0, ans=0.125 2023-06-20 11:28:06,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=701484.0, ans=0.125 2023-06-20 11:28:42,926 INFO [train.py:996] (3/4) Epoch 4, batch 25450, loss[loss=0.2657, simple_loss=0.3477, pruned_loss=0.09184, over 21478.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3167, pruned_loss=0.09218, over 4246225.44 frames. ], batch size: 194, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:29:17,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.926e+02 3.539e+02 4.386e+02 7.693e+02, threshold=7.077e+02, percent-clipped=6.0 2023-06-20 11:30:18,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=701844.0, ans=0.125 2023-06-20 11:30:25,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=701844.0, ans=0.0 2023-06-20 11:30:33,819 INFO [train.py:996] (3/4) Epoch 4, batch 25500, loss[loss=0.3537, simple_loss=0.4197, pruned_loss=0.1438, over 21431.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3176, pruned_loss=0.08908, over 4235419.49 frames. ], batch size: 507, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:31:08,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=701964.0, ans=0.1 2023-06-20 11:32:16,665 INFO [train.py:996] (3/4) Epoch 4, batch 25550, loss[loss=0.2596, simple_loss=0.3601, pruned_loss=0.07958, over 19827.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3259, pruned_loss=0.08994, over 4243121.49 frames. 
], batch size: 702, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:32:17,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=702204.0, ans=0.0 2023-06-20 11:32:18,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=702204.0, ans=0.2 2023-06-20 11:32:28,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=702204.0, ans=0.2 2023-06-20 11:32:32,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-20 11:32:40,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702264.0, ans=0.1 2023-06-20 11:32:45,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.927e+02 3.531e+02 4.668e+02 7.861e+02, threshold=7.061e+02, percent-clipped=1.0 2023-06-20 11:32:50,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=702264.0, ans=0.5 2023-06-20 11:33:06,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=702324.0, ans=0.125 2023-06-20 11:33:45,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702444.0, ans=0.1 2023-06-20 11:34:05,303 INFO [train.py:996] (3/4) Epoch 4, batch 25600, loss[loss=0.2883, simple_loss=0.3539, pruned_loss=0.1113, over 21325.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3294, pruned_loss=0.09044, over 4246560.86 frames. ], batch size: 548, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:34:17,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702504.0, ans=0.1 2023-06-20 11:34:33,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-20 11:34:34,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=702564.0, ans=0.125 2023-06-20 11:34:49,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=702624.0, ans=0.125 2023-06-20 11:34:53,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=702684.0, ans=0.125 2023-06-20 11:35:47,373 INFO [train.py:996] (3/4) Epoch 4, batch 25650, loss[loss=0.3167, simple_loss=0.4422, pruned_loss=0.09558, over 19735.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3309, pruned_loss=0.09385, over 4246075.44 frames. ], batch size: 702, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:36:10,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.974e+02 3.726e+02 4.769e+02 9.123e+02, threshold=7.452e+02, percent-clipped=4.0 2023-06-20 11:36:24,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. 
limit=15.0 2023-06-20 11:37:31,108 INFO [train.py:996] (3/4) Epoch 4, batch 25700, loss[loss=0.285, simple_loss=0.3519, pruned_loss=0.109, over 21527.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3268, pruned_loss=0.09497, over 4248418.99 frames. ], batch size: 471, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:37:59,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=703164.0, ans=0.2 2023-06-20 11:38:07,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=703224.0, ans=0.2 2023-06-20 11:38:25,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=703284.0, ans=0.125 2023-06-20 11:38:28,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=703284.0, ans=0.2 2023-06-20 11:38:47,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=703344.0, ans=0.0 2023-06-20 11:39:12,200 INFO [train.py:996] (3/4) Epoch 4, batch 25750, loss[loss=0.4376, simple_loss=0.4718, pruned_loss=0.2017, over 21413.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3329, pruned_loss=0.09955, over 4252039.05 frames. ], batch size: 508, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:39:16,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-20 11:39:17,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=703404.0, ans=0.125 2023-06-20 11:39:25,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-20 11:39:35,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.258e+02 3.853e+02 4.721e+02 7.384e+02, threshold=7.705e+02, percent-clipped=0.0 2023-06-20 11:39:41,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=703464.0, ans=0.025 2023-06-20 11:40:11,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=703524.0, ans=0.2 2023-06-20 11:40:11,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=703524.0, ans=0.2 2023-06-20 11:40:27,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=703584.0, ans=0.2 2023-06-20 11:40:38,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=15.0 2023-06-20 11:40:57,297 INFO [train.py:996] (3/4) Epoch 4, batch 25800, loss[loss=0.3624, simple_loss=0.4079, pruned_loss=0.1585, over 21436.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.345, pruned_loss=0.1044, over 4257351.39 frames. 
], batch size: 471, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:41:14,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=703764.0, ans=0.125 2023-06-20 11:41:54,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=703824.0, ans=0.0 2023-06-20 11:42:29,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-20 11:42:38,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=704004.0, ans=0.0 2023-06-20 11:42:39,608 INFO [train.py:996] (3/4) Epoch 4, batch 25850, loss[loss=0.3122, simple_loss=0.3612, pruned_loss=0.1316, over 21672.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3471, pruned_loss=0.1037, over 4262687.78 frames. ], batch size: 473, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:43:23,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.030e+02 3.694e+02 4.405e+02 6.989e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-20 11:43:39,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2023-06-20 11:44:28,573 INFO [train.py:996] (3/4) Epoch 4, batch 25900, loss[loss=0.3683, simple_loss=0.4388, pruned_loss=0.1489, over 21582.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3492, pruned_loss=0.1047, over 4274836.52 frames. ], batch size: 471, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:44:40,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=704304.0, ans=0.0 2023-06-20 11:45:27,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-20 11:45:27,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-20 11:46:17,918 INFO [train.py:996] (3/4) Epoch 4, batch 25950, loss[loss=0.2974, simple_loss=0.3625, pruned_loss=0.1162, over 21949.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3534, pruned_loss=0.1071, over 4274809.33 frames. ], batch size: 317, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:46:18,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-20 11:46:26,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=704604.0, ans=0.2 2023-06-20 11:46:52,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.258e+02 4.006e+02 4.613e+02 7.769e+02, threshold=8.011e+02, percent-clipped=1.0 2023-06-20 11:46:59,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=704724.0, ans=0.0 2023-06-20 11:47:10,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-20 11:48:11,730 INFO [train.py:996] (3/4) Epoch 4, batch 26000, loss[loss=0.3148, simple_loss=0.4154, pruned_loss=0.1072, over 19738.00 frames. 
], tot_loss[loss=0.283, simple_loss=0.354, pruned_loss=0.1059, over 4266487.66 frames. ], batch size: 703, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:48:27,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-20 11:48:35,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=704964.0, ans=0.0 2023-06-20 11:48:56,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=705024.0, ans=0.125 2023-06-20 11:48:58,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705024.0, ans=0.1 2023-06-20 11:49:53,023 INFO [train.py:996] (3/4) Epoch 4, batch 26050, loss[loss=0.2909, simple_loss=0.3401, pruned_loss=0.1208, over 21924.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3529, pruned_loss=0.1066, over 4269313.70 frames. ], batch size: 351, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:49:53,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=705204.0, ans=0.2 2023-06-20 11:50:01,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=705204.0, ans=0.5 2023-06-20 11:50:17,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.846e+02 3.300e+02 3.976e+02 7.984e+02, threshold=6.600e+02, percent-clipped=0.0 2023-06-20 11:50:34,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-20 11:51:02,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-20 11:51:16,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-20 11:51:35,534 INFO [train.py:996] (3/4) Epoch 4, batch 26100, loss[loss=0.2485, simple_loss=0.3191, pruned_loss=0.08889, over 21485.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3481, pruned_loss=0.1061, over 4276140.01 frames. ], batch size: 131, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:52:23,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705624.0, ans=0.1 2023-06-20 11:52:30,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=705684.0, ans=0.05 2023-06-20 11:52:48,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-20 11:52:51,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=705744.0, ans=0.95 2023-06-20 11:53:19,670 INFO [train.py:996] (3/4) Epoch 4, batch 26150, loss[loss=0.3264, simple_loss=0.3769, pruned_loss=0.138, over 21816.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3453, pruned_loss=0.1071, over 4289155.02 frames. 
], batch size: 441, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:53:28,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=705804.0, ans=0.125 2023-06-20 11:53:39,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-20 11:53:45,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.006e+02 3.444e+02 4.248e+02 6.303e+02, threshold=6.888e+02, percent-clipped=0.0 2023-06-20 11:55:05,064 INFO [train.py:996] (3/4) Epoch 4, batch 26200, loss[loss=0.3522, simple_loss=0.429, pruned_loss=0.1377, over 21675.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3454, pruned_loss=0.1056, over 4283912.82 frames. ], batch size: 441, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:55:39,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=706164.0, ans=0.0 2023-06-20 11:55:46,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-20 11:55:54,456 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:56:42,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=706344.0, ans=0.125 2023-06-20 11:56:46,811 INFO [train.py:996] (3/4) Epoch 4, batch 26250, loss[loss=0.2454, simple_loss=0.3154, pruned_loss=0.08774, over 21856.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.349, pruned_loss=0.1038, over 4278480.27 frames. ], batch size: 282, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:56:49,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=706404.0, ans=0.125 2023-06-20 11:56:51,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=706404.0, ans=0.125 2023-06-20 11:57:12,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.833e+02 3.243e+02 4.065e+02 7.438e+02, threshold=6.486e+02, percent-clipped=1.0 2023-06-20 11:57:27,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=706524.0, ans=0.125 2023-06-20 11:57:45,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=706584.0, ans=0.125 2023-06-20 11:58:00,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-20 11:58:28,948 INFO [train.py:996] (3/4) Epoch 4, batch 26300, loss[loss=0.3249, simple_loss=0.3634, pruned_loss=0.1432, over 21753.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3447, pruned_loss=0.1036, over 4280602.93 frames. ], batch size: 508, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:58:44,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=706764.0, ans=0.125 2023-06-20 12:00:14,452 INFO [train.py:996] (3/4) Epoch 4, batch 26350, loss[loss=0.237, simple_loss=0.2996, pruned_loss=0.08723, over 20085.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3422, pruned_loss=0.1037, over 4280250.66 frames. 
], batch size: 703, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:00:31,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=707004.0, ans=0.125 2023-06-20 12:00:42,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=707064.0, ans=0.0 2023-06-20 12:00:43,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-20 12:00:50,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.098e+02 3.455e+02 4.050e+02 6.767e+02, threshold=6.909e+02, percent-clipped=5.0 2023-06-20 12:00:59,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-20 12:01:03,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=707124.0, ans=0.0 2023-06-20 12:01:24,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=707184.0, ans=0.0 2023-06-20 12:01:30,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=707184.0, ans=0.125 2023-06-20 12:01:37,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-20 12:01:40,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=707244.0, ans=0.125 2023-06-20 12:01:53,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=707244.0, ans=0.0 2023-06-20 12:01:57,020 INFO [train.py:996] (3/4) Epoch 4, batch 26400, loss[loss=0.2053, simple_loss=0.2665, pruned_loss=0.07202, over 21580.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.338, pruned_loss=0.1047, over 4270416.01 frames. ], batch size: 263, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:02:14,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=707304.0, ans=0.1 2023-06-20 12:02:22,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=707364.0, ans=0.0 2023-06-20 12:02:56,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.10 vs. limit=12.0 2023-06-20 12:03:49,514 INFO [train.py:996] (3/4) Epoch 4, batch 26450, loss[loss=0.2577, simple_loss=0.303, pruned_loss=0.1062, over 21832.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3401, pruned_loss=0.1043, over 4260969.34 frames. ], batch size: 102, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:03:49,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=707604.0, ans=0.2 2023-06-20 12:03:52,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-20 12:04:08,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=707604.0, ans=0.125 2023-06-20 12:04:21,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.976e+02 3.665e+02 4.778e+02 9.045e+02, threshold=7.330e+02, percent-clipped=3.0 2023-06-20 12:04:34,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=707724.0, ans=0.125 2023-06-20 12:04:35,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=707724.0, ans=0.0 2023-06-20 12:05:08,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=707784.0, ans=0.125 2023-06-20 12:05:12,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=707844.0, ans=0.125 2023-06-20 12:05:35,239 INFO [train.py:996] (3/4) Epoch 4, batch 26500, loss[loss=0.2542, simple_loss=0.3397, pruned_loss=0.08431, over 21785.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3402, pruned_loss=0.1026, over 4264776.63 frames. ], batch size: 332, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:07:12,273 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:07:32,508 INFO [train.py:996] (3/4) Epoch 4, batch 26550, loss[loss=0.1966, simple_loss=0.2589, pruned_loss=0.06716, over 21252.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3372, pruned_loss=0.09944, over 4266277.29 frames. ], batch size: 176, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:07:48,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708264.0, ans=0.1 2023-06-20 12:08:00,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.832e+02 3.306e+02 3.943e+02 6.835e+02, threshold=6.613e+02, percent-clipped=0.0 2023-06-20 12:08:23,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=708324.0, ans=0.2 2023-06-20 12:09:10,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708444.0, ans=0.1 2023-06-20 12:09:13,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=708444.0, ans=0.125 2023-06-20 12:09:16,645 INFO [train.py:996] (3/4) Epoch 4, batch 26600, loss[loss=0.2509, simple_loss=0.3229, pruned_loss=0.08949, over 21732.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3342, pruned_loss=0.09558, over 4266805.97 frames. ], batch size: 351, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:09:35,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-06-20 12:10:54,693 INFO [train.py:996] (3/4) Epoch 4, batch 26650, loss[loss=0.2063, simple_loss=0.2712, pruned_loss=0.07068, over 21605.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3274, pruned_loss=0.09407, over 4268851.86 frames. 
], batch size: 263, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:11:00,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=708804.0, ans=0.0 2023-06-20 12:11:00,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2023-06-20 12:11:27,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.914e+02 3.392e+02 4.054e+02 7.182e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 12:11:29,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=708864.0, ans=0.125 2023-06-20 12:11:37,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=708924.0, ans=0.125 2023-06-20 12:12:28,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=22.5 2023-06-20 12:12:32,346 INFO [train.py:996] (3/4) Epoch 4, batch 26700, loss[loss=0.2412, simple_loss=0.3025, pruned_loss=0.08995, over 21301.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3193, pruned_loss=0.09033, over 4273042.64 frames. ], batch size: 608, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:13:53,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=709344.0, ans=0.125 2023-06-20 12:14:11,929 INFO [train.py:996] (3/4) Epoch 4, batch 26750, loss[loss=0.2633, simple_loss=0.3414, pruned_loss=0.09259, over 21914.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3179, pruned_loss=0.08811, over 4275806.33 frames. ], batch size: 372, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:14:17,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=709404.0, ans=0.125 2023-06-20 12:14:32,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=709464.0, ans=0.125 2023-06-20 12:14:41,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=709464.0, ans=0.125 2023-06-20 12:14:50,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.700e+02 3.315e+02 4.013e+02 5.519e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-20 12:15:26,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=709584.0, ans=0.0 2023-06-20 12:15:35,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=709584.0, ans=0.0 2023-06-20 12:15:51,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=709644.0, ans=0.0 2023-06-20 12:15:56,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.24 vs. limit=5.0 2023-06-20 12:15:56,456 INFO [train.py:996] (3/4) Epoch 4, batch 26800, loss[loss=0.322, simple_loss=0.3808, pruned_loss=0.1316, over 21693.00 frames. ], tot_loss[loss=0.256, simple_loss=0.326, pruned_loss=0.09296, over 4278496.33 frames. 
], batch size: 351, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:16:10,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=709704.0, ans=0.0 2023-06-20 12:16:34,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=709764.0, ans=0.125 2023-06-20 12:16:42,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. limit=12.0 2023-06-20 12:17:05,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=709884.0, ans=0.125 2023-06-20 12:17:09,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=709884.0, ans=0.125 2023-06-20 12:17:24,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=709944.0, ans=0.0 2023-06-20 12:17:43,540 INFO [train.py:996] (3/4) Epoch 4, batch 26850, loss[loss=0.2603, simple_loss=0.309, pruned_loss=0.1058, over 21801.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3288, pruned_loss=0.09659, over 4277135.15 frames. ], batch size: 98, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:17:47,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=710004.0, ans=10.0 2023-06-20 12:18:21,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 2.953e+02 3.297e+02 3.985e+02 6.841e+02, threshold=6.593e+02, percent-clipped=1.0 2023-06-20 12:18:55,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-20 12:19:20,900 INFO [train.py:996] (3/4) Epoch 4, batch 26900, loss[loss=0.2338, simple_loss=0.289, pruned_loss=0.08932, over 21804.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3203, pruned_loss=0.09544, over 4276780.16 frames. ], batch size: 352, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:19:32,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=710304.0, ans=0.0 2023-06-20 12:20:11,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=710424.0, ans=0.2 2023-06-20 12:20:18,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=710424.0, ans=0.125 2023-06-20 12:21:01,661 INFO [train.py:996] (3/4) Epoch 4, batch 26950, loss[loss=0.2342, simple_loss=0.2957, pruned_loss=0.08631, over 21813.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3202, pruned_loss=0.09576, over 4272286.26 frames. 
], batch size: 98, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:21:16,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=710604.0, ans=0.0 2023-06-20 12:21:18,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=710604.0, ans=0.125 2023-06-20 12:21:23,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=710664.0, ans=0.0 2023-06-20 12:21:38,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.998e+02 3.332e+02 4.075e+02 6.086e+02, threshold=6.663e+02, percent-clipped=0.0 2023-06-20 12:21:40,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=710664.0, ans=0.125 2023-06-20 12:21:53,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-20 12:22:45,190 INFO [train.py:996] (3/4) Epoch 4, batch 27000, loss[loss=0.2549, simple_loss=0.3396, pruned_loss=0.08513, over 21395.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3211, pruned_loss=0.09349, over 4264399.27 frames. ], batch size: 471, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:22:45,190 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 12:22:54,857 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6803, 3.5522, 2.0653, 1.5172], device='cuda:3') 2023-06-20 12:23:07,057 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2473, simple_loss=0.3466, pruned_loss=0.07399, over 1796401.00 frames. 2023-06-20 12:23:07,058 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 12:23:07,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=710904.0, ans=0.125 2023-06-20 12:24:50,422 INFO [train.py:996] (3/4) Epoch 4, batch 27050, loss[loss=0.2575, simple_loss=0.3341, pruned_loss=0.09048, over 21751.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.323, pruned_loss=0.09011, over 4267088.31 frames. ], batch size: 112, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:25:23,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.528e+02 2.843e+02 3.432e+02 6.081e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-20 12:25:45,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=711384.0, ans=0.125 2023-06-20 12:26:04,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=711384.0, ans=0.125 2023-06-20 12:26:13,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=711444.0, ans=0.2 2023-06-20 12:26:28,419 INFO [train.py:996] (3/4) Epoch 4, batch 27100, loss[loss=0.2434, simple_loss=0.3358, pruned_loss=0.0755, over 21777.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3266, pruned_loss=0.0922, over 4273415.40 frames. 
], batch size: 298, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:27:03,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=711564.0, ans=0.125 2023-06-20 12:27:13,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=711624.0, ans=0.125 2023-06-20 12:27:15,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711624.0, ans=0.1 2023-06-20 12:27:17,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=711624.0, ans=0.2 2023-06-20 12:27:18,682 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:27:42,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=711684.0, ans=0.125 2023-06-20 12:27:42,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-20 12:28:05,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=711744.0, ans=0.015 2023-06-20 12:28:16,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=711744.0, ans=0.125 2023-06-20 12:28:17,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=711804.0, ans=0.0 2023-06-20 12:28:19,200 INFO [train.py:996] (3/4) Epoch 4, batch 27150, loss[loss=0.3746, simple_loss=0.447, pruned_loss=0.1511, over 21538.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3375, pruned_loss=0.09516, over 4269478.06 frames. ], batch size: 471, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:28:27,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=711804.0, ans=0.125 2023-06-20 12:28:47,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.195e+02 3.755e+02 4.562e+02 7.359e+02, threshold=7.509e+02, percent-clipped=7.0 2023-06-20 12:29:57,437 INFO [train.py:996] (3/4) Epoch 4, batch 27200, loss[loss=0.2691, simple_loss=0.3338, pruned_loss=0.1022, over 21300.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3455, pruned_loss=0.09829, over 4277544.56 frames. ], batch size: 159, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:29:59,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=712104.0, ans=0.04949747468305833 2023-06-20 12:30:28,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=712164.0, ans=0.125 2023-06-20 12:30:42,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-20 12:31:06,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=712284.0, ans=0.125 2023-06-20 12:31:40,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.61 vs. 
limit=15.0 2023-06-20 12:31:42,694 INFO [train.py:996] (3/4) Epoch 4, batch 27250, loss[loss=0.3558, simple_loss=0.3994, pruned_loss=0.1562, over 21431.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3505, pruned_loss=0.104, over 4279284.29 frames. ], batch size: 510, lr: 7.56e-03, grad_scale: 16.0 2023-06-20 12:31:47,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=712404.0, ans=0.125 2023-06-20 12:31:48,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=712404.0, ans=0.0 2023-06-20 12:32:17,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-20 12:32:18,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.455e+02 4.039e+02 4.883e+02 8.665e+02, threshold=8.078e+02, percent-clipped=1.0 2023-06-20 12:33:32,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.99 vs. limit=12.0 2023-06-20 12:33:33,286 INFO [train.py:996] (3/4) Epoch 4, batch 27300, loss[loss=0.2837, simple_loss=0.3713, pruned_loss=0.0981, over 20749.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3525, pruned_loss=0.1049, over 4281398.59 frames. ], batch size: 607, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:33:35,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=712704.0, ans=0.125 2023-06-20 12:33:39,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-20 12:34:10,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-20 12:34:18,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=712764.0, ans=0.07 2023-06-20 12:34:20,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=712824.0, ans=0.125 2023-06-20 12:34:30,132 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:34:38,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-20 12:35:16,724 INFO [train.py:996] (3/4) Epoch 4, batch 27350, loss[loss=0.2652, simple_loss=0.3433, pruned_loss=0.09357, over 21720.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3542, pruned_loss=0.1048, over 4284951.31 frames. 
], batch size: 414, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:35:50,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=713064.0, ans=0.125 2023-06-20 12:36:00,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.787e+02 3.114e+02 3.820e+02 5.936e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 12:36:01,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=713064.0, ans=0.2 2023-06-20 12:36:13,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=713124.0, ans=0.125 2023-06-20 12:36:58,510 INFO [train.py:996] (3/4) Epoch 4, batch 27400, loss[loss=0.2672, simple_loss=0.3271, pruned_loss=0.1037, over 21738.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3482, pruned_loss=0.1038, over 4282590.55 frames. ], batch size: 112, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:37:20,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=713364.0, ans=0.0 2023-06-20 12:37:20,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=713364.0, ans=0.125 2023-06-20 12:37:30,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=713364.0, ans=0.125 2023-06-20 12:37:48,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-20 12:38:41,420 INFO [train.py:996] (3/4) Epoch 4, batch 27450, loss[loss=0.2404, simple_loss=0.3271, pruned_loss=0.0769, over 21556.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3406, pruned_loss=0.1012, over 4284118.18 frames. ], batch size: 230, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:39:12,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-20 12:39:26,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.637e+02 2.964e+02 3.334e+02 5.036e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-20 12:39:32,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=713724.0, ans=0.0 2023-06-20 12:39:36,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=713724.0, ans=0.125 2023-06-20 12:40:23,676 INFO [train.py:996] (3/4) Epoch 4, batch 27500, loss[loss=0.2144, simple_loss=0.2846, pruned_loss=0.07204, over 21604.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3393, pruned_loss=0.1024, over 4292413.80 frames. ], batch size: 263, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:42:02,872 INFO [train.py:996] (3/4) Epoch 4, batch 27550, loss[loss=0.2355, simple_loss=0.3028, pruned_loss=0.08407, over 21741.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3365, pruned_loss=0.09978, over 4282719.99 frames. ], batch size: 351, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:42:42,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. 
limit=10.0 2023-06-20 12:42:49,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.202e+02 3.879e+02 5.081e+02 9.458e+02, threshold=7.759e+02, percent-clipped=14.0 2023-06-20 12:42:52,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=714324.0, ans=0.0 2023-06-20 12:43:19,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=714384.0, ans=0.125 2023-06-20 12:43:50,122 INFO [train.py:996] (3/4) Epoch 4, batch 27600, loss[loss=0.2237, simple_loss=0.2801, pruned_loss=0.08364, over 21597.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3286, pruned_loss=0.09832, over 4269118.77 frames. ], batch size: 247, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:43:51,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-20 12:44:08,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=714504.0, ans=0.1 2023-06-20 12:44:56,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=714684.0, ans=10.0 2023-06-20 12:45:24,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=714804.0, ans=0.0 2023-06-20 12:45:26,196 INFO [train.py:996] (3/4) Epoch 4, batch 27650, loss[loss=0.2443, simple_loss=0.3222, pruned_loss=0.08319, over 21471.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3228, pruned_loss=0.09784, over 4271037.91 frames. ], batch size: 211, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:45:26,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=714804.0, ans=0.125 2023-06-20 12:45:39,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-20 12:45:39,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-20 12:46:05,561 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.761e+02 3.105e+02 3.536e+02 5.675e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 12:46:07,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=714924.0, ans=0.125 2023-06-20 12:47:03,666 INFO [train.py:996] (3/4) Epoch 4, batch 27700, loss[loss=0.2458, simple_loss=0.3017, pruned_loss=0.09491, over 20229.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3222, pruned_loss=0.0965, over 4265014.23 frames. 
], batch size: 703, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:47:56,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=715224.0, ans=0.125 2023-06-20 12:48:03,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=715224.0, ans=0.0 2023-06-20 12:48:09,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=715224.0, ans=0.125 2023-06-20 12:48:51,172 INFO [train.py:996] (3/4) Epoch 4, batch 27750, loss[loss=0.3007, simple_loss=0.3805, pruned_loss=0.1104, over 21498.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3266, pruned_loss=0.09679, over 4270487.17 frames. ], batch size: 508, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:49:33,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.026e+02 3.500e+02 4.221e+02 6.656e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-20 12:50:02,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=715584.0, ans=0.0 2023-06-20 12:50:08,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=715584.0, ans=0.125 2023-06-20 12:50:19,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=715644.0, ans=10.0 2023-06-20 12:50:29,327 INFO [train.py:996] (3/4) Epoch 4, batch 27800, loss[loss=0.2989, simple_loss=0.3558, pruned_loss=0.121, over 21844.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3248, pruned_loss=0.09649, over 4275400.70 frames. ], batch size: 124, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:52:03,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-20 12:52:04,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=715944.0, ans=0.0 2023-06-20 12:52:17,402 INFO [train.py:996] (3/4) Epoch 4, batch 27850, loss[loss=0.2412, simple_loss=0.3066, pruned_loss=0.08793, over 21610.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3245, pruned_loss=0.09787, over 4287514.88 frames. ], batch size: 263, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:52:46,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-20 12:52:47,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=716064.0, ans=0.125 2023-06-20 12:53:00,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.961e+02 3.530e+02 4.217e+02 1.068e+03, threshold=7.060e+02, percent-clipped=1.0 2023-06-20 12:53:18,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-20 12:54:13,187 INFO [train.py:996] (3/4) Epoch 4, batch 27900, loss[loss=0.2342, simple_loss=0.3122, pruned_loss=0.07816, over 21243.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3323, pruned_loss=0.09814, over 4293563.32 frames. 
], batch size: 176, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:54:21,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=716304.0, ans=0.0 2023-06-20 12:54:28,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-20 12:55:12,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=716484.0, ans=0.0 2023-06-20 12:55:33,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=716544.0, ans=0.0 2023-06-20 12:55:56,702 INFO [train.py:996] (3/4) Epoch 4, batch 27950, loss[loss=0.3374, simple_loss=0.4008, pruned_loss=0.1369, over 21391.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3318, pruned_loss=0.09402, over 4289626.00 frames. ], batch size: 507, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 12:56:33,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.859e+02 3.547e+02 4.362e+02 7.820e+02, threshold=7.095e+02, percent-clipped=2.0 2023-06-20 12:56:33,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=716724.0, ans=0.125 2023-06-20 12:57:30,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-20 12:57:34,405 INFO [train.py:996] (3/4) Epoch 4, batch 28000, loss[loss=0.2579, simple_loss=0.3472, pruned_loss=0.08428, over 21271.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3295, pruned_loss=0.09153, over 4290131.12 frames. ], batch size: 549, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:58:37,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=717084.0, ans=0.125 2023-06-20 12:58:52,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=717084.0, ans=0.125 2023-06-20 12:59:17,599 INFO [train.py:996] (3/4) Epoch 4, batch 28050, loss[loss=0.232, simple_loss=0.3144, pruned_loss=0.07482, over 21822.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3268, pruned_loss=0.09229, over 4296737.03 frames. ], batch size: 332, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:59:19,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=717204.0, ans=0.0 2023-06-20 12:59:21,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=717204.0, ans=0.2 2023-06-20 12:59:38,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=717264.0, ans=0.09899494936611666 2023-06-20 12:59:54,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.872e+02 3.334e+02 4.118e+02 8.421e+02, threshold=6.667e+02, percent-clipped=2.0 2023-06-20 13:00:34,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717384.0, ans=0.1 2023-06-20 13:01:00,656 INFO [train.py:996] (3/4) Epoch 4, batch 28100, loss[loss=0.2236, simple_loss=0.2773, pruned_loss=0.08497, over 21236.00 frames. 
], tot_loss[loss=0.2551, simple_loss=0.3253, pruned_loss=0.09246, over 4278014.54 frames. ], batch size: 159, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 13:01:16,971 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:01:26,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=717564.0, ans=0.0 2023-06-20 13:01:52,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-06-20 13:02:09,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.27 vs. limit=10.0 2023-06-20 13:02:38,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=717744.0, ans=0.0 2023-06-20 13:02:41,116 INFO [train.py:996] (3/4) Epoch 4, batch 28150, loss[loss=0.2593, simple_loss=0.3101, pruned_loss=0.1043, over 21464.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3189, pruned_loss=0.09287, over 4274122.82 frames. ], batch size: 132, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:03:07,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=717864.0, ans=0.125 2023-06-20 13:03:19,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-20 13:03:23,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.192e+02 4.033e+02 4.957e+02 1.192e+03, threshold=8.065e+02, percent-clipped=8.0 2023-06-20 13:04:05,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=718044.0, ans=0.0 2023-06-20 13:04:05,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=718044.0, ans=0.0 2023-06-20 13:04:14,739 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:04:27,742 INFO [train.py:996] (3/4) Epoch 4, batch 28200, loss[loss=0.2922, simple_loss=0.3537, pruned_loss=0.1154, over 21567.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3183, pruned_loss=0.09523, over 4276640.79 frames. ], batch size: 389, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:04:30,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. 
limit=10.0 2023-06-20 13:04:32,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=718104.0, ans=0.0 2023-06-20 13:04:42,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=718164.0, ans=0.125 2023-06-20 13:04:58,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=718164.0, ans=0.125 2023-06-20 13:05:50,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=718344.0, ans=0.125 2023-06-20 13:05:56,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=718344.0, ans=0.125 2023-06-20 13:06:04,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=718344.0, ans=0.0 2023-06-20 13:06:10,219 INFO [train.py:996] (3/4) Epoch 4, batch 28250, loss[loss=0.2267, simple_loss=0.29, pruned_loss=0.08172, over 22016.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3217, pruned_loss=0.09724, over 4269861.08 frames. ], batch size: 103, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:06:42,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2023-06-20 13:06:53,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.131e+02 3.683e+02 4.386e+02 7.452e+02, threshold=7.367e+02, percent-clipped=0.0 2023-06-20 13:07:05,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=718524.0, ans=0.125 2023-06-20 13:07:11,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718524.0, ans=0.1 2023-06-20 13:07:23,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=718584.0, ans=0.125 2023-06-20 13:07:50,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-20 13:07:54,528 INFO [train.py:996] (3/4) Epoch 4, batch 28300, loss[loss=0.2549, simple_loss=0.3486, pruned_loss=0.0806, over 21496.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3211, pruned_loss=0.09547, over 4271450.10 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:07:56,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=718704.0, ans=0.0 2023-06-20 13:09:18,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=718884.0, ans=0.0 2023-06-20 13:09:23,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=718944.0, ans=0.125 2023-06-20 13:09:43,125 INFO [train.py:996] (3/4) Epoch 4, batch 28350, loss[loss=0.2757, simple_loss=0.3264, pruned_loss=0.1125, over 21343.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3168, pruned_loss=0.08945, over 4268339.98 frames. 
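The recurring `optim.py` records (e.g. `Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.131e+02 3.683e+02 4.386e+02 7.452e+02, threshold=7.367e+02, percent-clipped=0.0` above) summarize adaptive gradient clipping: the five numbers read as the min, lower quartile, median, upper quartile and max of recently observed gradient norms, the threshold is consistent with `Clipping_scale` times the median (2.0 × 3.683e+02 ≈ 7.367e+02 here, and the same relation holds in the other records), and `percent-clipped` is the share of recent steps whose norm exceeded it. A sketch of that bookkeeping (illustrative only):

```python
import numpy as np

def clipping_summary(grad_norms, clipping_scale=2.0):
    """Five-number summary of recent gradient norms, a threshold of
    clipping_scale * median, and the percentage above the threshold,
    mirroring the optim.py records (a sketch of the bookkeeping only)."""
    norms = np.asarray(grad_norms, dtype=np.float64)
    quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * quartiles[2]       # 2x the median
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return quartiles, threshold, percent_clipped

norms = np.random.lognormal(mean=5.8, sigma=0.25, size=200)  # synthetic norms
quartiles, threshold, pct = clipping_summary(norms)
print(quartiles, threshold, pct)
```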
], batch size: 507, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:09:43,564 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:10:02,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=719004.0, ans=15.0 2023-06-20 13:10:17,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=719064.0, ans=0.125 2023-06-20 13:10:26,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.689e+02 3.125e+02 3.923e+02 6.563e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-20 13:10:31,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=719124.0, ans=0.0 2023-06-20 13:11:30,282 INFO [train.py:996] (3/4) Epoch 4, batch 28400, loss[loss=0.3076, simple_loss=0.3506, pruned_loss=0.1324, over 21604.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3133, pruned_loss=0.08967, over 4268447.69 frames. ], batch size: 441, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:11:57,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719364.0, ans=0.125 2023-06-20 13:12:37,278 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:12:43,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=719484.0, ans=0.2 2023-06-20 13:13:07,862 INFO [train.py:996] (3/4) Epoch 4, batch 28450, loss[loss=0.396, simple_loss=0.4283, pruned_loss=0.1818, over 21500.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3192, pruned_loss=0.09419, over 4263179.18 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:13:50,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.987e+02 3.410e+02 4.090e+02 6.526e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-20 13:14:19,020 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:14:19,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-20 13:14:40,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=719844.0, ans=0.125 2023-06-20 13:14:45,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=719844.0, ans=0.125 2023-06-20 13:14:50,122 INFO [train.py:996] (3/4) Epoch 4, batch 28500, loss[loss=0.27, simple_loss=0.3339, pruned_loss=0.103, over 21774.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3235, pruned_loss=0.09808, over 4275533.26 frames. 
], batch size: 112, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:14:50,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=719904.0, ans=0.1 2023-06-20 13:14:52,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=719904.0, ans=0.125 2023-06-20 13:15:09,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=719904.0, ans=0.2 2023-06-20 13:15:10,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=719964.0, ans=0.125 2023-06-20 13:15:10,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=719964.0, ans=0.125 2023-06-20 13:15:13,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=719964.0, ans=0.0 2023-06-20 13:15:20,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=719964.0, ans=0.125 2023-06-20 13:15:56,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=720084.0, ans=15.0 2023-06-20 13:16:41,750 INFO [train.py:996] (3/4) Epoch 4, batch 28550, loss[loss=0.2807, simple_loss=0.3801, pruned_loss=0.09067, over 21229.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3327, pruned_loss=0.1018, over 4282937.94 frames. ], batch size: 548, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:16:52,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=720204.0, ans=0.125 2023-06-20 13:17:03,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=720264.0, ans=0.0 2023-06-20 13:17:11,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=720264.0, ans=0.04949747468305833 2023-06-20 13:17:20,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.197e+02 3.692e+02 4.478e+02 6.914e+02, threshold=7.384e+02, percent-clipped=1.0 2023-06-20 13:17:27,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=720324.0, ans=0.2 2023-06-20 13:18:25,248 INFO [train.py:996] (3/4) Epoch 4, batch 28600, loss[loss=0.3051, simple_loss=0.3641, pruned_loss=0.1231, over 21951.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3392, pruned_loss=0.1032, over 4285790.29 frames. ], batch size: 373, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:18:50,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=720564.0, ans=0.95 2023-06-20 13:19:26,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=720624.0, ans=0.0 2023-06-20 13:19:43,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=720744.0, ans=0.125 2023-06-20 13:20:07,378 INFO [train.py:996] (3/4) Epoch 4, batch 28650, loss[loss=0.2607, simple_loss=0.3117, pruned_loss=0.1048, over 21263.00 frames. 
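Most lines in this stream are `ScheduledFloat` prints from `scaling.py`: regularization hyperparameters (skip rates, dropout probabilities, balancer probabilities, whitening limits) are functions of `batch_count` rather than constants, typically strong early in training and relaxed later — hence the many `skip_rate ..., ans=0.0` records by batch ~720k, while the `dropout_p` lines still read `ans=0.1`. A piecewise-linear schedule in that spirit (a sketch; the real class carries more machinery):

```python
class PiecewiseLinearSchedule:
    """Value as a piecewise-linear function of batch_count, clamped to the
    endpoint values outside the breakpoint range (a ScheduledFloat sketch)."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; sorted by batch_count.
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# e.g. a skip rate that starts at 0.5 and decays to 0.0 by batch 20000,
# matching the many "skip_rate ..., ans=0.0" records late in training.
skip_rate = PiecewiseLinearSchedule((0, 0.5), (4000, 0.25), (20000, 0.0))
print(skip_rate(716304.0))  # -> 0.0
```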
], tot_loss[loss=0.2695, simple_loss=0.3339, pruned_loss=0.1026, over 4272907.47 frames. ], batch size: 549, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:20:46,266 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.893e+02 3.240e+02 3.665e+02 6.143e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-20 13:21:08,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-20 13:21:22,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=720984.0, ans=0.09899494936611666 2023-06-20 13:21:37,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-20 13:21:48,691 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:21:50,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721104.0, ans=0.1 2023-06-20 13:21:51,586 INFO [train.py:996] (3/4) Epoch 4, batch 28700, loss[loss=0.2919, simple_loss=0.3385, pruned_loss=0.1227, over 21423.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3329, pruned_loss=0.1032, over 4272740.02 frames. ], batch size: 211, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:21:53,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=721104.0, ans=0.125 2023-06-20 13:21:58,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.29 vs. limit=6.0 2023-06-20 13:22:19,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-20 13:22:27,603 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:23:24,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721344.0, ans=0.1 2023-06-20 13:23:32,185 INFO [train.py:996] (3/4) Epoch 4, batch 28750, loss[loss=0.2561, simple_loss=0.3275, pruned_loss=0.09233, over 21793.00 frames. ], tot_loss[loss=0.269, simple_loss=0.332, pruned_loss=0.103, over 4279776.30 frames. ], batch size: 298, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:23:59,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-20 13:24:15,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.193e+02 3.841e+02 4.687e+02 9.363e+02, threshold=7.682e+02, percent-clipped=10.0 2023-06-20 13:24:17,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=721524.0, ans=0.125 2023-06-20 13:24:44,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=721584.0, ans=0.125 2023-06-20 13:24:56,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721644.0, ans=0.1 2023-06-20 13:25:15,131 INFO [train.py:996] (3/4) Epoch 4, batch 28800, loss[loss=0.2738, simple_loss=0.3481, pruned_loss=0.09971, over 21592.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3353, pruned_loss=0.1021, over 4282385.27 frames. ], batch size: 389, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:25:47,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=721764.0, ans=0.0 2023-06-20 13:25:56,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=721824.0, ans=0.2 2023-06-20 13:26:21,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=721884.0, ans=0.125 2023-06-20 13:26:38,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=721944.0, ans=0.125 2023-06-20 13:26:55,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. limit=10.0 2023-06-20 13:26:58,345 INFO [train.py:996] (3/4) Epoch 4, batch 28850, loss[loss=0.2461, simple_loss=0.3065, pruned_loss=0.09286, over 21837.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3353, pruned_loss=0.1033, over 4287939.24 frames. ], batch size: 247, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:27:33,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=722064.0, ans=0.0 2023-06-20 13:27:34,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-20 13:27:36,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.137e+02 3.550e+02 4.295e+02 6.856e+02, threshold=7.100e+02, percent-clipped=0.0 2023-06-20 13:28:06,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=722184.0, ans=0.125 2023-06-20 13:28:35,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=722304.0, ans=0.2 2023-06-20 13:28:42,245 INFO [train.py:996] (3/4) Epoch 4, batch 28900, loss[loss=0.3303, simple_loss=0.4333, pruned_loss=0.1137, over 19920.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3385, pruned_loss=0.1052, over 4287696.86 frames. 
], batch size: 702, lr: 7.50e-03, grad_scale: 32.0 2023-06-20 13:28:44,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=722304.0, ans=0.125 2023-06-20 13:28:49,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=722304.0, ans=0.125 2023-06-20 13:29:11,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=722364.0, ans=0.125 2023-06-20 13:29:11,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=22.5 2023-06-20 13:29:12,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=722364.0, ans=0.0 2023-06-20 13:29:20,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-20 13:29:36,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=722424.0, ans=0.125 2023-06-20 13:29:53,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=722484.0, ans=0.125 2023-06-20 13:30:30,820 INFO [train.py:996] (3/4) Epoch 4, batch 28950, loss[loss=0.3617, simple_loss=0.419, pruned_loss=0.1522, over 21531.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3392, pruned_loss=0.1045, over 4275004.33 frames. ], batch size: 508, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:30:34,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=722604.0, ans=0.0 2023-06-20 13:30:46,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=722664.0, ans=0.125 2023-06-20 13:31:08,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=722724.0, ans=0.0 2023-06-20 13:31:11,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.092e+02 3.646e+02 4.374e+02 7.156e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-20 13:31:21,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=722724.0, ans=0.0 2023-06-20 13:32:13,700 INFO [train.py:996] (3/4) Epoch 4, batch 29000, loss[loss=0.2779, simple_loss=0.3553, pruned_loss=0.1002, over 21404.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3428, pruned_loss=0.1033, over 4270469.33 frames. ], batch size: 131, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:32:39,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722964.0, ans=0.1 2023-06-20 13:33:07,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=723024.0, ans=0.125 2023-06-20 13:33:56,570 INFO [train.py:996] (3/4) Epoch 4, batch 29050, loss[loss=0.2524, simple_loss=0.3219, pruned_loss=0.09147, over 20111.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3416, pruned_loss=0.1042, over 4272675.73 frames. 
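The `Whitening: name=..., metric=X vs. limit=Y` records fire when a layer's activations are checked against a scheduled whitening limit: the module estimates the channel covariance and a scale-invariant statistic that equals 1.0 when the covariance is isotropic ("white") and grows as a few directions dominate, intervening only when the metric exceeds the limit. One plausible form of such a metric — an illustrative assumption, not necessarily the exact expression in `scaling.py`:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels). Returns mean(eig^2) / mean(eig)^2 of
    the channel covariance: 1.0 iff the covariance is a multiple of the
    identity, larger the less 'white' the activations are (illustrative)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]      # (C, C) covariance estimate
    c = cov.shape[0]
    # (cov * cov).sum() is the sum of squared eigenvalues (Frobenius norm^2);
    # cov.trace() is the sum of eigenvalues.
    return c * (cov * cov).sum() / cov.trace() ** 2

white = torch.randn(10000, 256)
print(whitening_metric(white))                  # close to 1
spiky = white * torch.linspace(0.1, 3.0, 256)   # a few dominant channels
print(whitening_metric(spiky))                  # noticeably larger
```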
], batch size: 702, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:34:15,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-20 13:34:40,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.777e+02 3.114e+02 3.614e+02 7.723e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 13:34:40,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=723324.0, ans=0.125 2023-06-20 13:34:50,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=723324.0, ans=12.0 2023-06-20 13:35:37,366 INFO [train.py:996] (3/4) Epoch 4, batch 29100, loss[loss=0.2462, simple_loss=0.2963, pruned_loss=0.098, over 21697.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3318, pruned_loss=0.1016, over 4276254.48 frames. ], batch size: 316, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:36:16,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=723564.0, ans=0.125 2023-06-20 13:36:30,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=723624.0, ans=0.125 2023-06-20 13:37:19,147 INFO [train.py:996] (3/4) Epoch 4, batch 29150, loss[loss=0.2736, simple_loss=0.3523, pruned_loss=0.09744, over 21635.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3302, pruned_loss=0.0992, over 4275358.57 frames. ], batch size: 263, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:38:03,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.983e+02 3.477e+02 4.696e+02 8.127e+02, threshold=6.954e+02, percent-clipped=11.0 2023-06-20 13:38:21,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=723984.0, ans=0.0 2023-06-20 13:39:00,562 INFO [train.py:996] (3/4) Epoch 4, batch 29200, loss[loss=0.2352, simple_loss=0.2884, pruned_loss=0.09103, over 21407.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3263, pruned_loss=0.09844, over 4274933.00 frames. ], batch size: 194, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:39:16,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724104.0, ans=0.1 2023-06-20 13:39:41,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=724164.0, ans=0.2 2023-06-20 13:39:44,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=724224.0, ans=0.125 2023-06-20 13:39:54,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=724224.0, ans=0.125 2023-06-20 13:40:19,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.36 vs. 
limit=15.0 2023-06-20 13:40:25,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=724344.0, ans=0.0 2023-06-20 13:40:45,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=724344.0, ans=0.0 2023-06-20 13:40:48,581 INFO [train.py:996] (3/4) Epoch 4, batch 29250, loss[loss=0.2325, simple_loss=0.2962, pruned_loss=0.08439, over 21794.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.325, pruned_loss=0.09667, over 4274637.13 frames. ], batch size: 98, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:41:32,102 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.828e+02 3.224e+02 4.215e+02 8.591e+02, threshold=6.449e+02, percent-clipped=2.0 2023-06-20 13:41:34,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=724524.0, ans=0.125 2023-06-20 13:41:59,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 13:42:05,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=724584.0, ans=0.0 2023-06-20 13:42:12,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=724644.0, ans=0.0 2023-06-20 13:42:29,504 INFO [train.py:996] (3/4) Epoch 4, batch 29300, loss[loss=0.2384, simple_loss=0.2957, pruned_loss=0.09055, over 21564.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3272, pruned_loss=0.09658, over 4269050.48 frames. ], batch size: 231, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:42:58,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=724764.0, ans=0.125 2023-06-20 13:43:00,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=724764.0, ans=0.125 2023-06-20 13:43:44,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=724884.0, ans=0.125 2023-06-20 13:44:16,756 INFO [train.py:996] (3/4) Epoch 4, batch 29350, loss[loss=0.2484, simple_loss=0.3096, pruned_loss=0.09364, over 21128.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3238, pruned_loss=0.09487, over 4258087.46 frames. ], batch size: 143, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:44:32,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=725064.0, ans=0.125 2023-06-20 13:44:48,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=725064.0, ans=0.125 2023-06-20 13:45:02,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.825e+02 3.145e+02 3.740e+02 7.269e+02, threshold=6.289e+02, percent-clipped=1.0 2023-06-20 13:45:37,767 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:46:00,210 INFO [train.py:996] (3/4) Epoch 4, batch 29400, loss[loss=0.2406, simple_loss=0.3189, pruned_loss=0.08114, over 21731.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3213, pruned_loss=0.09203, over 4257399.80 frames. 
], batch size: 351, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:46:25,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=725364.0, ans=0.125 2023-06-20 13:46:34,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=725364.0, ans=0.2 2023-06-20 13:47:42,425 INFO [train.py:996] (3/4) Epoch 4, batch 29450, loss[loss=0.3029, simple_loss=0.3686, pruned_loss=0.1186, over 21370.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.321, pruned_loss=0.09126, over 4266032.78 frames. ], batch size: 549, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:48:19,460 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:48:28,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 3.113e+02 3.518e+02 4.347e+02 7.926e+02, threshold=7.036e+02, percent-clipped=6.0 2023-06-20 13:48:43,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=725784.0, ans=0.125 2023-06-20 13:48:44,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-20 13:49:24,084 INFO [train.py:996] (3/4) Epoch 4, batch 29500, loss[loss=0.2409, simple_loss=0.3001, pruned_loss=0.09081, over 21789.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3244, pruned_loss=0.09412, over 4272211.43 frames. ], batch size: 247, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:49:40,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=725904.0, ans=0.2 2023-06-20 13:49:48,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=725964.0, ans=0.05 2023-06-20 13:50:58,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=726144.0, ans=0.0 2023-06-20 13:51:04,347 INFO [train.py:996] (3/4) Epoch 4, batch 29550, loss[loss=0.2585, simple_loss=0.3175, pruned_loss=0.09977, over 21461.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3246, pruned_loss=0.09658, over 4277851.35 frames. ], batch size: 144, lr: 7.48e-03, grad_scale: 16.0 2023-06-20 13:51:50,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 2.813e+02 3.221e+02 3.823e+02 6.296e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 13:52:51,266 INFO [train.py:996] (3/4) Epoch 4, batch 29600, loss[loss=0.2846, simple_loss=0.3647, pruned_loss=0.1023, over 21802.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3313, pruned_loss=0.0992, over 4282210.43 frames. ], batch size: 282, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:54:05,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=726684.0, ans=0.1 2023-06-20 13:54:11,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=726744.0, ans=0.0 2023-06-20 13:54:26,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726804.0, ans=0.1 2023-06-20 13:54:33,045 INFO [train.py:996] (3/4) Epoch 4, batch 29650, loss[loss=0.2473, simple_loss=0.3051, pruned_loss=0.09478, over 21451.00 frames. 
], tot_loss[loss=0.2648, simple_loss=0.3344, pruned_loss=0.0976, over 4279426.96 frames. ], batch size: 144, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:55:05,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726864.0, ans=0.1 2023-06-20 13:55:13,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.008e+02 3.849e+02 5.213e+02 1.335e+03, threshold=7.697e+02, percent-clipped=14.0 2023-06-20 13:56:14,494 INFO [train.py:996] (3/4) Epoch 4, batch 29700, loss[loss=0.2742, simple_loss=0.359, pruned_loss=0.09468, over 21311.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3337, pruned_loss=0.09715, over 4279280.36 frames. ], batch size: 159, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:56:31,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=727104.0, ans=0.05 2023-06-20 13:56:33,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=727104.0, ans=0.125 2023-06-20 13:57:56,340 INFO [train.py:996] (3/4) Epoch 4, batch 29750, loss[loss=0.241, simple_loss=0.328, pruned_loss=0.07694, over 21700.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3377, pruned_loss=0.0968, over 4276513.08 frames. ], batch size: 263, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:58:11,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727404.0, ans=0.1 2023-06-20 13:58:22,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=727464.0, ans=0.125 2023-06-20 13:58:36,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.793e+02 3.232e+02 3.849e+02 7.208e+02, threshold=6.464e+02, percent-clipped=0.0 2023-06-20 13:59:06,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=727584.0, ans=0.2 2023-06-20 13:59:36,815 INFO [train.py:996] (3/4) Epoch 4, batch 29800, loss[loss=0.2757, simple_loss=0.3331, pruned_loss=0.1091, over 21808.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3385, pruned_loss=0.09801, over 4282083.21 frames. ], batch size: 389, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:59:53,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=727704.0, ans=0.2 2023-06-20 14:00:00,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=727764.0, ans=0.1 2023-06-20 14:00:12,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=727764.0, ans=0.2 2023-06-20 14:00:27,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-20 14:00:52,390 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:00:55,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=727884.0, ans=0.125 2023-06-20 14:01:23,933 INFO [train.py:996] (3/4) Epoch 4, batch 29850, loss[loss=0.2158, simple_loss=0.2934, pruned_loss=0.06913, over 21430.00 frames. 
], tot_loss[loss=0.2626, simple_loss=0.3345, pruned_loss=0.09536, over 4274341.81 frames. ], batch size: 131, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:02:05,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.765e+02 3.288e+02 3.804e+02 6.956e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 14:02:50,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2023-06-20 14:03:06,607 INFO [train.py:996] (3/4) Epoch 4, batch 29900, loss[loss=0.211, simple_loss=0.2898, pruned_loss=0.06605, over 21238.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3326, pruned_loss=0.0967, over 4288461.66 frames. ], batch size: 176, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:03:25,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=728364.0, ans=0.1 2023-06-20 14:04:50,341 INFO [train.py:996] (3/4) Epoch 4, batch 29950, loss[loss=0.2802, simple_loss=0.34, pruned_loss=0.1102, over 21739.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3353, pruned_loss=0.1004, over 4290351.65 frames. ], batch size: 332, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:04:57,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=728604.0, ans=0.07 2023-06-20 14:05:11,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-20 14:05:15,542 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:05:36,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.024e+02 3.410e+02 4.010e+02 6.604e+02, threshold=6.821e+02, percent-clipped=1.0 2023-06-20 14:06:33,129 INFO [train.py:996] (3/4) Epoch 4, batch 30000, loss[loss=0.2419, simple_loss=0.3303, pruned_loss=0.07673, over 21640.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3367, pruned_loss=0.09997, over 4283784.66 frames. ], batch size: 230, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:06:33,129 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 14:06:55,177 INFO [train.py:1028] (3/4) Epoch 4, validation: loss=0.2513, simple_loss=0.3514, pruned_loss=0.07557, over 1796401.00 frames. 2023-06-20 14:06:55,178 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 14:07:19,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-20 14:07:56,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=729024.0, ans=0.04949747468305833 2023-06-20 14:08:07,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=729084.0, ans=0.0 2023-06-20 14:08:48,646 INFO [train.py:996] (3/4) Epoch 4, batch 30050, loss[loss=0.2688, simple_loss=0.3602, pruned_loss=0.08866, over 21773.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3388, pruned_loss=0.09593, over 4276347.05 frames. 
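At batch 30000 above, training pauses for a validation pass: the whole dev set (the same 1796401.00 frames each time it appears in this log) is scored with gradients disabled, and the peak GPU memory PyTorch has allocated so far is reported afterwards via `torch.cuda.max_memory_allocated`. A simplified sketch of that interlude, with `compute_loss` a hypothetical helper standing in for the recipe's loss computation:

```python
import torch

def validate(model, valid_loader, device):
    """Run a validation pass like the 'Computing validation loss' records:
    frame-weighted average loss over the dev set, then the peak memory
    PyTorch has allocated on this device so far (a simplified sketch)."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch, device)  # hypothetical
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"Maximum memory allocated so far is {peak_mb}MB")
```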
], batch size: 316, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:09:06,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=729204.0, ans=0.025 2023-06-20 14:09:25,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=729264.0, ans=0.125 2023-06-20 14:09:34,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.668e+02 3.143e+02 3.876e+02 8.051e+02, threshold=6.286e+02, percent-clipped=2.0 2023-06-20 14:09:34,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=729324.0, ans=0.0 2023-06-20 14:09:42,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=729324.0, ans=0.0 2023-06-20 14:09:45,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.66 vs. limit=6.0 2023-06-20 14:09:53,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=729384.0, ans=0.125 2023-06-20 14:10:11,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=729384.0, ans=0.125 2023-06-20 14:10:16,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=729444.0, ans=0.0 2023-06-20 14:10:30,481 INFO [train.py:996] (3/4) Epoch 4, batch 30100, loss[loss=0.3089, simple_loss=0.3357, pruned_loss=0.1411, over 21292.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3381, pruned_loss=0.0961, over 4272120.75 frames. ], batch size: 507, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:11:13,746 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:11:17,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=729624.0, ans=0.125 2023-06-20 14:11:53,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-06-20 14:12:03,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.66 vs. limit=15.0 2023-06-20 14:12:13,063 INFO [train.py:996] (3/4) Epoch 4, batch 30150, loss[loss=0.2541, simple_loss=0.3201, pruned_loss=0.09408, over 21646.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3369, pruned_loss=0.09783, over 4266793.05 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:12:13,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=729804.0, ans=0.2 2023-06-20 14:13:05,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-20 14:13:05,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-20 14:13:07,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.286e+02 3.695e+02 4.286e+02 6.850e+02, threshold=7.389e+02, percent-clipped=1.0 2023-06-20 14:13:34,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=22.5 2023-06-20 14:13:43,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=730044.0, ans=0.125 2023-06-20 14:13:45,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=730044.0, ans=0.125 2023-06-20 14:14:05,641 INFO [train.py:996] (3/4) Epoch 4, batch 30200, loss[loss=0.2436, simple_loss=0.3101, pruned_loss=0.08856, over 21362.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3377, pruned_loss=0.09578, over 4266305.77 frames. ], batch size: 549, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:14:06,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=730104.0, ans=0.125 2023-06-20 14:15:09,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=730284.0, ans=0.125 2023-06-20 14:15:13,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-20 14:15:39,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=730344.0, ans=0.125 2023-06-20 14:15:55,837 INFO [train.py:996] (3/4) Epoch 4, batch 30250, loss[loss=0.2359, simple_loss=0.2982, pruned_loss=0.0868, over 21905.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3433, pruned_loss=0.09802, over 4258733.12 frames. ], batch size: 98, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:16:01,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=730404.0, ans=0.125 2023-06-20 14:16:06,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-20 14:16:09,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=730404.0, ans=0.0 2023-06-20 14:16:14,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-20 14:16:23,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=730464.0, ans=0.125 2023-06-20 14:16:29,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=730464.0, ans=0.2 2023-06-20 14:16:33,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-20 14:16:35,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.958e+02 3.586e+02 4.420e+02 6.930e+02, threshold=7.173e+02, percent-clipped=0.0 2023-06-20 14:17:37,291 INFO [train.py:996] (3/4) Epoch 4, batch 30300, loss[loss=0.201, simple_loss=0.265, pruned_loss=0.06854, over 21113.00 frames. 
], tot_loss[loss=0.2683, simple_loss=0.3405, pruned_loss=0.09805, over 4262198.89 frames. ], batch size: 176, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:18:51,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=730884.0, ans=0.125 2023-06-20 14:19:16,741 INFO [train.py:996] (3/4) Epoch 4, batch 30350, loss[loss=0.2645, simple_loss=0.3312, pruned_loss=0.09886, over 21759.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3429, pruned_loss=0.1002, over 4269341.09 frames. ], batch size: 282, lr: 7.46e-03, grad_scale: 16.0 2023-06-20 14:19:35,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=731064.0, ans=0.025 2023-06-20 14:19:43,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=731064.0, ans=0.1 2023-06-20 14:19:47,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=731064.0, ans=0.2 2023-06-20 14:19:51,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=731124.0, ans=0.0 2023-06-20 14:19:55,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.242e+02 3.707e+02 4.467e+02 6.879e+02, threshold=7.414e+02, percent-clipped=0.0 2023-06-20 14:20:10,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731184.0, ans=0.1 2023-06-20 14:20:25,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=731244.0, ans=0.0 2023-06-20 14:20:33,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-20 14:20:43,872 INFO [train.py:996] (3/4) Epoch 4, batch 30400, loss[loss=0.2874, simple_loss=0.3433, pruned_loss=0.1158, over 20028.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3348, pruned_loss=0.09761, over 4261176.10 frames. ], batch size: 703, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:22:05,337 INFO [train.py:996] (3/4) Epoch 4, batch 30450, loss[loss=0.3743, simple_loss=0.4761, pruned_loss=0.1362, over 19787.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3373, pruned_loss=0.09773, over 4202920.98 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:22:43,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.992e+02 5.785e+02 8.199e+02 3.035e+03, threshold=1.157e+03, percent-clipped=30.0 2023-06-20 14:22:47,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=731724.0, ans=0.125 2023-06-20 14:23:00,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=731784.0, ans=0.0 2023-06-20 14:25:02,448 INFO [train.py:996] (3/4) Epoch 5, batch 0, loss[loss=0.2858, simple_loss=0.3426, pruned_loss=0.1145, over 21717.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3426, pruned_loss=0.1145, over 21717.00 frames. ], batch size: 124, lr: 6.61e-03, grad_scale: 32.0 2023-06-20 14:25:02,449 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 14:25:18,275 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2519, simple_loss=0.3587, pruned_loss=0.07257, over 1796401.00 frames. 
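The learning rate printed at `Epoch 5, batch 0` above (6.61e-03) sits well below the 7.46e-03 of late epoch 4 even though the batch index keeps counting up across the boundary: the Eden schedule used by these recipes decays the base rate in both the batch index and the epoch, so crossing an epoch boundary shrinks the rate by a discrete epoch-dependent factor on top of the smooth per-batch decay. A sketch of the Eden formula (the `lr_batches`/`lr_epochs` defaults below are assumptions):

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Eden schedule: ~base_lr early, decaying roughly as
    batch**-0.5 * epoch**-0.5 asymptotically (a sketch; defaults assumed)."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Going from epoch 4 to epoch 5 with the batch index unchanged shrinks the
# epoch factor by ((25 + 2.25) / (16 + 2.25)) ** -0.25 ~= 0.905, which is
# consistent with the logged drop 6.61e-03 / 7.46e-03 ~= 0.886 once the
# batch factor's continued decay is included.
print(eden_lr(0.045, batch=100_000, epoch=5.0))  # same order as the logged lr
```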
2023-06-20 14:25:18,276 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 14:26:03,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-20 14:26:16,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=732054.0, ans=0.035 2023-06-20 14:26:25,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=732054.0, ans=0.2 2023-06-20 14:26:26,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=732054.0, ans=0.125 2023-06-20 14:26:38,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732114.0, ans=0.1 2023-06-20 14:26:40,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-20 14:26:45,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=732114.0, ans=0.0 2023-06-20 14:26:54,767 INFO [train.py:996] (3/4) Epoch 5, batch 50, loss[loss=0.3267, simple_loss=0.4034, pruned_loss=0.125, over 21643.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3523, pruned_loss=0.102, over 963703.95 frames. ], batch size: 389, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:26:58,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=732174.0, ans=0.125 2023-06-20 14:27:14,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=732234.0, ans=0.125 2023-06-20 14:27:53,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.548e+02 3.294e+02 4.063e+02 6.432e+02 1.595e+03, threshold=8.127e+02, percent-clipped=6.0 2023-06-20 14:27:57,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=732354.0, ans=0.125 2023-06-20 14:28:24,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=732414.0, ans=0.125 2023-06-20 14:28:32,480 INFO [train.py:996] (3/4) Epoch 5, batch 100, loss[loss=0.2742, simple_loss=0.3506, pruned_loss=0.09887, over 21475.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3611, pruned_loss=0.1032, over 1693366.14 frames. 
], batch size: 131, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:28:47,120 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:28:48,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=732474.0, ans=0.2 2023-06-20 14:29:25,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=732594.0, ans=0.2 2023-06-20 14:29:36,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732654.0, ans=0.1 2023-06-20 14:29:48,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=732654.0, ans=0.09899494936611666 2023-06-20 14:30:00,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=732714.0, ans=0.2 2023-06-20 14:30:09,602 INFO [train.py:996] (3/4) Epoch 5, batch 150, loss[loss=0.3067, simple_loss=0.3906, pruned_loss=0.1114, over 21588.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3608, pruned_loss=0.102, over 2261966.79 frames. ], batch size: 441, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:30:13,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=732774.0, ans=10.0 2023-06-20 14:31:01,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=732894.0, ans=0.0 2023-06-20 14:31:07,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.823e+02 3.207e+02 3.913e+02 7.422e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-20 14:31:27,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=732954.0, ans=0.0 2023-06-20 14:31:31,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=732954.0, ans=0.2 2023-06-20 14:31:50,621 INFO [train.py:996] (3/4) Epoch 5, batch 200, loss[loss=0.3313, simple_loss=0.3912, pruned_loss=0.1357, over 21597.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3542, pruned_loss=0.09941, over 2708664.69 frames. ], batch size: 414, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:32:39,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=733194.0, ans=0.0 2023-06-20 14:33:24,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=733314.0, ans=0.125 2023-06-20 14:33:27,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733314.0, ans=0.125 2023-06-20 14:33:32,177 INFO [train.py:996] (3/4) Epoch 5, batch 250, loss[loss=0.2761, simple_loss=0.358, pruned_loss=0.09712, over 21813.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3493, pruned_loss=0.09921, over 3052226.01 frames. 
], batch size: 332, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:34:03,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=733434.0, ans=0.125 2023-06-20 14:34:15,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=733494.0, ans=0.2 2023-06-20 14:34:30,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.953e+02 3.443e+02 4.116e+02 7.444e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-20 14:35:14,582 INFO [train.py:996] (3/4) Epoch 5, batch 300, loss[loss=0.2603, simple_loss=0.3783, pruned_loss=0.07118, over 19883.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3422, pruned_loss=0.09739, over 3310607.76 frames. ], batch size: 703, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:36:51,862 INFO [train.py:996] (3/4) Epoch 5, batch 350, loss[loss=0.2161, simple_loss=0.2725, pruned_loss=0.07985, over 21195.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3327, pruned_loss=0.09533, over 3523643.62 frames. ], batch size: 176, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:37:04,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.60 vs. limit=10.0 2023-06-20 14:37:45,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734094.0, ans=0.1 2023-06-20 14:37:48,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=734094.0, ans=0.125 2023-06-20 14:37:49,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-20 14:37:51,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.828e+02 3.218e+02 3.899e+02 6.662e+02, threshold=6.437e+02, percent-clipped=0.0 2023-06-20 14:38:33,738 INFO [train.py:996] (3/4) Epoch 5, batch 400, loss[loss=0.2318, simple_loss=0.2874, pruned_loss=0.08805, over 21681.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3279, pruned_loss=0.09381, over 3690262.74 frames. ], batch size: 282, lr: 6.59e-03, grad_scale: 32.0 2023-06-20 14:39:14,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=734394.0, ans=0.0 2023-06-20 14:40:10,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=734514.0, ans=0.125 2023-06-20 14:40:11,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-20 14:40:15,213 INFO [train.py:996] (3/4) Epoch 5, batch 450, loss[loss=0.1935, simple_loss=0.2569, pruned_loss=0.065, over 21250.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3285, pruned_loss=0.09402, over 3828235.87 frames. 
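The `grad_scale` field in the batch records (drifting between 16.0 and 32.0 in this stretch) is the dynamic loss scale of fp16 mixed-precision training: the loss is scaled up before backward so small gradients survive in half precision, the scale is halved whenever inf/nan gradients are detected, and grown again after a long run of clean steps, which is why it moves between powers of two across the log. The standard PyTorch pattern looks like this (a generic sketch, not the recipe's exact loop; `compute_loss` is a hypothetical stand-in):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # manages the dynamic grad_scale

def train_step(model, optimizer, batch, device):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward in fp16/fp32 mix
        loss = compute_loss(model, batch, device)  # hypothetical helper
    scaler.scale(loss).backward()      # scale up to avoid fp16 underflow
    scaler.step(optimizer)             # unscales; skips the step on inf/nan
    scaler.update()                    # grow or shrink the scale
    return loss.detach(), scaler.get_scale()  # the value logged as grad_scale
```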
], batch size: 144, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:40:15,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734574.0, ans=0.1 2023-06-20 14:40:37,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=734634.0, ans=0.05 2023-06-20 14:40:48,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-20 14:41:08,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=734694.0, ans=0.0 2023-06-20 14:41:09,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=734694.0, ans=0.1 2023-06-20 14:41:20,213 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.914e+02 3.934e+02 5.626e+02 1.302e+03, threshold=7.868e+02, percent-clipped=18.0 2023-06-20 14:41:23,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=734754.0, ans=0.125 2023-06-20 14:41:55,811 INFO [train.py:996] (3/4) Epoch 5, batch 500, loss[loss=0.2628, simple_loss=0.3123, pruned_loss=0.1066, over 21708.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3263, pruned_loss=0.09352, over 3930797.71 frames. ], batch size: 112, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:42:06,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-06-20 14:42:07,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=734874.0, ans=0.0 2023-06-20 14:42:15,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734934.0, ans=0.1 2023-06-20 14:42:17,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-20 14:42:23,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=734934.0, ans=0.07 2023-06-20 14:42:42,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-20 14:43:04,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2023-06-20 14:43:18,154 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:43:37,132 INFO [train.py:996] (3/4) Epoch 5, batch 550, loss[loss=0.2449, simple_loss=0.3613, pruned_loss=0.06425, over 21209.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3282, pruned_loss=0.09287, over 4002667.15 frames. 
2023-06-20 14:44:36,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.996e+02 3.563e+02 4.226e+02 6.619e+02, threshold=7.127e+02, percent-clipped=0.0 2023-06-20 14:44:43,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735354.0, ans=0.125 2023-06-20 14:44:45,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=735354.0, ans=0.5 2023-06-20 14:45:14,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=735414.0, ans=0.125 2023-06-20 14:45:16,853 INFO [train.py:996] (3/4) Epoch 5, batch 600, loss[loss=0.2355, simple_loss=0.297, pruned_loss=0.08706, over 21746.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3337, pruned_loss=0.09476, over 4056979.49 frames. ], batch size: 371, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:45:50,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735534.0, ans=0.1 2023-06-20 14:46:47,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-20 14:46:51,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=735714.0, ans=0.0 2023-06-20 14:46:57,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=735774.0, ans=0.2 2023-06-20 14:46:59,122 INFO [train.py:996] (3/4) Epoch 5, batch 650, loss[loss=0.2654, simple_loss=0.3282, pruned_loss=0.1013, over 21844.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.336, pruned_loss=0.09427, over 4107007.12 frames. ], batch size: 371, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:47:03,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=735774.0, ans=0.125 2023-06-20 14:47:17,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=735774.0, ans=0.0 2023-06-20 14:47:24,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=735834.0, ans=0.2 2023-06-20 14:47:30,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0
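
The [scaling.py:962] records compare a per-module whitening metric against a limit. The metric summarizes how anisotropic the covariance of a module's output is: about 1.0 for perfectly "white" (isotropic) features, approaching the per-group channel count when activations collapse onto a single direction, and the Whiten module only pushes back on the features when the metric drifts above its limit. The function below is one illustrative way to compute such a metric; it is an assumption about the shape of the computation, not the exact formula in icefall's scaling.py.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """Anisotropy of the covariance of x: ~1.0 when the covariance is a
    multiple of the identity, up to channels-per-group when it is rank-1.
    Illustrative formula; icefall's Whiten module may differ in detail."""
    x = x.reshape(-1, x.shape[-1])
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups  # channels per group
    x = x.reshape(num_frames, num_groups, cpg).permute(1, 0, 2)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / num_frames             # (groups, cpg, cpg)
    num = (cov * cov).sum(dim=(1, 2))                    # trace(C @ C)
    den = cov.diagonal(dim1=1, dim2=2).sum(dim=1) ** 2   # trace(C) ** 2
    return (cpg * num / den).mean()


# White noise scores near 1.0, well under a limit like 15.0; perfectly
# correlated channels score near the channel count.
print(whitening_metric(torch.randn(1000, 256)))               # ~1.0
print(whitening_metric(torch.randn(1000, 1).repeat(1, 256)))  # ~256.0
```
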
2023-06-20 14:47:46,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=735894.0, ans=0.0 2023-06-20 14:47:58,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.943e+02 3.470e+02 4.276e+02 7.197e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-20 14:48:12,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=735954.0, ans=0.125 2023-06-20 14:48:34,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=736014.0, ans=0.125 2023-06-20 14:48:34,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736014.0, ans=0.125 2023-06-20 14:48:40,842 INFO [train.py:996] (3/4) Epoch 5, batch 700, loss[loss=0.2429, simple_loss=0.2984, pruned_loss=0.09367, over 21676.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3364, pruned_loss=0.09456, over 4150059.09 frames. ], batch size: 230, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:48:46,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=736074.0, ans=0.0 2023-06-20 14:48:49,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-20 14:49:23,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=736194.0, ans=0.0 2023-06-20 14:49:57,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=736254.0, ans=0.2 2023-06-20 14:50:13,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=736314.0, ans=0.0 2023-06-20 14:50:21,191 INFO [train.py:996] (3/4) Epoch 5, batch 750, loss[loss=0.33, simple_loss=0.3659, pruned_loss=0.147, over 21822.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3346, pruned_loss=0.09498, over 4170190.04 frames. ], batch size: 508, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:50:29,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=736374.0, ans=10.0 2023-06-20 14:50:54,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=736434.0, ans=0.125 2023-06-20 14:51:22,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 2.979e+02 3.412e+02 4.334e+02 7.194e+02, threshold=6.824e+02, percent-clipped=1.0 2023-06-20 14:51:51,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=736614.0, ans=0.025 2023-06-20 14:52:04,139 INFO [train.py:996] (3/4) Epoch 5, batch 800, loss[loss=0.2455, simple_loss=0.3103, pruned_loss=0.09039, over 21863.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3292, pruned_loss=0.0944, over 4198831.07 frames.
], batch size: 351, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:52:09,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=736674.0, ans=0.125 2023-06-20 14:52:11,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=736674.0, ans=0.0 2023-06-20 14:53:46,641 INFO [train.py:996] (3/4) Epoch 5, batch 850, loss[loss=0.2809, simple_loss=0.332, pruned_loss=0.1149, over 21646.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3273, pruned_loss=0.09472, over 4219018.86 frames. ], batch size: 508, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:54:21,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=737034.0, ans=0.2 2023-06-20 14:54:40,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=737094.0, ans=0.0 2023-06-20 14:54:57,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.960e+02 3.459e+02 4.425e+02 7.988e+02, threshold=6.917e+02, percent-clipped=3.0 2023-06-20 14:55:00,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=737154.0, ans=0.125 2023-06-20 14:55:06,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-20 14:55:09,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=737154.0, ans=0.125 2023-06-20 14:55:32,815 INFO [train.py:996] (3/4) Epoch 5, batch 900, loss[loss=0.2334, simple_loss=0.3052, pruned_loss=0.08079, over 21111.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3265, pruned_loss=0.09469, over 4229533.54 frames. ], batch size: 143, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:56:25,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=737394.0, ans=15.0 2023-06-20 14:57:13,153 INFO [train.py:996] (3/4) Epoch 5, batch 950, loss[loss=0.2306, simple_loss=0.2982, pruned_loss=0.08154, over 21826.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3228, pruned_loss=0.09358, over 4247066.56 frames. ], batch size: 282, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:57:13,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=737574.0, ans=0.0 2023-06-20 14:57:19,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=737574.0, ans=0.125 2023-06-20 14:58:18,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.716e+02 3.192e+02 3.705e+02 5.586e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-20 14:58:29,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=15.0 2023-06-20 14:58:54,032 INFO [train.py:996] (3/4) Epoch 5, batch 1000, loss[loss=0.2445, simple_loss=0.3072, pruned_loss=0.09096, over 21313.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3233, pruned_loss=0.09504, over 4259276.25 frames. 
], batch size: 143, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:59:02,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=737874.0, ans=0.125 2023-06-20 14:59:13,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=737874.0, ans=0.125 2023-06-20 14:59:47,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=737994.0, ans=0.125 2023-06-20 15:00:37,259 INFO [train.py:996] (3/4) Epoch 5, batch 1050, loss[loss=0.2748, simple_loss=0.3463, pruned_loss=0.1017, over 21531.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3229, pruned_loss=0.0944, over 4264969.82 frames. ], batch size: 471, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:00:39,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=738174.0, ans=0.125 2023-06-20 15:00:42,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=738174.0, ans=0.125 2023-06-20 15:00:49,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=738174.0, ans=0.125 2023-06-20 15:00:53,331 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:00:57,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=15.0 2023-06-20 15:01:26,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-20 15:01:40,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=738294.0, ans=0.125 2023-06-20 15:01:44,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.880e+02 3.329e+02 4.012e+02 6.640e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-20 15:01:56,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-06-20 15:02:22,414 INFO [train.py:996] (3/4) Epoch 5, batch 1100, loss[loss=0.2601, simple_loss=0.3318, pruned_loss=0.09424, over 21265.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3237, pruned_loss=0.09416, over 4273485.09 frames. ], batch size: 176, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:02:34,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=738474.0, ans=0.125 2023-06-20 15:03:24,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=738594.0, ans=0.0 2023-06-20 15:03:50,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=738714.0, ans=0.125 2023-06-20 15:04:07,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738714.0, ans=0.1 2023-06-20 15:04:10,610 INFO [train.py:996] (3/4) Epoch 5, batch 1150, loss[loss=0.2241, simple_loss=0.3237, pruned_loss=0.06229, over 21785.00 frames. 
], tot_loss[loss=0.2562, simple_loss=0.3248, pruned_loss=0.09381, over 4277463.04 frames. ], batch size: 351, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:04:21,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=738774.0, ans=0.07 2023-06-20 15:04:36,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738834.0, ans=0.1 2023-06-20 15:04:44,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-20 15:04:47,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-20 15:04:52,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-20 15:05:04,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-20 15:05:07,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=738894.0, ans=0.2 2023-06-20 15:05:18,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.853e+02 3.547e+02 4.502e+02 9.164e+02, threshold=7.095e+02, percent-clipped=7.0 2023-06-20 15:05:22,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=738954.0, ans=0.125 2023-06-20 15:06:05,918 INFO [train.py:996] (3/4) Epoch 5, batch 1200, loss[loss=0.2366, simple_loss=0.3237, pruned_loss=0.07476, over 21765.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3271, pruned_loss=0.09378, over 4284866.83 frames. ], batch size: 282, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:06:42,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=739194.0, ans=0.0 2023-06-20 15:07:04,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-20 15:07:08,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=739254.0, ans=0.035 2023-06-20 15:07:48,961 INFO [train.py:996] (3/4) Epoch 5, batch 1250, loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.09072, over 21877.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3289, pruned_loss=0.0944, over 4284067.01 frames. 
], batch size: 118, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:08:03,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=739434.0, ans=0.125 2023-06-20 15:08:17,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=739434.0, ans=0.125 2023-06-20 15:08:40,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=739494.0, ans=0.125 2023-06-20 15:08:45,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.897e+02 3.475e+02 4.049e+02 7.365e+02, threshold=6.950e+02, percent-clipped=1.0 2023-06-20 15:09:21,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-20 15:09:33,259 INFO [train.py:996] (3/4) Epoch 5, batch 1300, loss[loss=0.2547, simple_loss=0.3175, pruned_loss=0.09594, over 21288.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3303, pruned_loss=0.0955, over 4278509.64 frames. ], batch size: 176, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:09:35,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739674.0, ans=0.1 2023-06-20 15:09:36,986 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:09:58,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739734.0, ans=0.1 2023-06-20 15:10:01,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=739734.0, ans=0.125 2023-06-20 15:10:03,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739734.0, ans=0.1 2023-06-20 15:11:10,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=739914.0, ans=10.0 2023-06-20 15:11:16,850 INFO [train.py:996] (3/4) Epoch 5, batch 1350, loss[loss=0.3363, simple_loss=0.3877, pruned_loss=0.1424, over 21323.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.332, pruned_loss=0.09646, over 4288827.78 frames. ], batch size: 507, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:12:06,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=740094.0, ans=10.0 2023-06-20 15:12:12,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.977e+02 3.601e+02 4.425e+02 6.870e+02, threshold=7.202e+02, percent-clipped=0.0 2023-06-20 15:13:00,154 INFO [train.py:996] (3/4) Epoch 5, batch 1400, loss[loss=0.2847, simple_loss=0.3339, pruned_loss=0.1177, over 21352.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3276, pruned_loss=0.09526, over 4287660.39 frames. ], batch size: 159, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:13:02,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. 
limit=6.0 2023-06-20 15:13:43,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740394.0, ans=0.1 2023-06-20 15:14:24,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=740514.0, ans=0.125 2023-06-20 15:14:43,269 INFO [train.py:996] (3/4) Epoch 5, batch 1450, loss[loss=0.3032, simple_loss=0.3746, pruned_loss=0.1159, over 21377.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3277, pruned_loss=0.09606, over 4288430.20 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:14:43,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=740574.0, ans=0.0 2023-06-20 15:14:48,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740574.0, ans=0.1 2023-06-20 15:14:58,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2023-06-20 15:15:37,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 2.878e+02 3.343e+02 3.968e+02 7.161e+02, threshold=6.685e+02, percent-clipped=0.0 2023-06-20 15:16:24,274 INFO [train.py:996] (3/4) Epoch 5, batch 1500, loss[loss=0.2635, simple_loss=0.3197, pruned_loss=0.1036, over 21328.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3298, pruned_loss=0.09738, over 4285148.69 frames. ], batch size: 176, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:17:02,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-20 15:18:09,998 INFO [train.py:996] (3/4) Epoch 5, batch 1550, loss[loss=0.1439, simple_loss=0.2091, pruned_loss=0.03938, over 16917.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3276, pruned_loss=0.09652, over 4287687.70 frames. ], batch size: 61, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:18:29,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=741234.0, ans=0.0 2023-06-20 15:19:04,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=741354.0, ans=0.125 2023-06-20 15:19:09,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.929e+02 3.428e+02 4.021e+02 6.196e+02, threshold=6.855e+02, percent-clipped=0.0 2023-06-20 15:19:39,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=741414.0, ans=0.0 2023-06-20 15:19:47,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=741414.0, ans=0.125 2023-06-20 15:19:51,908 INFO [train.py:996] (3/4) Epoch 5, batch 1600, loss[loss=0.2075, simple_loss=0.3039, pruned_loss=0.05555, over 19838.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3245, pruned_loss=0.09422, over 4273162.88 frames. ], batch size: 702, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:19:52,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=741474.0, ans=0.0 2023-06-20 15:19:59,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=22.5 2023-06-20 15:20:06,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=741534.0, ans=0.125 2023-06-20 15:20:17,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=741534.0, ans=0.125 2023-06-20 15:20:33,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=741594.0, ans=0.125 2023-06-20 15:21:13,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0 2023-06-20 15:21:20,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-20 15:21:28,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-20 15:21:36,255 INFO [train.py:996] (3/4) Epoch 5, batch 1650, loss[loss=0.2751, simple_loss=0.3444, pruned_loss=0.1029, over 21627.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3228, pruned_loss=0.09322, over 4272560.64 frames. ], batch size: 389, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:21:36,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=741774.0, ans=0.125 2023-06-20 15:21:44,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=741774.0, ans=0.125 2023-06-20 15:21:59,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=741834.0, ans=0.0 2023-06-20 15:22:08,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-20 15:22:51,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.044e+02 3.519e+02 4.349e+02 7.461e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:23:15,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=742014.0, ans=0.125 2023-06-20 15:23:19,932 INFO [train.py:996] (3/4) Epoch 5, batch 1700, loss[loss=0.316, simple_loss=0.3977, pruned_loss=0.1171, over 21521.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3268, pruned_loss=0.09529, over 4273810.76 frames. ], batch size: 471, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:23:57,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=742194.0, ans=0.125 2023-06-20 15:25:00,561 INFO [train.py:996] (3/4) Epoch 5, batch 1750, loss[loss=0.2371, simple_loss=0.2974, pruned_loss=0.08839, over 21646.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3283, pruned_loss=0.09487, over 4270663.26 frames. 
], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:25:21,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=742434.0, ans=0.0 2023-06-20 15:26:13,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.877e+02 3.732e+02 4.371e+02 8.077e+02, threshold=7.464e+02, percent-clipped=3.0 2023-06-20 15:26:40,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=742614.0, ans=0.04949747468305833 2023-06-20 15:26:46,877 INFO [train.py:996] (3/4) Epoch 5, batch 1800, loss[loss=0.3122, simple_loss=0.3932, pruned_loss=0.1156, over 21461.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3271, pruned_loss=0.0924, over 4275804.75 frames. ], batch size: 507, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:27:33,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=742734.0, ans=0.125 2023-06-20 15:27:56,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=742854.0, ans=0.0 2023-06-20 15:28:18,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=742914.0, ans=0.125 2023-06-20 15:28:18,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=742914.0, ans=0.0 2023-06-20 15:28:18,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=742914.0, ans=0.0 2023-06-20 15:28:30,572 INFO [train.py:996] (3/4) Epoch 5, batch 1850, loss[loss=0.2565, simple_loss=0.3226, pruned_loss=0.09521, over 16206.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3288, pruned_loss=0.09054, over 4269619.24 frames. ], batch size: 60, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:29:40,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.730e+02 3.304e+02 4.070e+02 7.005e+02, threshold=6.608e+02, percent-clipped=0.0 2023-06-20 15:30:00,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=743214.0, ans=0.0 2023-06-20 15:30:18,493 INFO [train.py:996] (3/4) Epoch 5, batch 1900, loss[loss=0.2389, simple_loss=0.3147, pruned_loss=0.0815, over 21630.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.329, pruned_loss=0.09102, over 4278117.78 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:30:21,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=15.0 2023-06-20 15:30:36,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=743274.0, ans=0.0 2023-06-20 15:30:45,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=743334.0, ans=0.125 2023-06-20 15:31:12,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743394.0, ans=0.1 2023-06-20 15:31:39,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=743514.0, ans=0.125 2023-06-20 15:31:54,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=743514.0, ans=0.05 2023-06-20 15:31:56,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=743514.0, ans=0.04949747468305833 2023-06-20 15:32:02,675 INFO [train.py:996] (3/4) Epoch 5, batch 1950, loss[loss=0.2276, simple_loss=0.2811, pruned_loss=0.08701, over 21521.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.325, pruned_loss=0.09062, over 4279563.12 frames. ], batch size: 441, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:32:22,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=743574.0, ans=0.2 2023-06-20 15:32:36,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-20 15:32:43,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-20 15:32:59,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-06-20 15:33:10,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.018e+02 3.404e+02 4.237e+02 8.100e+02, threshold=6.807e+02, percent-clipped=3.0 2023-06-20 15:33:49,085 INFO [train.py:996] (3/4) Epoch 5, batch 2000, loss[loss=0.2029, simple_loss=0.2747, pruned_loss=0.06551, over 21586.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3189, pruned_loss=0.08911, over 4274053.65 frames. ], batch size: 230, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:34:21,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743934.0, ans=0.1 2023-06-20 15:34:28,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=743934.0, ans=0.125 2023-06-20 15:35:00,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-20 15:35:28,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=744114.0, ans=0.025 2023-06-20 15:35:34,224 INFO [train.py:996] (3/4) Epoch 5, batch 2050, loss[loss=0.2161, simple_loss=0.2865, pruned_loss=0.07283, over 21392.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3207, pruned_loss=0.08924, over 4268241.54 frames. 
], batch size: 194, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:36:07,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-20 15:36:13,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744294.0, ans=0.1 2023-06-20 15:36:24,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=744294.0, ans=0.125 2023-06-20 15:36:39,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.809e+02 3.320e+02 4.171e+02 6.443e+02, threshold=6.640e+02, percent-clipped=0.0 2023-06-20 15:36:39,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=744354.0, ans=0.125 2023-06-20 15:37:04,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-20 15:37:11,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-20 15:37:12,246 INFO [train.py:996] (3/4) Epoch 5, batch 2100, loss[loss=0.3318, simple_loss=0.3901, pruned_loss=0.1368, over 21737.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3247, pruned_loss=0.09185, over 4274990.73 frames. ], batch size: 441, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:38:41,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=744714.0, ans=0.125 2023-06-20 15:39:06,011 INFO [train.py:996] (3/4) Epoch 5, batch 2150, loss[loss=0.2734, simple_loss=0.324, pruned_loss=0.1114, over 21305.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3276, pruned_loss=0.09226, over 4273904.37 frames. ], batch size: 471, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:39:27,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-20 15:39:29,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=744834.0, ans=0.125 2023-06-20 15:40:00,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-20 15:40:10,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.145e+02 3.806e+02 5.141e+02 9.299e+02, threshold=7.611e+02, percent-clipped=10.0 2023-06-20 15:40:21,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=744954.0, ans=0.125 2023-06-20 15:40:22,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=744954.0, ans=0.125 2023-06-20 15:40:32,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. 
limit=10.0 2023-06-20 15:40:33,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=745014.0, ans=0.0 2023-06-20 15:40:43,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=745074.0, ans=0.125 2023-06-20 15:40:44,289 INFO [train.py:996] (3/4) Epoch 5, batch 2200, loss[loss=0.2896, simple_loss=0.3739, pruned_loss=0.1026, over 21530.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3314, pruned_loss=0.09337, over 4265598.56 frames. ], batch size: 471, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:40:55,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=745074.0, ans=0.2 2023-06-20 15:41:56,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=745254.0, ans=0.2 2023-06-20 15:42:40,438 INFO [train.py:996] (3/4) Epoch 5, batch 2250, loss[loss=0.2302, simple_loss=0.3006, pruned_loss=0.07986, over 21371.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3276, pruned_loss=0.09109, over 4262408.37 frames. ], batch size: 211, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:42:55,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=745434.0, ans=0.125 2023-06-20 15:43:12,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745494.0, ans=0.1 2023-06-20 15:43:28,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-20 15:43:46,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.695e+02 3.113e+02 3.722e+02 7.366e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-20 15:44:04,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=745614.0, ans=0.2 2023-06-20 15:44:23,180 INFO [train.py:996] (3/4) Epoch 5, batch 2300, loss[loss=0.2505, simple_loss=0.2953, pruned_loss=0.1028, over 21210.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3193, pruned_loss=0.08967, over 4264301.27 frames. ], batch size: 471, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:46:06,132 INFO [train.py:996] (3/4) Epoch 5, batch 2350, loss[loss=0.2563, simple_loss=0.3199, pruned_loss=0.09631, over 21480.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3161, pruned_loss=0.09054, over 4269712.36 frames. ], batch size: 389, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:46:11,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=745974.0, ans=0.07 2023-06-20 15:46:11,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.28 vs. 
limit=10.0 2023-06-20 15:46:28,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=746034.0, ans=0.0 2023-06-20 15:47:12,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.470e+02 3.194e+02 3.678e+02 4.653e+02 7.153e+02, threshold=7.356e+02, percent-clipped=3.0 2023-06-20 15:47:38,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746214.0, ans=0.1 2023-06-20 15:47:50,559 INFO [train.py:996] (3/4) Epoch 5, batch 2400, loss[loss=0.2157, simple_loss=0.3276, pruned_loss=0.05187, over 19694.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3202, pruned_loss=0.09343, over 4269743.48 frames. ], batch size: 702, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:47:56,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=746274.0, ans=0.0 2023-06-20 15:48:20,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-20 15:49:08,403 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:49:36,778 INFO [train.py:996] (3/4) Epoch 5, batch 2450, loss[loss=0.2696, simple_loss=0.3268, pruned_loss=0.1062, over 21578.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.328, pruned_loss=0.09706, over 4277780.08 frames. ], batch size: 441, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:49:58,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=746634.0, ans=0.0 2023-06-20 15:50:47,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.072e+02 3.519e+02 4.096e+02 7.474e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:51:07,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=746814.0, ans=0.0 2023-06-20 15:51:18,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=746874.0, ans=0.1 2023-06-20 15:51:19,742 INFO [train.py:996] (3/4) Epoch 5, batch 2500, loss[loss=0.2457, simple_loss=0.3117, pruned_loss=0.0898, over 21861.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3275, pruned_loss=0.09659, over 4275921.09 frames. ], batch size: 107, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:51:29,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=746874.0, ans=0.125 2023-06-20 15:51:31,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-20 15:51:37,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=746934.0, ans=0.5 2023-06-20 15:51:55,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=746994.0, ans=0.125 2023-06-20 15:51:56,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=746994.0, ans=0.2 2023-06-20 15:52:52,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. 
limit=15.0 2023-06-20 15:53:02,334 INFO [train.py:996] (3/4) Epoch 5, batch 2550, loss[loss=0.2519, simple_loss=0.3417, pruned_loss=0.08101, over 21651.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3245, pruned_loss=0.09538, over 4279315.33 frames. ], batch size: 298, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:54:06,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.818e+02 3.202e+02 3.816e+02 6.010e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-20 15:54:19,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=747354.0, ans=0.125 2023-06-20 15:54:33,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=747414.0, ans=0.2 2023-06-20 15:54:40,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=747414.0, ans=0.125 2023-06-20 15:54:44,527 INFO [train.py:996] (3/4) Epoch 5, batch 2600, loss[loss=0.3047, simple_loss=0.3625, pruned_loss=0.1234, over 21893.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3241, pruned_loss=0.09612, over 4280992.34 frames. ], batch size: 372, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:54:56,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=747474.0, ans=0.125 2023-06-20 15:54:57,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=747474.0, ans=0.0 2023-06-20 15:55:17,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=747534.0, ans=0.0 2023-06-20 15:56:27,523 INFO [train.py:996] (3/4) Epoch 5, batch 2650, loss[loss=0.2446, simple_loss=0.3315, pruned_loss=0.07887, over 21075.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3272, pruned_loss=0.09717, over 4277579.34 frames. ], batch size: 607, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:56:47,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=747834.0, ans=0.0 2023-06-20 15:57:39,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.025e+02 3.669e+02 4.481e+02 6.938e+02, threshold=7.338e+02, percent-clipped=2.0 2023-06-20 15:57:42,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=747954.0, ans=0.2 2023-06-20 15:57:46,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-20 15:58:06,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=748014.0, ans=0.125 2023-06-20 15:58:10,846 INFO [train.py:996] (3/4) Epoch 5, batch 2700, loss[loss=0.2061, simple_loss=0.2806, pruned_loss=0.06583, over 21753.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3261, pruned_loss=0.09662, over 4278320.01 frames. ], batch size: 282, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:58:36,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. 
limit=15.0 2023-06-20 15:58:52,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=748194.0, ans=0.1 2023-06-20 15:59:04,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=748194.0, ans=0.125 2023-06-20 15:59:20,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=748254.0, ans=0.125 2023-06-20 15:59:52,540 INFO [train.py:996] (3/4) Epoch 5, batch 2750, loss[loss=0.2362, simple_loss=0.2949, pruned_loss=0.08872, over 21575.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3238, pruned_loss=0.09567, over 4284814.35 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:00:43,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=748494.0, ans=0.04949747468305833 2023-06-20 16:00:49,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=748494.0, ans=0.125 2023-06-20 16:00:49,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=748494.0, ans=0.95 2023-06-20 16:01:06,306 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.120e+02 3.768e+02 4.829e+02 8.745e+02, threshold=7.536e+02, percent-clipped=5.0 2023-06-20 16:01:31,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=748614.0, ans=0.0 2023-06-20 16:01:36,373 INFO [train.py:996] (3/4) Epoch 5, batch 2800, loss[loss=0.2971, simple_loss=0.3514, pruned_loss=0.1214, over 21312.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.329, pruned_loss=0.09666, over 4284666.57 frames. ], batch size: 176, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:02:02,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=748734.0, ans=0.0 2023-06-20 16:02:14,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=748734.0, ans=0.0 2023-06-20 16:02:27,963 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:02:57,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=748854.0, ans=0.125 2023-06-20 16:03:09,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=748914.0, ans=0.125 2023-06-20 16:03:27,342 INFO [train.py:996] (3/4) Epoch 5, batch 2850, loss[loss=0.3423, simple_loss=0.3976, pruned_loss=0.1435, over 21391.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3313, pruned_loss=0.09801, over 4280258.69 frames. ], batch size: 549, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:04:04,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-20 16:04:06,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. 
limit=15.0 2023-06-20 16:04:15,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=749094.0, ans=0.2 2023-06-20 16:04:16,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=749094.0, ans=0.07 2023-06-20 16:04:30,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=749154.0, ans=0.125 2023-06-20 16:04:42,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.197e+02 3.952e+02 5.010e+02 9.652e+02, threshold=7.904e+02, percent-clipped=7.0 2023-06-20 16:04:42,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=749154.0, ans=0.125 2023-06-20 16:04:42,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=749154.0, ans=0.0 2023-06-20 16:04:46,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=749154.0, ans=0.0 2023-06-20 16:04:46,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749154.0, ans=0.1 2023-06-20 16:04:56,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=749214.0, ans=0.0 2023-06-20 16:05:10,318 INFO [train.py:996] (3/4) Epoch 5, batch 2900, loss[loss=0.2424, simple_loss=0.3083, pruned_loss=0.08822, over 21897.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3262, pruned_loss=0.09604, over 4274832.35 frames. ], batch size: 414, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:05:30,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=749334.0, ans=0.125 2023-06-20 16:05:40,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=749334.0, ans=0.2 2023-06-20 16:06:19,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-20 16:06:25,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=749454.0, ans=0.125 2023-06-20 16:06:39,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=22.5 2023-06-20 16:06:50,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=749514.0, ans=0.025 2023-06-20 16:06:52,772 INFO [train.py:996] (3/4) Epoch 5, batch 2950, loss[loss=0.2709, simple_loss=0.3493, pruned_loss=0.09622, over 21653.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3262, pruned_loss=0.09533, over 4283163.75 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:06:54,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749574.0, ans=0.1 2023-06-20 16:07:13,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.55 vs. 
limit=15.0 2023-06-20 16:07:49,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=749694.0, ans=0.125 2023-06-20 16:08:10,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.876e+02 3.268e+02 4.025e+02 7.097e+02, threshold=6.536e+02, percent-clipped=0.0 2023-06-20 16:08:33,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=749814.0, ans=0.0 2023-06-20 16:08:36,210 INFO [train.py:996] (3/4) Epoch 5, batch 3000, loss[loss=0.2636, simple_loss=0.3203, pruned_loss=0.1034, over 21630.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3316, pruned_loss=0.09679, over 4284222.81 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 8.0 2023-06-20 16:08:36,210 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 16:08:50,959 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7500, 1.7699, 1.6261, 2.1496, 1.6796, 1.9627, 1.8011, 1.7566], device='cuda:3') 2023-06-20 16:08:55,132 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2579, simple_loss=0.3533, pruned_loss=0.08129, over 1796401.00 frames. 2023-06-20 16:08:55,133 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 16:09:32,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=749934.0, ans=0.5 2023-06-20 16:09:43,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-20 16:09:45,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749994.0, ans=0.125 2023-06-20 16:10:00,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=750054.0, ans=0.125 2023-06-20 16:10:17,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=750114.0, ans=0.125 2023-06-20 16:10:39,804 INFO [train.py:996] (3/4) Epoch 5, batch 3050, loss[loss=0.2991, simple_loss=0.3724, pruned_loss=0.1129, over 21462.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3314, pruned_loss=0.09459, over 4291378.68 frames. ], batch size: 548, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:10:40,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=12.0 2023-06-20 16:11:26,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=750294.0, ans=0.125 2023-06-20 16:11:26,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=750294.0, ans=0.125 2023-06-20 16:11:34,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0
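
At batch 3000 the trainer pauses for a validation pass, and [zipformer.py:1728] prints one attention-entropy value per head: eight values here, matching the 8 attention heads configured for encoder.encoders.3. Per-head entropy of the attention weights is a standard collapse diagnostic: near 0 means every query attends to a single key, while log(num_keys) means attention is uniform, so mid-range values like the 1.6-2.1 printed above indicate heads that are neither collapsed nor uniform. Below is a small sketch of how such a diagnostic can be computed; the function name and tensor shapes are assumptions rather than icefall's exact code.

```python
import torch


def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    """Mean entropy per attention head.

    attn: (num_heads, seq_len_q, seq_len_k), each row a distribution over
    keys. Sketch of the diagnostic printed at zipformer.py:1728 above;
    the exact implementation may differ."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (num_heads, seq_len_q)
    return entropy.mean(dim=-1)                         # one value per head


attn = torch.softmax(torch.randn(8, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # a tensor of 8 per-head entropies
```
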
2023-06-20 16:11:52,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=750354.0, ans=0.125 2023-06-20 16:11:57,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750354.0, ans=0.1 2023-06-20 16:11:58,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.749e+02 3.160e+02 3.983e+02 6.617e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 16:12:25,600 INFO [train.py:996] (3/4) Epoch 5, batch 3100, loss[loss=0.2097, simple_loss=0.2874, pruned_loss=0.06597, over 21209.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3319, pruned_loss=0.09395, over 4293165.40 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:12:38,823 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:12:59,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-20 16:13:24,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=22.5 2023-06-20 16:14:15,814 INFO [train.py:996] (3/4) Epoch 5, batch 3150, loss[loss=0.3685, simple_loss=0.4094, pruned_loss=0.1638, over 21371.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3342, pruned_loss=0.0951, over 4290336.22 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:14:31,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=750834.0, ans=0.0 2023-06-20 16:14:51,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=750834.0, ans=0.125 2023-06-20 16:15:01,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=750894.0, ans=0.0 2023-06-20 16:15:06,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-20 16:15:19,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750954.0, ans=0.1 2023-06-20 16:15:26,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=750954.0, ans=0.0 2023-06-20 16:15:28,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.671e+02 3.239e+02 3.868e+02 6.706e+02, threshold=6.479e+02, percent-clipped=2.0 2023-06-20 16:15:30,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=750954.0, ans=0.125 2023-06-20 16:15:54,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=751074.0, ans=0.2 2023-06-20 16:15:55,770 INFO [train.py:996] (3/4) Epoch 5, batch 3200, loss[loss=0.3134, simple_loss=0.3894, pruned_loss=0.1187, over 21444.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.337, pruned_loss=0.09635, over 4287597.71 frames.
], batch size: 507, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:16:06,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=751074.0, ans=0.125 2023-06-20 16:16:53,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=751194.0, ans=0.0 2023-06-20 16:17:22,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=751314.0, ans=0.125 2023-06-20 16:17:23,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=751314.0, ans=0.04949747468305833 2023-06-20 16:17:40,162 INFO [train.py:996] (3/4) Epoch 5, batch 3250, loss[loss=0.227, simple_loss=0.3176, pruned_loss=0.06819, over 21682.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3388, pruned_loss=0.09914, over 4285622.69 frames. ], batch size: 263, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:18:07,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=751434.0, ans=0.2 2023-06-20 16:18:21,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=751434.0, ans=15.0 2023-06-20 16:19:02,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.071e+02 3.454e+02 4.020e+02 6.852e+02, threshold=6.907e+02, percent-clipped=1.0 2023-06-20 16:19:23,813 INFO [train.py:996] (3/4) Epoch 5, batch 3300, loss[loss=0.2291, simple_loss=0.278, pruned_loss=0.09011, over 15324.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3333, pruned_loss=0.09705, over 4278907.53 frames. ], batch size: 61, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:19:25,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=751674.0, ans=0.0 2023-06-20 16:19:33,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=751674.0, ans=0.125 2023-06-20 16:19:44,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=751734.0, ans=0.2 2023-06-20 16:19:47,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=751734.0, ans=0.2 2023-06-20 16:21:09,009 INFO [train.py:996] (3/4) Epoch 5, batch 3350, loss[loss=0.2893, simple_loss=0.347, pruned_loss=0.1158, over 21391.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3347, pruned_loss=0.09793, over 4277089.63 frames. ], batch size: 159, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:22:31,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.245e+02 3.953e+02 4.970e+02 1.057e+03, threshold=7.906e+02, percent-clipped=6.0 2023-06-20 16:22:51,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=752214.0, ans=0.1 2023-06-20 16:22:57,016 INFO [train.py:996] (3/4) Epoch 5, batch 3400, loss[loss=0.2352, simple_loss=0.2931, pruned_loss=0.08871, over 21634.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3352, pruned_loss=0.09857, over 4278857.27 frames. 
], batch size: 264, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:23:03,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.82 vs. limit=15.0 2023-06-20 16:24:41,903 INFO [train.py:996] (3/4) Epoch 5, batch 3450, loss[loss=0.2236, simple_loss=0.2848, pruned_loss=0.08123, over 21732.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3295, pruned_loss=0.09673, over 4263913.64 frames. ], batch size: 316, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:24:57,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=752574.0, ans=0.09899494936611666 2023-06-20 16:24:59,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=752574.0, ans=0.0 2023-06-20 16:25:16,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. limit=6.0 2023-06-20 16:26:05,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.288e+02 3.795e+02 4.947e+02 8.128e+02, threshold=7.589e+02, percent-clipped=1.0 2023-06-20 16:26:19,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=752814.0, ans=0.0 2023-06-20 16:26:27,182 INFO [train.py:996] (3/4) Epoch 5, batch 3500, loss[loss=0.3104, simple_loss=0.3765, pruned_loss=0.1222, over 21436.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3374, pruned_loss=0.1002, over 4265444.88 frames. ], batch size: 548, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:27:26,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=752994.0, ans=0.125 2023-06-20 16:27:49,663 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:28:10,867 INFO [train.py:996] (3/4) Epoch 5, batch 3550, loss[loss=0.2355, simple_loss=0.2979, pruned_loss=0.0866, over 21652.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3395, pruned_loss=0.101, over 4270420.74 frames. ], batch size: 282, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:29:07,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=753294.0, ans=0.125 2023-06-20 16:29:07,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=753294.0, ans=0.125 2023-06-20 16:29:07,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=753294.0, ans=0.2 2023-06-20 16:29:35,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.975e+02 3.456e+02 4.245e+02 7.529e+02, threshold=6.912e+02, percent-clipped=0.0 2023-06-20 16:29:35,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=753354.0, ans=0.2 2023-06-20 16:30:01,823 INFO [train.py:996] (3/4) Epoch 5, batch 3600, loss[loss=0.2481, simple_loss=0.3213, pruned_loss=0.08742, over 20644.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3335, pruned_loss=0.1001, over 4271646.21 frames. ], batch size: 607, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:30:08,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. 
limit=10.0 2023-06-20 16:30:39,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-20 16:31:00,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753594.0, ans=0.125 2023-06-20 16:31:25,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=753714.0, ans=0.0 2023-06-20 16:31:46,513 INFO [train.py:996] (3/4) Epoch 5, batch 3650, loss[loss=0.2752, simple_loss=0.3355, pruned_loss=0.1075, over 21602.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3335, pruned_loss=0.1002, over 4268453.71 frames. ], batch size: 263, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:31:51,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=753774.0, ans=0.04949747468305833 2023-06-20 16:32:00,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=753774.0, ans=0.125 2023-06-20 16:32:20,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=753834.0, ans=0.0 2023-06-20 16:32:40,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-20 16:32:52,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-20 16:32:58,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=753954.0, ans=0.125 2023-06-20 16:33:02,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.073e+02 3.466e+02 4.352e+02 7.872e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-20 16:33:03,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=753954.0, ans=0.2 2023-06-20 16:33:29,226 INFO [train.py:996] (3/4) Epoch 5, batch 3700, loss[loss=0.2577, simple_loss=0.3367, pruned_loss=0.0894, over 21838.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3337, pruned_loss=0.09919, over 4269997.95 frames. ], batch size: 414, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:33:44,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=754074.0, ans=0.125 2023-06-20 16:33:46,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=754074.0, ans=0.2 2023-06-20 16:34:53,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=754254.0, ans=0.0 2023-06-20 16:35:01,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=754314.0, ans=0.0 2023-06-20 16:35:18,237 INFO [train.py:996] (3/4) Epoch 5, batch 3750, loss[loss=0.2867, simple_loss=0.3368, pruned_loss=0.1183, over 21742.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3325, pruned_loss=0.09897, over 4275888.59 frames. 
], batch size: 508, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:35:39,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-20 16:36:36,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.776e+02 3.273e+02 3.853e+02 7.611e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-20 16:37:07,420 INFO [train.py:996] (3/4) Epoch 5, batch 3800, loss[loss=0.2875, simple_loss=0.3483, pruned_loss=0.1134, over 21649.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3303, pruned_loss=0.09705, over 4277278.77 frames. ], batch size: 351, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:37:08,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-20 16:38:12,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-20 16:38:43,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=754974.0, ans=0.125 2023-06-20 16:38:44,998 INFO [train.py:996] (3/4) Epoch 5, batch 3850, loss[loss=0.2322, simple_loss=0.2901, pruned_loss=0.0872, over 15348.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3282, pruned_loss=0.09802, over 4273208.78 frames. ], batch size: 61, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:39:10,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 16:39:25,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=755034.0, ans=0.0 2023-06-20 16:40:01,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.004e+02 3.563e+02 4.477e+02 7.369e+02, threshold=7.126e+02, percent-clipped=2.0 2023-06-20 16:40:02,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=755154.0, ans=0.1 2023-06-20 16:40:09,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=755214.0, ans=0.0 2023-06-20 16:40:27,654 INFO [train.py:996] (3/4) Epoch 5, batch 3900, loss[loss=0.2638, simple_loss=0.3337, pruned_loss=0.09693, over 16842.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3235, pruned_loss=0.09719, over 4270452.63 frames. ], batch size: 60, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:40:32,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=755274.0, ans=0.1 2023-06-20 16:41:32,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-20 16:42:00,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-20 16:42:11,396 INFO [train.py:996] (3/4) Epoch 5, batch 3950, loss[loss=0.2034, simple_loss=0.2782, pruned_loss=0.06431, over 21615.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3241, pruned_loss=0.09492, over 4278171.23 frames. 
], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:42:11,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=755574.0, ans=0.2 2023-06-20 16:43:20,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=755754.0, ans=0.125 2023-06-20 16:43:33,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.896e+02 3.603e+02 4.962e+02 8.484e+02, threshold=7.206e+02, percent-clipped=4.0 2023-06-20 16:43:42,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.18 vs. limit=12.0 2023-06-20 16:43:50,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=755814.0, ans=0.2 2023-06-20 16:43:52,906 INFO [train.py:996] (3/4) Epoch 5, batch 4000, loss[loss=0.2556, simple_loss=0.3107, pruned_loss=0.1002, over 21533.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3168, pruned_loss=0.09138, over 4270061.45 frames. ], batch size: 414, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:44:38,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=755994.0, ans=0.125 2023-06-20 16:45:15,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-20 16:45:20,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=756114.0, ans=0.0 2023-06-20 16:45:41,096 INFO [train.py:996] (3/4) Epoch 5, batch 4050, loss[loss=0.2359, simple_loss=0.3029, pruned_loss=0.08446, over 21275.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3168, pruned_loss=0.08968, over 4271372.87 frames. ], batch size: 159, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:45:51,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=756174.0, ans=0.05 2023-06-20 16:46:25,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=756294.0, ans=0.125 2023-06-20 16:46:28,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=756294.0, ans=0.125 2023-06-20 16:46:57,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.627e+02 3.094e+02 3.740e+02 6.411e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-20 16:47:00,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-20 16:47:21,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=756474.0, ans=0.125 2023-06-20 16:47:22,990 INFO [train.py:996] (3/4) Epoch 5, batch 4100, loss[loss=0.3013, simple_loss=0.3649, pruned_loss=0.1189, over 20703.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3185, pruned_loss=0.09036, over 4272809.53 frames. ], batch size: 607, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:49:06,407 INFO [train.py:996] (3/4) Epoch 5, batch 4150, loss[loss=0.2, simple_loss=0.2893, pruned_loss=0.0554, over 21503.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3198, pruned_loss=0.08747, over 4273748.27 frames. 
], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:49:10,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=756774.0, ans=0.125 2023-06-20 16:49:30,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=756834.0, ans=0.125 2023-06-20 16:49:39,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=756834.0, ans=0.125 2023-06-20 16:50:10,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=756954.0, ans=0.07 2023-06-20 16:50:26,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.812e+02 3.283e+02 4.437e+02 7.520e+02, threshold=6.566e+02, percent-clipped=5.0 2023-06-20 16:50:46,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=22.5 2023-06-20 16:50:49,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=757074.0, ans=0.2 2023-06-20 16:50:55,591 INFO [train.py:996] (3/4) Epoch 5, batch 4200, loss[loss=0.2179, simple_loss=0.2924, pruned_loss=0.07171, over 21524.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3196, pruned_loss=0.08743, over 4266534.52 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:50:56,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=757074.0, ans=0.125 2023-06-20 16:51:49,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-20 16:52:13,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-20 16:52:19,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=757314.0, ans=0.125 2023-06-20 16:52:27,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=757314.0, ans=0.04949747468305833 2023-06-20 16:52:31,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=757314.0, ans=0.0 2023-06-20 16:52:40,362 INFO [train.py:996] (3/4) Epoch 5, batch 4250, loss[loss=0.2814, simple_loss=0.3634, pruned_loss=0.09964, over 19960.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.09076, over 4270817.06 frames. 
], batch size: 702, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:52:42,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=757374.0, ans=0.0 2023-06-20 16:53:53,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=757554.0, ans=0.2 2023-06-20 16:53:59,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=757554.0, ans=0.0 2023-06-20 16:54:03,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.999e+02 3.547e+02 4.279e+02 1.014e+03, threshold=7.094e+02, percent-clipped=7.0 2023-06-20 16:54:03,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=757614.0, ans=0.2 2023-06-20 16:54:04,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-20 16:54:11,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=757614.0, ans=0.0 2023-06-20 16:54:14,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=757614.0, ans=0.125 2023-06-20 16:54:22,748 INFO [train.py:996] (3/4) Epoch 5, batch 4300, loss[loss=0.214, simple_loss=0.2912, pruned_loss=0.0684, over 21289.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3318, pruned_loss=0.09197, over 4276712.62 frames. ], batch size: 176, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:54:48,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=757734.0, ans=0.1 2023-06-20 16:54:55,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=757734.0, ans=0.125 2023-06-20 16:55:12,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=757794.0, ans=0.025 2023-06-20 16:55:12,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=757794.0, ans=0.125 2023-06-20 16:56:16,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-20 16:56:16,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-20 16:56:17,253 INFO [train.py:996] (3/4) Epoch 5, batch 4350, loss[loss=0.3091, simple_loss=0.3684, pruned_loss=0.1249, over 21365.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3301, pruned_loss=0.09153, over 4282278.89 frames. 
], batch size: 507, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:56:47,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=758034.0, ans=0.125 2023-06-20 16:57:26,971 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:57:31,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.864e+02 3.148e+02 3.712e+02 7.836e+02, threshold=6.297e+02, percent-clipped=1.0 2023-06-20 16:57:57,317 INFO [train.py:996] (3/4) Epoch 5, batch 4400, loss[loss=0.2409, simple_loss=0.3269, pruned_loss=0.07742, over 21606.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.326, pruned_loss=0.09066, over 4273892.61 frames. ], batch size: 263, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:58:03,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758274.0, ans=0.1 2023-06-20 16:58:33,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=758334.0, ans=0.2 2023-06-20 16:58:51,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=758394.0, ans=0.0 2023-06-20 16:59:00,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=758454.0, ans=0.0 2023-06-20 16:59:34,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=758514.0, ans=0.125 2023-06-20 16:59:39,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.75 vs. limit=22.5 2023-06-20 16:59:41,599 INFO [train.py:996] (3/4) Epoch 5, batch 4450, loss[loss=0.2359, simple_loss=0.3051, pruned_loss=0.08335, over 21881.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3347, pruned_loss=0.09291, over 4280248.37 frames. ], batch size: 107, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:00:02,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=758574.0, ans=0.125 2023-06-20 17:01:08,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.910e+02 3.386e+02 4.171e+02 6.417e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-20 17:01:23,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=758814.0, ans=0.125 2023-06-20 17:01:32,514 INFO [train.py:996] (3/4) Epoch 5, batch 4500, loss[loss=0.2835, simple_loss=0.385, pruned_loss=0.091, over 21263.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3385, pruned_loss=0.09583, over 4288802.19 frames. ], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:02:07,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=758934.0, ans=0.0 2023-06-20 17:02:37,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.37 vs. limit=15.0 2023-06-20 17:03:17,245 INFO [train.py:996] (3/4) Epoch 5, batch 4550, loss[loss=0.4184, simple_loss=0.5205, pruned_loss=0.1582, over 19766.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3442, pruned_loss=0.09738, over 4288749.42 frames. 
], batch size: 702, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:03:26,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=759174.0, ans=0.1 2023-06-20 17:03:27,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=759174.0, ans=0.0 2023-06-20 17:03:29,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=759174.0, ans=0.0 2023-06-20 17:04:34,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.512e+02 3.034e+02 3.831e+02 5.015e+02 1.154e+03, threshold=7.663e+02, percent-clipped=6.0 2023-06-20 17:04:34,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=759414.0, ans=0.0 2023-06-20 17:05:00,171 INFO [train.py:996] (3/4) Epoch 5, batch 4600, loss[loss=0.2301, simple_loss=0.3023, pruned_loss=0.07897, over 21274.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3448, pruned_loss=0.09765, over 4285906.08 frames. ], batch size: 176, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 17:05:21,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=759534.0, ans=0.0 2023-06-20 17:05:25,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759534.0, ans=0.125 2023-06-20 17:06:08,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=759714.0, ans=0.0 2023-06-20 17:06:36,964 INFO [train.py:996] (3/4) Epoch 5, batch 4650, loss[loss=0.3073, simple_loss=0.3569, pruned_loss=0.1288, over 21782.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3386, pruned_loss=0.09637, over 4287120.13 frames. ], batch size: 441, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:07:57,670 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.721e+02 3.118e+02 3.617e+02 7.093e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-20 17:08:00,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=760014.0, ans=0.125 2023-06-20 17:08:14,584 INFO [train.py:996] (3/4) Epoch 5, batch 4700, loss[loss=0.2093, simple_loss=0.2708, pruned_loss=0.07391, over 21717.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3286, pruned_loss=0.09397, over 4279233.98 frames. ], batch size: 300, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:08:21,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=760074.0, ans=0.0 2023-06-20 17:08:40,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-20 17:09:02,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=760194.0, ans=0.125 2023-06-20 17:09:56,835 INFO [train.py:996] (3/4) Epoch 5, batch 4750, loss[loss=0.2691, simple_loss=0.3311, pruned_loss=0.1036, over 21831.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3234, pruned_loss=0.09463, over 4278691.40 frames. 
], batch size: 112, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:10:01,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=760374.0, ans=0.125 2023-06-20 17:10:10,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=760374.0, ans=0.125 2023-06-20 17:11:18,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 2.865e+02 3.322e+02 3.733e+02 5.818e+02, threshold=6.645e+02, percent-clipped=0.0 2023-06-20 17:11:23,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=760614.0, ans=0.09899494936611666 2023-06-20 17:11:26,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=760614.0, ans=0.0 2023-06-20 17:11:34,490 INFO [train.py:996] (3/4) Epoch 5, batch 4800, loss[loss=0.2369, simple_loss=0.3311, pruned_loss=0.07133, over 21769.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3253, pruned_loss=0.09483, over 4281225.65 frames. ], batch size: 298, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:11:35,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=760674.0, ans=0.125 2023-06-20 17:11:57,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-20 17:12:43,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=760854.0, ans=0.0 2023-06-20 17:12:44,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=760854.0, ans=0.125 2023-06-20 17:13:15,567 INFO [train.py:996] (3/4) Epoch 5, batch 4850, loss[loss=0.2531, simple_loss=0.3451, pruned_loss=0.08057, over 21712.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3233, pruned_loss=0.09357, over 4281318.71 frames. ], batch size: 441, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:13:41,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=761034.0, ans=0.125 2023-06-20 17:14:00,803 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:14:03,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=761094.0, ans=0.125 2023-06-20 17:14:27,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=761154.0, ans=0.125 2023-06-20 17:14:41,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.733e+02 3.099e+02 3.561e+02 5.577e+02, threshold=6.198e+02, percent-clipped=0.0 2023-06-20 17:14:42,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=761214.0, ans=0.125 2023-06-20 17:14:46,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.03 vs. limit=22.5 2023-06-20 17:14:58,670 INFO [train.py:996] (3/4) Epoch 5, batch 4900, loss[loss=0.2519, simple_loss=0.3426, pruned_loss=0.08064, over 21295.00 frames. 
], tot_loss[loss=0.2556, simple_loss=0.3231, pruned_loss=0.09402, over 4276176.02 frames. ], batch size: 176, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:15:13,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=761334.0, ans=0.125 2023-06-20 17:15:51,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=761394.0, ans=0.0 2023-06-20 17:15:53,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=761394.0, ans=0.0 2023-06-20 17:15:54,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-20 17:16:18,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=761454.0, ans=0.0 2023-06-20 17:16:23,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=761514.0, ans=0.125 2023-06-20 17:16:41,586 INFO [train.py:996] (3/4) Epoch 5, batch 4950, loss[loss=0.2065, simple_loss=0.2831, pruned_loss=0.065, over 21217.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.327, pruned_loss=0.09232, over 4275285.58 frames. ], batch size: 176, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:17:08,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=761634.0, ans=0.0 2023-06-20 17:17:32,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=761694.0, ans=0.1 2023-06-20 17:17:42,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=761694.0, ans=0.2 2023-06-20 17:18:08,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.799e+02 3.225e+02 3.689e+02 6.231e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-20 17:18:22,779 INFO [train.py:996] (3/4) Epoch 5, batch 5000, loss[loss=0.2314, simple_loss=0.3111, pruned_loss=0.07586, over 21801.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3252, pruned_loss=0.08899, over 4270649.68 frames. ], batch size: 282, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:18:24,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=761874.0, ans=0.2 2023-06-20 17:18:33,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-20 17:18:41,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-20 17:18:54,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-20 17:19:08,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=761994.0, ans=0.2 2023-06-20 17:20:03,387 INFO [train.py:996] (3/4) Epoch 5, batch 5050, loss[loss=0.2639, simple_loss=0.3292, pruned_loss=0.09933, over 21925.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3264, pruned_loss=0.09167, over 4278229.65 frames. 
], batch size: 107, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:20:51,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=762294.0, ans=0.0 2023-06-20 17:21:31,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.896e+02 3.588e+02 4.285e+02 7.263e+02, threshold=7.176e+02, percent-clipped=2.0 2023-06-20 17:21:45,590 INFO [train.py:996] (3/4) Epoch 5, batch 5100, loss[loss=0.2104, simple_loss=0.2846, pruned_loss=0.0681, over 21328.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.326, pruned_loss=0.09209, over 4281424.83 frames. ], batch size: 176, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:22:03,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=762534.0, ans=0.09899494936611666 2023-06-20 17:22:32,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=762594.0, ans=10.0 2023-06-20 17:22:35,507 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:22:40,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.76 vs. limit=22.5 2023-06-20 17:23:02,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=762654.0, ans=0.1 2023-06-20 17:23:29,408 INFO [train.py:996] (3/4) Epoch 5, batch 5150, loss[loss=0.2428, simple_loss=0.3052, pruned_loss=0.0902, over 21428.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3237, pruned_loss=0.09264, over 4291122.53 frames. ], batch size: 177, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:23:39,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=762774.0, ans=0.125 2023-06-20 17:24:47,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=762954.0, ans=0.2 2023-06-20 17:24:51,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-20 17:24:57,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.960e+02 3.348e+02 3.858e+02 5.752e+02, threshold=6.696e+02, percent-clipped=0.0 2023-06-20 17:25:13,101 INFO [train.py:996] (3/4) Epoch 5, batch 5200, loss[loss=0.2628, simple_loss=0.352, pruned_loss=0.08679, over 21727.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3266, pruned_loss=0.09294, over 4287593.46 frames. ], batch size: 247, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:25:51,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. 
limit=10.0 2023-06-20 17:25:52,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763134.0, ans=0.1 2023-06-20 17:25:52,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=763134.0, ans=0.2 2023-06-20 17:25:56,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=763194.0, ans=0.0 2023-06-20 17:26:06,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=763194.0, ans=0.07 2023-06-20 17:26:10,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763194.0, ans=0.1 2023-06-20 17:26:45,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=763314.0, ans=0.125 2023-06-20 17:26:54,745 INFO [train.py:996] (3/4) Epoch 5, batch 5250, loss[loss=0.3209, simple_loss=0.3889, pruned_loss=0.1264, over 21485.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3272, pruned_loss=0.0913, over 4278846.98 frames. ], batch size: 471, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:27:35,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=763494.0, ans=0.125 2023-06-20 17:27:42,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763494.0, ans=0.1 2023-06-20 17:27:55,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=763494.0, ans=0.0 2023-06-20 17:28:05,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=763554.0, ans=0.0 2023-06-20 17:28:21,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.952e+02 3.364e+02 4.524e+02 6.907e+02, threshold=6.729e+02, percent-clipped=2.0 2023-06-20 17:28:36,590 INFO [train.py:996] (3/4) Epoch 5, batch 5300, loss[loss=0.2423, simple_loss=0.3096, pruned_loss=0.0875, over 21972.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3277, pruned_loss=0.09272, over 4285314.88 frames. ], batch size: 415, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:28:49,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=763674.0, ans=0.125 2023-06-20 17:28:51,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=763674.0, ans=0.0 2023-06-20 17:29:19,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=763794.0, ans=0.2 2023-06-20 17:29:22,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=763794.0, ans=0.0 2023-06-20 17:30:12,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=763914.0, ans=0.125 2023-06-20 17:30:22,043 INFO [train.py:996] (3/4) Epoch 5, batch 5350, loss[loss=0.2892, simple_loss=0.3457, pruned_loss=0.1163, over 21773.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3286, pruned_loss=0.09516, over 4286050.78 frames. 
], batch size: 389, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:31:26,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=764154.0, ans=0.0 2023-06-20 17:31:44,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.105e+02 3.554e+02 4.280e+02 7.043e+02, threshold=7.109e+02, percent-clipped=1.0 2023-06-20 17:31:44,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=764214.0, ans=0.1 2023-06-20 17:31:59,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-20 17:32:03,835 INFO [train.py:996] (3/4) Epoch 5, batch 5400, loss[loss=0.2365, simple_loss=0.3106, pruned_loss=0.08122, over 21859.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3276, pruned_loss=0.0954, over 4294101.07 frames. ], batch size: 124, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:32:24,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=764334.0, ans=0.125 2023-06-20 17:33:45,601 INFO [train.py:996] (3/4) Epoch 5, batch 5450, loss[loss=0.2007, simple_loss=0.2856, pruned_loss=0.05789, over 21377.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3304, pruned_loss=0.09373, over 4286845.75 frames. ], batch size: 131, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:33:59,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=764574.0, ans=0.125 2023-06-20 17:33:59,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=764574.0, ans=0.125 2023-06-20 17:34:03,139 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:34:21,007 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:34:32,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=764694.0, ans=0.125 2023-06-20 17:34:34,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=764694.0, ans=0.09899494936611666 2023-06-20 17:35:13,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.553e+02 3.012e+02 3.713e+02 8.478e+02, threshold=6.025e+02, percent-clipped=4.0 2023-06-20 17:35:34,655 INFO [train.py:996] (3/4) Epoch 5, batch 5500, loss[loss=0.2117, simple_loss=0.307, pruned_loss=0.05824, over 21384.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3333, pruned_loss=0.08967, over 4291248.31 frames. 
], batch size: 194, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:35:52,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764874.0, ans=0.125 2023-06-20 17:36:26,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=764994.0, ans=0.0 2023-06-20 17:36:39,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=765054.0, ans=0.0 2023-06-20 17:36:54,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-20 17:37:17,249 INFO [train.py:996] (3/4) Epoch 5, batch 5550, loss[loss=0.264, simple_loss=0.3574, pruned_loss=0.08528, over 21478.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3319, pruned_loss=0.08658, over 4283165.50 frames. ], batch size: 471, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:38:48,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.754e+02 3.445e+02 4.644e+02 7.344e+02, threshold=6.889e+02, percent-clipped=6.0 2023-06-20 17:39:13,775 INFO [train.py:996] (3/4) Epoch 5, batch 5600, loss[loss=0.3207, simple_loss=0.4045, pruned_loss=0.1185, over 21785.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3303, pruned_loss=0.08431, over 4278707.88 frames. ], batch size: 351, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:39:21,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=765474.0, ans=0.125 2023-06-20 17:39:23,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-20 17:39:32,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=765534.0, ans=0.125 2023-06-20 17:39:58,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=765594.0, ans=0.125 2023-06-20 17:40:55,602 INFO [train.py:996] (3/4) Epoch 5, batch 5650, loss[loss=0.2505, simple_loss=0.3133, pruned_loss=0.09389, over 21858.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3341, pruned_loss=0.08687, over 4283228.61 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:41:07,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=765774.0, ans=0.0 2023-06-20 17:41:10,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=765834.0, ans=0.07 2023-06-20 17:41:26,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=765834.0, ans=0.125 2023-06-20 17:41:34,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-20 17:41:35,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. 
limit=10.0 2023-06-20 17:41:37,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=765894.0, ans=0.125 2023-06-20 17:41:40,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=765894.0, ans=0.125 2023-06-20 17:41:53,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=765954.0, ans=0.2 2023-06-20 17:42:17,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 3.200e+02 3.767e+02 5.001e+02 8.912e+02, threshold=7.534e+02, percent-clipped=5.0 2023-06-20 17:42:38,878 INFO [train.py:996] (3/4) Epoch 5, batch 5700, loss[loss=0.2543, simple_loss=0.34, pruned_loss=0.08431, over 21799.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3358, pruned_loss=0.09003, over 4287319.17 frames. ], batch size: 371, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:42:42,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=766074.0, ans=0.0 2023-06-20 17:44:01,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=766254.0, ans=0.2 2023-06-20 17:44:06,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-20 17:44:07,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=766314.0, ans=0.0 2023-06-20 17:44:28,596 INFO [train.py:996] (3/4) Epoch 5, batch 5750, loss[loss=0.2784, simple_loss=0.3344, pruned_loss=0.1111, over 19978.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3313, pruned_loss=0.08718, over 4272071.99 frames. ], batch size: 702, lr: 6.46e-03, grad_scale: 16.0 2023-06-20 17:45:30,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-20 17:45:53,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.737e+02 3.307e+02 4.353e+02 7.537e+02, threshold=6.613e+02, percent-clipped=1.0 2023-06-20 17:46:11,476 INFO [train.py:996] (3/4) Epoch 5, batch 5800, loss[loss=0.2843, simple_loss=0.3737, pruned_loss=0.09741, over 20876.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3312, pruned_loss=0.08693, over 4262701.16 frames. ], batch size: 608, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:46:13,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=766674.0, ans=0.125 2023-06-20 17:47:07,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=766794.0, ans=0.125 2023-06-20 17:47:39,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-20 17:47:54,127 INFO [train.py:996] (3/4) Epoch 5, batch 5850, loss[loss=0.2278, simple_loss=0.3322, pruned_loss=0.06167, over 21175.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.327, pruned_loss=0.08215, over 4267289.63 frames. 
], batch size: 548, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:48:53,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767094.0, ans=0.125 2023-06-20 17:49:18,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=767214.0, ans=0.0 2023-06-20 17:49:21,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.196e+02 2.438e+02 2.861e+02 4.189e+02, threshold=4.877e+02, percent-clipped=0.0 2023-06-20 17:49:22,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-20 17:49:34,134 INFO [train.py:996] (3/4) Epoch 5, batch 5900, loss[loss=0.1678, simple_loss=0.2444, pruned_loss=0.04561, over 21456.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3176, pruned_loss=0.07551, over 4273506.97 frames. ], batch size: 211, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:49:39,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=767274.0, ans=0.0 2023-06-20 17:49:39,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-20 17:50:05,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=767334.0, ans=0.0 2023-06-20 17:50:06,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=767334.0, ans=0.125 2023-06-20 17:50:28,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767394.0, ans=0.125 2023-06-20 17:50:56,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=767514.0, ans=0.0 2023-06-20 17:51:14,318 INFO [train.py:996] (3/4) Epoch 5, batch 5950, loss[loss=0.2345, simple_loss=0.2902, pruned_loss=0.08945, over 21785.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3169, pruned_loss=0.07936, over 4278055.89 frames. ], batch size: 283, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:51:43,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=767634.0, ans=0.0 2023-06-20 17:51:44,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-20 17:51:44,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-20 17:52:42,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.088e+02 3.712e+02 4.428e+02 7.411e+02, threshold=7.424e+02, percent-clipped=12.0 2023-06-20 17:52:45,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-20 17:53:01,215 INFO [train.py:996] (3/4) Epoch 5, batch 6000, loss[loss=0.2474, simple_loss=0.3037, pruned_loss=0.09555, over 21572.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3131, pruned_loss=0.08201, over 4283096.86 frames. 
], batch size: 441, lr: 6.45e-03, grad_scale: 32.0 2023-06-20 17:53:01,215 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 17:53:19,510 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2687, simple_loss=0.3621, pruned_loss=0.08766, over 1796401.00 frames. 2023-06-20 17:53:19,511 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 17:53:32,710 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.150e-02 2023-06-20 17:53:37,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=767874.0, ans=0.0 2023-06-20 17:53:48,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=767934.0, ans=0.0 2023-06-20 17:53:54,523 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:54:24,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=768054.0, ans=0.125 2023-06-20 17:55:11,120 INFO [train.py:996] (3/4) Epoch 5, batch 6050, loss[loss=0.1931, simple_loss=0.2578, pruned_loss=0.06416, over 21208.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3084, pruned_loss=0.08358, over 4263686.65 frames. ], batch size: 159, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:55:17,919 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:55:32,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=768234.0, ans=0.125 2023-06-20 17:55:55,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-20 17:55:56,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=768294.0, ans=0.125 2023-06-20 17:56:17,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768354.0, ans=0.1 2023-06-20 17:56:30,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.566e+02 3.006e+02 3.910e+02 6.691e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-20 17:56:37,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=768414.0, ans=0.125 2023-06-20 17:56:46,408 INFO [train.py:996] (3/4) Epoch 5, batch 6100, loss[loss=0.2221, simple_loss=0.311, pruned_loss=0.06656, over 21678.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3063, pruned_loss=0.08132, over 4267378.52 frames. ], batch size: 389, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:58:18,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=768714.0, ans=0.125 2023-06-20 17:58:20,820 INFO [train.py:996] (3/4) Epoch 5, batch 6150, loss[loss=0.2268, simple_loss=0.3044, pruned_loss=0.07459, over 21693.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.311, pruned_loss=0.08427, over 4262174.23 frames. 
], batch size: 332, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:59:51,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.776e+02 3.188e+02 3.842e+02 5.972e+02, threshold=6.377e+02, percent-clipped=0.0 2023-06-20 18:00:08,571 INFO [train.py:996] (3/4) Epoch 5, batch 6200, loss[loss=0.2981, simple_loss=0.3905, pruned_loss=0.1028, over 21678.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.315, pruned_loss=0.08566, over 4272493.94 frames. ], batch size: 414, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:00:20,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=769074.0, ans=0.1 2023-06-20 18:00:30,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=769134.0, ans=15.0 2023-06-20 18:01:05,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=769254.0, ans=0.0 2023-06-20 18:01:21,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-06-20 18:01:47,590 INFO [train.py:996] (3/4) Epoch 5, batch 6250, loss[loss=0.2214, simple_loss=0.3037, pruned_loss=0.06954, over 21452.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3178, pruned_loss=0.08498, over 4266789.91 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:01:50,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-20 18:02:08,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=769434.0, ans=0.1 2023-06-20 18:03:00,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=769554.0, ans=0.2 2023-06-20 18:03:07,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=769614.0, ans=0.125 2023-06-20 18:03:11,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.664e+02 3.120e+02 3.845e+02 7.013e+02, threshold=6.240e+02, percent-clipped=3.0 2023-06-20 18:03:28,145 INFO [train.py:996] (3/4) Epoch 5, batch 6300, loss[loss=0.2281, simple_loss=0.3079, pruned_loss=0.07414, over 21838.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3202, pruned_loss=0.08378, over 4265608.64 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:03:28,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=769674.0, ans=0.125 2023-06-20 18:04:10,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=769794.0, ans=0.125 2023-06-20 18:05:02,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-20 18:05:06,341 INFO [train.py:996] (3/4) Epoch 5, batch 6350, loss[loss=0.2906, simple_loss=0.3483, pruned_loss=0.1165, over 21382.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3239, pruned_loss=0.08879, over 4271327.74 frames. 
], batch size: 176, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:05:08,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=769974.0, ans=0.0 2023-06-20 18:05:11,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=769974.0, ans=0.04949747468305833 2023-06-20 18:06:21,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=770154.0, ans=0.0 2023-06-20 18:06:34,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.998e+02 3.510e+02 4.011e+02 7.678e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-20 18:06:51,034 INFO [train.py:996] (3/4) Epoch 5, batch 6400, loss[loss=0.2946, simple_loss=0.3584, pruned_loss=0.1154, over 21325.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3324, pruned_loss=0.09379, over 4273815.11 frames. ], batch size: 548, lr: 6.44e-03, grad_scale: 32.0 2023-06-20 18:07:54,725 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:07:58,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770454.0, ans=0.1 2023-06-20 18:08:10,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-20 18:08:11,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770514.0, ans=0.1 2023-06-20 18:08:33,228 INFO [train.py:996] (3/4) Epoch 5, batch 6450, loss[loss=0.2418, simple_loss=0.3069, pruned_loss=0.08835, over 21406.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3339, pruned_loss=0.09241, over 4277087.18 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:09:06,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=770634.0, ans=0.125 2023-06-20 18:09:14,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.07 vs. limit=6.0 2023-06-20 18:09:34,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=770694.0, ans=0.125 2023-06-20 18:09:47,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=770754.0, ans=0.0 2023-06-20 18:09:51,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-20 18:09:57,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.07 vs. limit=22.5 2023-06-20 18:10:05,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.819e+02 3.379e+02 4.009e+02 7.496e+02, threshold=6.759e+02, percent-clipped=3.0 2023-06-20 18:10:15,966 INFO [train.py:996] (3/4) Epoch 5, batch 6500, loss[loss=0.2544, simple_loss=0.3259, pruned_loss=0.09144, over 21569.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3253, pruned_loss=0.09052, over 4272938.64 frames. 
], batch size: 414, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:10:44,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-20 18:10:46,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-20 18:11:52,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=771114.0, ans=0.125 2023-06-20 18:12:02,371 INFO [train.py:996] (3/4) Epoch 5, batch 6550, loss[loss=0.2512, simple_loss=0.3071, pruned_loss=0.09768, over 21511.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3248, pruned_loss=0.08979, over 4276696.66 frames. ], batch size: 131, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:12:38,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=771234.0, ans=0.1 2023-06-20 18:13:20,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=771354.0, ans=10.0 2023-06-20 18:13:31,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.772e+02 3.430e+02 4.140e+02 7.576e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 18:13:49,234 INFO [train.py:996] (3/4) Epoch 5, batch 6600, loss[loss=0.2097, simple_loss=0.2703, pruned_loss=0.0745, over 21693.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3191, pruned_loss=0.08891, over 4275706.77 frames. ], batch size: 282, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:14:35,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=771594.0, ans=0.125 2023-06-20 18:15:31,878 INFO [train.py:996] (3/4) Epoch 5, batch 6650, loss[loss=0.24, simple_loss=0.2953, pruned_loss=0.09234, over 21370.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3124, pruned_loss=0.08548, over 4272401.58 frames. ], batch size: 160, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:16:20,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=771894.0, ans=0.125 2023-06-20 18:16:44,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=772014.0, ans=0.125 2023-06-20 18:17:04,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.627e+02 3.141e+02 4.437e+02 8.167e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-20 18:17:04,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=772014.0, ans=0.125 2023-06-20 18:17:12,607 INFO [train.py:996] (3/4) Epoch 5, batch 6700, loss[loss=0.2156, simple_loss=0.2769, pruned_loss=0.07709, over 21857.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3076, pruned_loss=0.08433, over 4271644.70 frames. 
], batch size: 107, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:17:20,444 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:17:25,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=772074.0, ans=0.125 2023-06-20 18:18:12,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=772254.0, ans=0.125 2023-06-20 18:18:45,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=772374.0, ans=0.95 2023-06-20 18:18:52,281 INFO [train.py:996] (3/4) Epoch 5, batch 6750, loss[loss=0.2536, simple_loss=0.3117, pruned_loss=0.09776, over 21870.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3059, pruned_loss=0.08512, over 4272532.04 frames. ], batch size: 316, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:19:11,884 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:19:14,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.76 vs. limit=5.0 2023-06-20 18:19:27,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=772434.0, ans=0.09899494936611666 2023-06-20 18:19:48,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=772554.0, ans=0.0 2023-06-20 18:19:48,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=772554.0, ans=22.5 2023-06-20 18:19:54,493 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:20:15,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 2.981e+02 3.524e+02 4.393e+02 7.808e+02, threshold=7.048e+02, percent-clipped=4.0 2023-06-20 18:20:33,902 INFO [train.py:996] (3/4) Epoch 5, batch 6800, loss[loss=0.2575, simple_loss=0.3108, pruned_loss=0.1021, over 21353.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3079, pruned_loss=0.0875, over 4276218.93 frames. ], batch size: 144, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:20:40,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772674.0, ans=0.1 2023-06-20 18:20:54,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-20 18:21:40,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=772854.0, ans=0.125 2023-06-20 18:22:04,505 INFO [train.py:996] (3/4) Epoch 5, batch 6850, loss[loss=0.2235, simple_loss=0.2831, pruned_loss=0.08192, over 21659.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3064, pruned_loss=0.08921, over 4278384.22 frames. 
], batch size: 230, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:22:21,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=772974.0, ans=0.125 2023-06-20 18:23:13,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=773154.0, ans=0.125 2023-06-20 18:23:15,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=773154.0, ans=0.125 2023-06-20 18:23:18,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=773154.0, ans=0.2 2023-06-20 18:23:25,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=773214.0, ans=0.125 2023-06-20 18:23:25,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773214.0, ans=0.1 2023-06-20 18:23:35,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.843e+02 3.245e+02 3.952e+02 6.473e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-20 18:23:53,368 INFO [train.py:996] (3/4) Epoch 5, batch 6900, loss[loss=0.2136, simple_loss=0.2888, pruned_loss=0.06919, over 21220.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3093, pruned_loss=0.08949, over 4284957.28 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:24:05,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=773274.0, ans=0.125 2023-06-20 18:24:06,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-20 18:24:33,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=773394.0, ans=0.125 2023-06-20 18:24:44,139 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:25:37,564 INFO [train.py:996] (3/4) Epoch 5, batch 6950, loss[loss=0.2802, simple_loss=0.3434, pruned_loss=0.1085, over 21234.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3113, pruned_loss=0.08639, over 4286975.45 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:26:07,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=773634.0, ans=0.0 2023-06-20 18:26:11,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0 2023-06-20 18:26:23,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. 
limit=10.0 2023-06-20 18:26:44,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773754.0, ans=0.1 2023-06-20 18:27:09,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773814.0, ans=0.1 2023-06-20 18:27:11,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.901e+02 2.952e+02 3.293e+02 4.286e+02 8.056e+02, threshold=6.585e+02, percent-clipped=5.0 2023-06-20 18:27:19,071 INFO [train.py:996] (3/4) Epoch 5, batch 7000, loss[loss=0.2726, simple_loss=0.3356, pruned_loss=0.1048, over 21297.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3156, pruned_loss=0.09001, over 4284858.89 frames. ], batch size: 548, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:27:36,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=773874.0, ans=0.125 2023-06-20 18:28:00,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=773994.0, ans=0.0 2023-06-20 18:28:24,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-20 18:28:28,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=774054.0, ans=0.0 2023-06-20 18:29:08,798 INFO [train.py:996] (3/4) Epoch 5, batch 7050, loss[loss=0.2392, simple_loss=0.3205, pruned_loss=0.07901, over 21026.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3128, pruned_loss=0.0886, over 4281251.93 frames. ], batch size: 607, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:29:12,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=774174.0, ans=0.125 2023-06-20 18:29:14,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-20 18:29:18,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-20 18:29:30,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=774234.0, ans=0.025 2023-06-20 18:30:00,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=774294.0, ans=0.125 2023-06-20 18:30:44,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.864e+02 3.382e+02 4.312e+02 8.915e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 18:30:52,665 INFO [train.py:996] (3/4) Epoch 5, batch 7100, loss[loss=0.2303, simple_loss=0.2989, pruned_loss=0.08081, over 21634.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3186, pruned_loss=0.09052, over 4283021.28 frames. 
], batch size: 263, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:31:04,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=774474.0, ans=0.125 2023-06-20 18:31:49,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=774594.0, ans=0.0 2023-06-20 18:32:15,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=774654.0, ans=10.0 2023-06-20 18:32:19,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=774714.0, ans=0.125 2023-06-20 18:32:28,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-20 18:32:33,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=774774.0, ans=0.1 2023-06-20 18:32:34,547 INFO [train.py:996] (3/4) Epoch 5, batch 7150, loss[loss=0.2658, simple_loss=0.3356, pruned_loss=0.09802, over 21359.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3193, pruned_loss=0.08987, over 4277068.28 frames. ], batch size: 549, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:33:02,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=774834.0, ans=0.1 2023-06-20 18:33:07,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=774834.0, ans=0.125 2023-06-20 18:34:08,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.836e+02 3.355e+02 3.927e+02 6.037e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-20 18:34:11,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=775014.0, ans=0.0 2023-06-20 18:34:16,151 INFO [train.py:996] (3/4) Epoch 5, batch 7200, loss[loss=0.252, simple_loss=0.3108, pruned_loss=0.09655, over 21547.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3224, pruned_loss=0.09249, over 4268076.19 frames. ], batch size: 391, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:35:34,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775254.0, ans=0.1 2023-06-20 18:35:50,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=775314.0, ans=0.0 2023-06-20 18:35:58,230 INFO [train.py:996] (3/4) Epoch 5, batch 7250, loss[loss=0.2642, simple_loss=0.3023, pruned_loss=0.1131, over 21367.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3177, pruned_loss=0.09059, over 4270294.28 frames. ], batch size: 509, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:36:07,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-20 18:36:27,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
limit=15.0 2023-06-20 18:36:56,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=775494.0, ans=0.125 2023-06-20 18:37:16,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=775554.0, ans=0.125 2023-06-20 18:37:22,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=775614.0, ans=0.0 2023-06-20 18:37:32,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 2.761e+02 3.389e+02 4.055e+02 6.932e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-20 18:37:40,432 INFO [train.py:996] (3/4) Epoch 5, batch 7300, loss[loss=0.2051, simple_loss=0.2616, pruned_loss=0.07433, over 21493.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3108, pruned_loss=0.08943, over 4264678.18 frames. ], batch size: 230, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:38:36,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=775794.0, ans=0.0 2023-06-20 18:38:40,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=775794.0, ans=0.025 2023-06-20 18:38:58,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=775854.0, ans=0.2 2023-06-20 18:39:25,159 INFO [train.py:996] (3/4) Epoch 5, batch 7350, loss[loss=0.2656, simple_loss=0.3096, pruned_loss=0.1108, over 21423.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3081, pruned_loss=0.08972, over 4259120.85 frames. ], batch size: 476, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:39:35,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.11 vs. limit=15.0 2023-06-20 18:39:39,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=775974.0, ans=0.125 2023-06-20 18:40:20,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=22.5 2023-06-20 18:40:43,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=776154.0, ans=0.1 2023-06-20 18:41:01,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.040e+02 3.793e+02 4.434e+02 6.655e+02, threshold=7.586e+02, percent-clipped=0.0 2023-06-20 18:41:09,276 INFO [train.py:996] (3/4) Epoch 5, batch 7400, loss[loss=0.1848, simple_loss=0.2815, pruned_loss=0.04406, over 19712.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3144, pruned_loss=0.09255, over 4267665.55 frames. 
], batch size: 702, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:41:23,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=776274.0, ans=0.125 2023-06-20 18:41:42,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=776334.0, ans=0.125 2023-06-20 18:41:44,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=776334.0, ans=0.125 2023-06-20 18:41:55,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=776334.0, ans=0.125 2023-06-20 18:42:11,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=776394.0, ans=0.0 2023-06-20 18:42:29,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=776454.0, ans=0.125 2023-06-20 18:42:55,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=776514.0, ans=0.1 2023-06-20 18:42:57,195 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:42:58,403 INFO [train.py:996] (3/4) Epoch 5, batch 7450, loss[loss=0.1982, simple_loss=0.247, pruned_loss=0.07467, over 20816.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3128, pruned_loss=0.09032, over 4273674.38 frames. ], batch size: 609, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:44:17,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=776754.0, ans=0.1 2023-06-20 18:44:25,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=776814.0, ans=0.125 2023-06-20 18:44:34,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 2.956e+02 3.297e+02 4.311e+02 7.109e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-20 18:44:48,779 INFO [train.py:996] (3/4) Epoch 5, batch 7500, loss[loss=0.2534, simple_loss=0.3457, pruned_loss=0.08055, over 21538.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3175, pruned_loss=0.09254, over 4273767.81 frames. ], batch size: 230, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:45:07,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=776874.0, ans=0.1 2023-06-20 18:45:26,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=776934.0, ans=0.0 2023-06-20 18:46:05,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=777054.0, ans=0.125 2023-06-20 18:46:12,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=777114.0, ans=0.1 2023-06-20 18:46:22,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=777114.0, ans=0.125 2023-06-20 18:46:33,773 INFO [train.py:996] (3/4) Epoch 5, batch 7550, loss[loss=0.2508, simple_loss=0.3084, pruned_loss=0.0966, over 21122.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3284, pruned_loss=0.09327, over 4273503.02 frames. 
], batch size: 608, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:47:54,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-20 18:48:01,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.804e+02 3.228e+02 4.067e+02 8.299e+02, threshold=6.455e+02, percent-clipped=1.0 2023-06-20 18:48:14,625 INFO [train.py:996] (3/4) Epoch 5, batch 7600, loss[loss=0.2792, simple_loss=0.3341, pruned_loss=0.1121, over 21305.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3256, pruned_loss=0.09155, over 4277052.19 frames. ], batch size: 159, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:49:07,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=777594.0, ans=0.1 2023-06-20 18:49:23,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=777654.0, ans=0.125 2023-06-20 18:49:33,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=777714.0, ans=0.2 2023-06-20 18:49:53,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=777714.0, ans=0.125 2023-06-20 18:50:01,075 INFO [train.py:996] (3/4) Epoch 5, batch 7650, loss[loss=0.2458, simple_loss=0.3159, pruned_loss=0.08788, over 21824.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3241, pruned_loss=0.09308, over 4276731.46 frames. ], batch size: 112, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:50:12,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-20 18:50:23,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777834.0, ans=0.1 2023-06-20 18:50:53,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777894.0, ans=0.1 2023-06-20 18:51:36,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 3.076e+02 3.625e+02 4.419e+02 8.627e+02, threshold=7.249e+02, percent-clipped=2.0 2023-06-20 18:51:43,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778074.0, ans=0.125 2023-06-20 18:51:44,638 INFO [train.py:996] (3/4) Epoch 5, batch 7700, loss[loss=0.3535, simple_loss=0.422, pruned_loss=0.1424, over 21790.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3268, pruned_loss=0.09638, over 4277233.04 frames. ], batch size: 118, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:52:11,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778134.0, ans=0.1 2023-06-20 18:52:24,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=778194.0, ans=0.2 2023-06-20 18:53:07,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. 
limit=15.0 2023-06-20 18:53:14,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=778314.0, ans=0.125 2023-06-20 18:53:19,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=778314.0, ans=0.125 2023-06-20 18:53:24,636 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:53:28,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-20 18:53:35,123 INFO [train.py:996] (3/4) Epoch 5, batch 7750, loss[loss=0.4321, simple_loss=0.5015, pruned_loss=0.1813, over 21439.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3321, pruned_loss=0.09676, over 4272479.74 frames. ], batch size: 507, lr: 6.41e-03, grad_scale: 16.0 2023-06-20 18:54:01,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-20 18:54:43,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=778554.0, ans=0.125 2023-06-20 18:54:57,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.01 vs. limit=10.0 2023-06-20 18:55:13,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 2.997e+02 3.414e+02 4.185e+02 6.372e+02, threshold=6.827e+02, percent-clipped=0.0 2023-06-20 18:55:19,846 INFO [train.py:996] (3/4) Epoch 5, batch 7800, loss[loss=0.2362, simple_loss=0.3055, pruned_loss=0.08348, over 21647.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3342, pruned_loss=0.0966, over 4265790.34 frames. ], batch size: 263, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:55:57,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=778794.0, ans=0.0 2023-06-20 18:56:34,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.58 vs. limit=22.5 2023-06-20 18:57:02,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-20 18:57:03,281 INFO [train.py:996] (3/4) Epoch 5, batch 7850, loss[loss=0.2415, simple_loss=0.2931, pruned_loss=0.09494, over 22025.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3272, pruned_loss=0.09416, over 4258078.01 frames. ], batch size: 103, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:57:05,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. 
limit=10.0 2023-06-20 18:57:13,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=778974.0, ans=0.0 2023-06-20 18:57:13,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=778974.0, ans=0.125 2023-06-20 18:57:13,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=778974.0, ans=0.125 2023-06-20 18:57:15,108 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:57:40,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=779094.0, ans=0.0 2023-06-20 18:57:59,072 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:58:04,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=779154.0, ans=0.0 2023-06-20 18:58:32,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=779214.0, ans=0.2 2023-06-20 18:58:41,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.807e+02 3.191e+02 4.000e+02 6.084e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-20 18:58:48,766 INFO [train.py:996] (3/4) Epoch 5, batch 7900, loss[loss=0.2739, simple_loss=0.3526, pruned_loss=0.09755, over 21804.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3224, pruned_loss=0.09269, over 4255222.52 frames. ], batch size: 282, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:59:48,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=779454.0, ans=0.1 2023-06-20 19:00:17,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=779514.0, ans=0.125 2023-06-20 19:00:26,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-20 19:00:34,064 INFO [train.py:996] (3/4) Epoch 5, batch 7950, loss[loss=0.3015, simple_loss=0.3685, pruned_loss=0.1172, over 21738.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3261, pruned_loss=0.09253, over 4253729.68 frames. ], batch size: 441, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:01:37,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=779694.0, ans=0.0 2023-06-20 19:02:04,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-20 19:02:08,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 3.095e+02 3.551e+02 4.583e+02 8.567e+02, threshold=7.102e+02, percent-clipped=4.0 2023-06-20 19:02:14,709 INFO [train.py:996] (3/4) Epoch 5, batch 8000, loss[loss=0.3798, simple_loss=0.429, pruned_loss=0.1653, over 21365.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3306, pruned_loss=0.09579, over 4254422.67 frames. 
], batch size: 507, lr: 6.40e-03, grad_scale: 32.0 2023-06-20 19:02:27,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=779874.0, ans=0.0 2023-06-20 19:02:29,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=779874.0, ans=0.2 2023-06-20 19:03:30,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780054.0, ans=0.1 2023-06-20 19:04:02,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:04:04,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-20 19:04:05,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:04:08,262 INFO [train.py:996] (3/4) Epoch 5, batch 8050, loss[loss=0.2316, simple_loss=0.3022, pruned_loss=0.08046, over 21484.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3316, pruned_loss=0.09613, over 4255985.25 frames. ], batch size: 211, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:05:35,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=780414.0, ans=0.125 2023-06-20 19:05:42,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=780414.0, ans=0.07 2023-06-20 19:05:46,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.429e+02 3.996e+02 5.275e+02 1.132e+03, threshold=7.992e+02, percent-clipped=3.0 2023-06-20 19:05:52,173 INFO [train.py:996] (3/4) Epoch 5, batch 8100, loss[loss=0.2564, simple_loss=0.3116, pruned_loss=0.1006, over 21382.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3295, pruned_loss=0.09595, over 4254771.29 frames. ], batch size: 159, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:06:29,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=780534.0, ans=0.2 2023-06-20 19:07:18,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=780654.0, ans=15.0 2023-06-20 19:07:49,535 INFO [train.py:996] (3/4) Epoch 5, batch 8150, loss[loss=0.3571, simple_loss=0.4378, pruned_loss=0.1382, over 21474.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3403, pruned_loss=0.09822, over 4257401.99 frames. 
], batch size: 507, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:08:06,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=780834.0, ans=0.0 2023-06-20 19:08:07,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=780834.0, ans=0.125 2023-06-20 19:08:19,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=780834.0, ans=0.125 2023-06-20 19:08:54,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=780954.0, ans=0.125 2023-06-20 19:09:25,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 19:09:27,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.029e+02 3.506e+02 4.177e+02 7.466e+02, threshold=7.011e+02, percent-clipped=0.0 2023-06-20 19:09:32,491 INFO [train.py:996] (3/4) Epoch 5, batch 8200, loss[loss=0.2474, simple_loss=0.3023, pruned_loss=0.09619, over 21812.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3334, pruned_loss=0.09576, over 4268932.36 frames. ], batch size: 102, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:10:05,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=781194.0, ans=0.125 2023-06-20 19:10:29,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-20 19:11:03,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=781314.0, ans=0.125 2023-06-20 19:11:15,872 INFO [train.py:996] (3/4) Epoch 5, batch 8250, loss[loss=0.2146, simple_loss=0.302, pruned_loss=0.06362, over 21659.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3322, pruned_loss=0.09514, over 4272578.50 frames. ], batch size: 247, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:11:48,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=781434.0, ans=0.125 2023-06-20 19:12:37,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=781554.0, ans=0.2 2023-06-20 19:12:38,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=781554.0, ans=15.0 2023-06-20 19:12:55,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-20 19:12:56,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.737e+02 3.159e+02 4.146e+02 7.904e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-20 19:12:59,855 INFO [train.py:996] (3/4) Epoch 5, batch 8300, loss[loss=0.24, simple_loss=0.3162, pruned_loss=0.08189, over 21656.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3313, pruned_loss=0.09383, over 4274334.71 frames. ], batch size: 230, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:13:03,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-20 19:13:16,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=781734.0, ans=0.1 2023-06-20 19:13:29,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=781734.0, ans=0.125 2023-06-20 19:13:40,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=781734.0, ans=0.025 2023-06-20 19:14:36,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.21 vs. limit=12.0 2023-06-20 19:14:43,839 INFO [train.py:996] (3/4) Epoch 5, batch 8350, loss[loss=0.2245, simple_loss=0.3016, pruned_loss=0.07369, over 21357.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3278, pruned_loss=0.09087, over 4273454.30 frames. ], batch size: 131, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:14:56,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=781974.0, ans=0.125 2023-06-20 19:15:29,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=782094.0, ans=0.0 2023-06-20 19:15:33,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-20 19:16:15,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=782214.0, ans=0.2 2023-06-20 19:16:18,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.671e+02 3.158e+02 4.298e+02 8.367e+02, threshold=6.316e+02, percent-clipped=9.0 2023-06-20 19:16:21,592 INFO [train.py:996] (3/4) Epoch 5, batch 8400, loss[loss=0.2339, simple_loss=0.3119, pruned_loss=0.0779, over 21736.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3242, pruned_loss=0.08766, over 4266109.31 frames. ], batch size: 316, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:16:24,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. limit=22.5 2023-06-20 19:16:34,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-20 19:16:34,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.65 vs. limit=22.5 2023-06-20 19:16:42,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=782334.0, ans=0.2 2023-06-20 19:17:12,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=782394.0, ans=0.125 2023-06-20 19:18:06,484 INFO [train.py:996] (3/4) Epoch 5, batch 8450, loss[loss=0.2255, simple_loss=0.3083, pruned_loss=0.07136, over 21613.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3218, pruned_loss=0.08674, over 4269383.98 frames. 
], batch size: 263, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:18:11,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=782574.0, ans=0.0 2023-06-20 19:18:30,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=782634.0, ans=0.125 2023-06-20 19:19:27,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-20 19:19:44,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782814.0, ans=0.1 2023-06-20 19:19:46,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.787e+02 3.267e+02 4.084e+02 6.258e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-20 19:19:49,896 INFO [train.py:996] (3/4) Epoch 5, batch 8500, loss[loss=0.2584, simple_loss=0.3232, pruned_loss=0.09675, over 21802.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3179, pruned_loss=0.08766, over 4265923.24 frames. ], batch size: 124, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:20:09,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=782934.0, ans=0.125 2023-06-20 19:20:19,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 19:21:35,053 INFO [train.py:996] (3/4) Epoch 5, batch 8550, loss[loss=0.2592, simple_loss=0.3282, pruned_loss=0.09517, over 21110.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3243, pruned_loss=0.09127, over 4268780.00 frames. ], batch size: 143, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:22:02,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=783234.0, ans=0.07 2023-06-20 19:22:26,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-20 19:22:41,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783294.0, ans=0.1 2023-06-20 19:23:04,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783414.0, ans=0.1 2023-06-20 19:23:08,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-06-20 19:23:16,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.988e+02 3.421e+02 4.230e+02 6.048e+02, threshold=6.842e+02, percent-clipped=0.0 2023-06-20 19:23:18,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=783474.0, ans=0.5 2023-06-20 19:23:19,460 INFO [train.py:996] (3/4) Epoch 5, batch 8600, loss[loss=0.3181, simple_loss=0.3854, pruned_loss=0.1254, over 21523.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3309, pruned_loss=0.09344, over 4268118.22 frames. 
], batch size: 414, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:24:18,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=783594.0, ans=0.2 2023-06-20 19:24:19,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=783594.0, ans=0.125 2023-06-20 19:24:34,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=783654.0, ans=0.0 2023-06-20 19:24:45,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783714.0, ans=0.1 2023-06-20 19:25:14,162 INFO [train.py:996] (3/4) Epoch 5, batch 8650, loss[loss=0.2266, simple_loss=0.3192, pruned_loss=0.06695, over 21638.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3384, pruned_loss=0.09517, over 4270033.44 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:25:52,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=15.0 2023-06-20 19:25:57,919 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:26:48,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.897e+02 3.428e+02 4.241e+02 7.600e+02, threshold=6.856e+02, percent-clipped=1.0 2023-06-20 19:26:51,748 INFO [train.py:996] (3/4) Epoch 5, batch 8700, loss[loss=0.2086, simple_loss=0.2979, pruned_loss=0.05963, over 21656.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3299, pruned_loss=0.09166, over 4274457.85 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:27:22,894 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:27:48,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=784194.0, ans=0.07 2023-06-20 19:28:34,530 INFO [train.py:996] (3/4) Epoch 5, batch 8750, loss[loss=0.2542, simple_loss=0.3093, pruned_loss=0.0995, over 21631.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3259, pruned_loss=0.09168, over 4276151.70 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:28:34,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=784374.0, ans=0.125 2023-06-20 19:29:30,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=784494.0, ans=0.1 2023-06-20 19:29:36,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=784554.0, ans=0.125 2023-06-20 19:29:38,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=784554.0, ans=0.0 2023-06-20 19:30:01,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=784614.0, ans=0.125 2023-06-20 19:30:14,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.095e+02 3.629e+02 5.257e+02 8.550e+02, threshold=7.257e+02, percent-clipped=6.0 2023-06-20 19:30:18,134 INFO [train.py:996] (3/4) Epoch 5, batch 8800, loss[loss=0.3239, simple_loss=0.3932, pruned_loss=0.1273, over 21765.00 frames. 
], tot_loss[loss=0.2633, simple_loss=0.3348, pruned_loss=0.09595, over 4283098.86 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 32.0 2023-06-20 19:30:28,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=784674.0, ans=0.125 2023-06-20 19:30:38,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=784734.0, ans=0.025 2023-06-20 19:31:09,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=784794.0, ans=0.1 2023-06-20 19:31:15,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=784794.0, ans=0.0 2023-06-20 19:32:01,524 INFO [train.py:996] (3/4) Epoch 5, batch 8850, loss[loss=0.2788, simple_loss=0.3502, pruned_loss=0.1038, over 21663.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.343, pruned_loss=0.09799, over 4279689.44 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:32:37,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=785034.0, ans=0.125 2023-06-20 19:32:49,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=785094.0, ans=0.0 2023-06-20 19:32:59,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=785094.0, ans=0.025 2023-06-20 19:33:02,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=785154.0, ans=0.0 2023-06-20 19:33:28,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785214.0, ans=0.1 2023-06-20 19:33:32,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=785214.0, ans=0.125 2023-06-20 19:33:34,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-20 19:33:45,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.064e+02 3.643e+02 4.711e+02 6.430e+02, threshold=7.286e+02, percent-clipped=0.0 2023-06-20 19:33:47,238 INFO [train.py:996] (3/4) Epoch 5, batch 8900, loss[loss=0.2484, simple_loss=0.3, pruned_loss=0.09838, over 21377.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3383, pruned_loss=0.09657, over 4279153.17 frames. ], batch size: 194, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:33:54,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=785274.0, ans=0.0 2023-06-20 19:34:19,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785334.0, ans=0.1 2023-06-20 19:34:21,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. 
limit=15.0 2023-06-20 19:34:44,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=785394.0, ans=0.125 2023-06-20 19:35:16,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785514.0, ans=0.1 2023-06-20 19:35:35,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=785514.0, ans=0.0 2023-06-20 19:35:36,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=785574.0, ans=0.125 2023-06-20 19:35:37,863 INFO [train.py:996] (3/4) Epoch 5, batch 8950, loss[loss=0.2826, simple_loss=0.3858, pruned_loss=0.08969, over 20838.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3382, pruned_loss=0.09508, over 4272918.83 frames. ], batch size: 608, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:35:38,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=785574.0, ans=0.0 2023-06-20 19:35:38,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=785574.0, ans=0.125 2023-06-20 19:37:10,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785814.0, ans=0.1 2023-06-20 19:37:18,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.139e+02 3.697e+02 4.422e+02 7.989e+02, threshold=7.395e+02, percent-clipped=1.0 2023-06-20 19:37:19,876 INFO [train.py:996] (3/4) Epoch 5, batch 9000, loss[loss=0.1956, simple_loss=0.2588, pruned_loss=0.06622, over 21356.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3308, pruned_loss=0.09415, over 4278123.52 frames. ], batch size: 131, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:37:19,876 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 19:37:36,468 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2656, simple_loss=0.3627, pruned_loss=0.0843, over 1796401.00 frames. 2023-06-20 19:37:36,468 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 19:37:40,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=785874.0, ans=0.125 2023-06-20 19:38:05,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=785934.0, ans=0.04949747468305833 2023-06-20 19:38:42,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=786054.0, ans=0.125 2023-06-20 19:39:20,730 INFO [train.py:996] (3/4) Epoch 5, batch 9050, loss[loss=0.3204, simple_loss=0.39, pruned_loss=0.1254, over 21434.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3252, pruned_loss=0.08973, over 4276384.77 frames. 
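Most of the scaling.py:182 entries above record a ScheduledFloat: a hyperparameter (dropout probability, skip rate, balancer probability, bypass scale, ...) whose logged value ans is a function of batch_count rather than a constant. A plausible minimal reimplementation, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with the endpoints held constant outside the schedule; the breakpoints in the example are invented, since the run's actual schedules are not printed in the log:

    class ScheduledFloat:
        """A float hyperparameter scheduled on batch_count.

        Defined by breakpoints [(batch_count_0, value_0), ...]; values are
        linearly interpolated between breakpoints and clamped at the ends.
        """

        def __init__(self, *points):
            self.points = sorted(points)
            self.batch_count = 0.0

        def __float__(self):
            pts = self.points
            if self.batch_count <= pts[0][0]:
                return float(pts[0][1])
            if self.batch_count >= pts[-1][0]:
                return float(pts[-1][1])
            for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
                if x0 <= self.batch_count <= x1:
                    t = (self.batch_count - x0) / (x1 - x0)
                    return float(y0 + t * (y1 - y0))

    # Example producing a line in the same format as the log:
    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    dropout_p.batch_count = 783714.0  # far past the last breakpoint -> 0.1
    print(f"ScheduledFloat: name=...out_proj.dropout_p, "
          f"batch_count={dropout_p.batch_count}, ans={float(dropout_p)}")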
], batch size: 131, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:39:24,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=786174.0, ans=0.125 2023-06-20 19:39:25,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=786174.0, ans=0.0 2023-06-20 19:39:58,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=786294.0, ans=0.0 2023-06-20 19:40:58,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.876e+02 3.189e+02 3.875e+02 6.604e+02, threshold=6.378e+02, percent-clipped=0.0 2023-06-20 19:41:00,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-20 19:41:00,726 INFO [train.py:996] (3/4) Epoch 5, batch 9100, loss[loss=0.2251, simple_loss=0.317, pruned_loss=0.06657, over 21194.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3318, pruned_loss=0.09195, over 4273687.62 frames. ], batch size: 143, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:41:02,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=786474.0, ans=0.0 2023-06-20 19:41:12,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=786474.0, ans=0.125 2023-06-20 19:42:21,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 19:42:45,916 INFO [train.py:996] (3/4) Epoch 5, batch 9150, loss[loss=0.2383, simple_loss=0.3224, pruned_loss=0.07707, over 21722.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3318, pruned_loss=0.08913, over 4276908.72 frames. ], batch size: 247, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:43:55,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=786954.0, ans=0.125 2023-06-20 19:44:18,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-20 19:44:38,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.853e+02 3.445e+02 4.216e+02 8.485e+02, threshold=6.890e+02, percent-clipped=4.0 2023-06-20 19:44:39,960 INFO [train.py:996] (3/4) Epoch 5, batch 9200, loss[loss=0.2806, simple_loss=0.3528, pruned_loss=0.1042, over 21695.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3332, pruned_loss=0.08829, over 4278271.58 frames. 
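The per-batch loss entries decompose into a simple (trivial-joiner) RNN-T loss and a pruned RNN-T loss, and throughout this section the reported total satisfies loss = 0.5 * simple_loss + pruned_loss: for batch 9100 above, 0.5 * 0.3318 + 0.09195 = 0.2579, matching the logged loss=0.2578 up to rounding. The 0.5 is the run's simple_loss_scale; early in training icefall ramps this weight, but by this point it is constant. Only the combination is sketched below; the two component losses are assumed to come from k2's smoothed and pruned RNN-T loss stages and are not recomputed here:

    SIMPLE_LOSS_SCALE = 0.5  # the configured simple_loss_scale

    def combine_losses(simple_loss: float, pruned_loss: float) -> float:
        # Matches the logged relationship loss = 0.5*simple + pruned.
        return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

    # Batch 9100 from the log:
    print(combine_losses(0.3318, 0.09195))  # -> 0.25785, logged as 0.2578

The down-weighted simple loss acts as a regularizer and supplies the pruning bounds; the pruned loss over the restricted lattice carries most of the training signal.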
], batch size: 298, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:44:42,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=787074.0, ans=0.2 2023-06-20 19:44:44,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=787074.0, ans=0.125 2023-06-20 19:45:17,532 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:45:20,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=787134.0, ans=0.0 2023-06-20 19:45:20,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=787134.0, ans=0.2 2023-06-20 19:45:22,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=787194.0, ans=0.125 2023-06-20 19:46:23,469 INFO [train.py:996] (3/4) Epoch 5, batch 9250, loss[loss=0.2285, simple_loss=0.308, pruned_loss=0.07449, over 16079.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.337, pruned_loss=0.09262, over 4271152.80 frames. ], batch size: 60, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:46:43,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-20 19:46:57,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=787434.0, ans=0.2 2023-06-20 19:47:50,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=787614.0, ans=0.2 2023-06-20 19:48:05,892 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.152e+02 3.844e+02 5.008e+02 7.995e+02, threshold=7.688e+02, percent-clipped=5.0 2023-06-20 19:48:12,327 INFO [train.py:996] (3/4) Epoch 5, batch 9300, loss[loss=0.2212, simple_loss=0.2861, pruned_loss=0.07814, over 21773.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3319, pruned_loss=0.09245, over 4267764.67 frames. ], batch size: 112, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:48:31,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=787734.0, ans=0.0 2023-06-20 19:48:42,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=787734.0, ans=0.0 2023-06-20 19:48:45,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=787734.0, ans=0.125 2023-06-20 19:49:22,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-20 19:49:57,397 INFO [train.py:996] (3/4) Epoch 5, batch 9350, loss[loss=0.2907, simple_loss=0.3579, pruned_loss=0.1117, over 21285.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3401, pruned_loss=0.09407, over 4266900.28 frames. ], batch size: 143, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:50:26,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=788034.0, ans=0.0 2023-06-20 19:50:28,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-20 19:50:56,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=788154.0, ans=0.125 2023-06-20 19:50:58,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=788154.0, ans=0.05 2023-06-20 19:50:59,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=788154.0, ans=0.09899494936611666 2023-06-20 19:51:32,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=788214.0, ans=0.125 2023-06-20 19:51:41,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.821e+02 3.195e+02 3.693e+02 6.028e+02, threshold=6.390e+02, percent-clipped=0.0 2023-06-20 19:51:41,600 INFO [train.py:996] (3/4) Epoch 5, batch 9400, loss[loss=0.2155, simple_loss=0.2784, pruned_loss=0.07634, over 21560.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3391, pruned_loss=0.09487, over 4270412.28 frames. ], batch size: 263, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:51:50,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=788274.0, ans=0.125 2023-06-20 19:51:50,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788274.0, ans=0.1 2023-06-20 19:52:31,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=788394.0, ans=0.125 2023-06-20 19:53:19,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=788514.0, ans=0.125 2023-06-20 19:53:31,428 INFO [train.py:996] (3/4) Epoch 5, batch 9450, loss[loss=0.1887, simple_loss=0.2499, pruned_loss=0.06375, over 21253.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3316, pruned_loss=0.09367, over 4260164.57 frames. ], batch size: 549, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:53:33,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=788574.0, ans=0.125 2023-06-20 19:54:01,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=788634.0, ans=0.125 2023-06-20 19:54:35,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=788754.0, ans=0.0 2023-06-20 19:55:15,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.307e+02 4.000e+02 4.833e+02 8.255e+02, threshold=8.000e+02, percent-clipped=8.0 2023-06-20 19:55:15,230 INFO [train.py:996] (3/4) Epoch 5, batch 9500, loss[loss=0.2668, simple_loss=0.3366, pruned_loss=0.09846, over 21464.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3243, pruned_loss=0.09182, over 4263472.06 frames. 
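The scaling.py:962 Whitening entries compare a per-module statistic against a (scheduled) limit such as 6.0, 15.0, or 22.5: the metric summarizes how unevenly energy is spread across feature directions inside each channel group, and large values flag activations whose covariance is far from a multiple of the identity. As a rough sketch of such a metric, assuming it is the ratio of the mean squared eigenvalue of the per-group covariance to the squared mean eigenvalue (exactly 1.0 for perfectly white features, larger when a few directions dominate); the real scaling.py computation may differ in details:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        """x: (num_frames, num_channels). Returns >= 1.0; equals 1.0 iff the
        covariance within each channel group is a multiple of the identity."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        c = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, c).transpose(0, 1)  # (g, N, c)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / num_frames     # (g, c, c)
        # Trace identities: sum(lambda_i^2) = ||cov||_F^2,
        #                   sum(lambda_i)   = trace(cov).
        mean_sq_eig = (cov ** 2).sum(dim=(1, 2)) / c
        sq_mean_eig = cov.diagonal(dim1=1, dim2=2).mean(dim=1) ** 2
        return (mean_sq_eig / sq_mean_eig).mean().item()

    # White noise should score close to 1.0, well under the logged limits:
    print(whitening_metric(torch.randn(10000, 256), num_groups=1))

Entries like metric=2.25 vs. limit=6.0 for whiten_keys then read as "this module's attention keys are still comfortably white".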
], batch size: 471, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:56:30,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=789054.0, ans=0.125 2023-06-20 19:56:40,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=789114.0, ans=0.125 2023-06-20 19:56:58,662 INFO [train.py:996] (3/4) Epoch 5, batch 9550, loss[loss=0.2716, simple_loss=0.3574, pruned_loss=0.09295, over 21741.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3269, pruned_loss=0.0937, over 4262299.14 frames. ], batch size: 298, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:57:16,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=789234.0, ans=0.1 2023-06-20 19:57:20,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-20 19:57:45,273 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:57:50,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=789294.0, ans=0.125 2023-06-20 19:58:10,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=789354.0, ans=0.125 2023-06-20 19:58:15,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-20 19:58:42,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.917e+02 3.406e+02 3.917e+02 8.592e+02, threshold=6.812e+02, percent-clipped=1.0 2023-06-20 19:58:42,226 INFO [train.py:996] (3/4) Epoch 5, batch 9600, loss[loss=0.2481, simple_loss=0.318, pruned_loss=0.08913, over 21867.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3301, pruned_loss=0.0962, over 4274304.99 frames. ], batch size: 391, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 19:59:04,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-20 19:59:29,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=789594.0, ans=0.5 2023-06-20 19:59:54,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=789654.0, ans=0.125 2023-06-20 20:00:01,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=789654.0, ans=0.0 2023-06-20 20:00:27,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=789774.0, ans=0.125 2023-06-20 20:00:28,514 INFO [train.py:996] (3/4) Epoch 5, batch 9650, loss[loss=0.3041, simple_loss=0.3665, pruned_loss=0.1209, over 21769.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3288, pruned_loss=0.09543, over 4272679.47 frames. 
], batch size: 441, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:01:39,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=789954.0, ans=0.125 2023-06-20 20:01:42,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=789954.0, ans=0.125 2023-06-20 20:02:13,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.873e+02 3.424e+02 4.153e+02 6.817e+02, threshold=6.847e+02, percent-clipped=1.0 2023-06-20 20:02:13,509 INFO [train.py:996] (3/4) Epoch 5, batch 9700, loss[loss=0.2156, simple_loss=0.2905, pruned_loss=0.07031, over 21425.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3308, pruned_loss=0.0945, over 4273811.24 frames. ], batch size: 211, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:02:30,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=790074.0, ans=0.0 2023-06-20 20:03:00,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=790194.0, ans=0.125 2023-06-20 20:03:56,223 INFO [train.py:996] (3/4) Epoch 5, batch 9750, loss[loss=0.2534, simple_loss=0.3208, pruned_loss=0.09296, over 15109.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3248, pruned_loss=0.09349, over 4272071.58 frames. ], batch size: 60, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:04:04,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=790374.0, ans=10.0 2023-06-20 20:04:30,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-20 20:04:49,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=790494.0, ans=0.125 2023-06-20 20:04:54,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=790554.0, ans=0.0 2023-06-20 20:04:57,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=790554.0, ans=0.05 2023-06-20 20:05:34,113 INFO [train.py:996] (3/4) Epoch 5, batch 9800, loss[loss=0.2755, simple_loss=0.3268, pruned_loss=0.1121, over 21292.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3253, pruned_loss=0.09374, over 4279287.16 frames. ], batch size: 143, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:05:35,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.031e+02 3.637e+02 4.540e+02 9.363e+02, threshold=7.274e+02, percent-clipped=7.0 2023-06-20 20:05:36,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=790674.0, ans=0.125 2023-06-20 20:05:39,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=790674.0, ans=0.125 2023-06-20 20:06:02,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. 
limit=15.0 2023-06-20 20:06:30,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=790794.0, ans=0.125 2023-06-20 20:06:33,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=790854.0, ans=0.035 2023-06-20 20:06:43,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=790854.0, ans=0.125 2023-06-20 20:07:12,176 INFO [train.py:996] (3/4) Epoch 5, batch 9850, loss[loss=0.2198, simple_loss=0.2691, pruned_loss=0.08523, over 21261.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.323, pruned_loss=0.09371, over 4260946.80 frames. ], batch size: 160, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:07:27,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=791034.0, ans=0.07 2023-06-20 20:07:40,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=791034.0, ans=0.2 2023-06-20 20:08:51,601 INFO [train.py:996] (3/4) Epoch 5, batch 9900, loss[loss=0.3086, simple_loss=0.3706, pruned_loss=0.1233, over 21798.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3196, pruned_loss=0.09388, over 4265879.82 frames. ], batch size: 118, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:08:53,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.770e+02 3.187e+02 3.744e+02 7.656e+02, threshold=6.375e+02, percent-clipped=1.0 2023-06-20 20:08:55,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791274.0, ans=0.1 2023-06-20 20:09:37,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=791394.0, ans=0.125 2023-06-20 20:09:52,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=791454.0, ans=0.125 2023-06-20 20:10:10,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=791514.0, ans=0.125 2023-06-20 20:10:29,341 INFO [train.py:996] (3/4) Epoch 5, batch 9950, loss[loss=0.2508, simple_loss=0.3128, pruned_loss=0.09445, over 21641.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3222, pruned_loss=0.09585, over 4269154.62 frames. 
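The learning rate in the per-batch entries decays smoothly, from 6.39e-03 at the top of this section to 6.30e-03 by batch 12000, as a joint function of the global batch index and the (fractional) epoch. A sketch consistent with icefall's Eden-style schedule, assuming lr = base_lr * ((b^2 + B^2)/B^2)^(-1/4) * ((e^2 + E^2)/E^2)^(-1/4) with the run's configured base_lr=0.045, B=lr_batches=7500, E=lr_epochs=1.5; the global batch index behind these lines is not printed, so the inputs below are illustrative:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
        """Eden-style schedule: roughly flat early, ~(b*e)^-0.5 decay late."""
        batch_factor = ((batch ** 2 + lr_batches ** 2)
                        / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2)
                        / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # Illustrative point somewhere in epoch 5 of this run; lands in the same
    # 6e-03 range as the logged values:
    print(f"lr: {eden_lr(0.045, batch=118_000, epoch=4.6):.2e}")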
], batch size: 332, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:10:46,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=791574.0, ans=0.05 2023-06-20 20:10:54,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791634.0, ans=0.1 2023-06-20 20:10:57,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=791634.0, ans=0.125 2023-06-20 20:10:59,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=791634.0, ans=0.125 2023-06-20 20:11:33,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=791754.0, ans=0.0 2023-06-20 20:11:37,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=791754.0, ans=0.125 2023-06-20 20:12:13,162 INFO [train.py:996] (3/4) Epoch 5, batch 10000, loss[loss=0.2441, simple_loss=0.3166, pruned_loss=0.08583, over 21559.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3178, pruned_loss=0.09442, over 4262554.87 frames. ], batch size: 414, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:12:13,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791874.0, ans=0.1 2023-06-20 20:12:14,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.860e+02 3.256e+02 3.834e+02 6.756e+02, threshold=6.512e+02, percent-clipped=1.0 2023-06-20 20:12:25,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-20 20:12:36,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=791934.0, ans=0.125 2023-06-20 20:13:13,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=791994.0, ans=0.0 2023-06-20 20:13:46,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=792114.0, ans=0.125 2023-06-20 20:14:04,466 INFO [train.py:996] (3/4) Epoch 5, batch 10050, loss[loss=0.2804, simple_loss=0.3347, pruned_loss=0.1131, over 21519.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3203, pruned_loss=0.09507, over 4262241.80 frames. ], batch size: 441, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:14:06,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=12.0 2023-06-20 20:14:16,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792174.0, ans=0.1 2023-06-20 20:14:42,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=792234.0, ans=0.05 2023-06-20 20:15:43,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=792414.0, ans=0.0 2023-06-20 20:15:44,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2023-06-20 20:15:48,182 INFO [train.py:996] (3/4) Epoch 5, batch 10100, loss[loss=0.1903, simple_loss=0.2747, pruned_loss=0.05288, over 20786.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3145, pruned_loss=0.09193, over 4254364.85 frames. ], batch size: 608, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:15:49,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.949e+02 3.459e+02 3.930e+02 6.580e+02, threshold=6.918e+02, percent-clipped=2.0 2023-06-20 20:16:32,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=792594.0, ans=0.0 2023-06-20 20:16:41,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792594.0, ans=0.1 2023-06-20 20:16:43,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=792594.0, ans=0.0 2023-06-20 20:17:23,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=792714.0, ans=0.2 2023-06-20 20:17:30,450 INFO [train.py:996] (3/4) Epoch 5, batch 10150, loss[loss=0.2518, simple_loss=0.3001, pruned_loss=0.1018, over 21830.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3209, pruned_loss=0.09461, over 4264073.68 frames. ], batch size: 98, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:17:33,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-20 20:17:44,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2023-06-20 20:18:08,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=792834.0, ans=0.0 2023-06-20 20:19:19,857 INFO [train.py:996] (3/4) Epoch 5, batch 10200, loss[loss=0.2657, simple_loss=0.3459, pruned_loss=0.09276, over 21550.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3211, pruned_loss=0.0926, over 4257174.74 frames. ], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:19:21,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.868e+02 3.276e+02 4.054e+02 7.472e+02, threshold=6.552e+02, percent-clipped=1.0 2023-06-20 20:19:33,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793074.0, ans=0.1 2023-06-20 20:20:11,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=793194.0, ans=0.2 2023-06-20 20:20:16,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=793194.0, ans=0.125 2023-06-20 20:21:02,673 INFO [train.py:996] (3/4) Epoch 5, batch 10250, loss[loss=0.2255, simple_loss=0.3053, pruned_loss=0.07287, over 21794.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3145, pruned_loss=0.08554, over 4261787.50 frames. 
], batch size: 282, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:21:06,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=793374.0, ans=0.5 2023-06-20 20:21:16,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=793374.0, ans=0.125 2023-06-20 20:21:29,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 20:21:30,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=793434.0, ans=0.0 2023-06-20 20:22:26,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=793614.0, ans=0.125 2023-06-20 20:22:43,310 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:22:52,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=793674.0, ans=0.125 2023-06-20 20:22:53,279 INFO [train.py:996] (3/4) Epoch 5, batch 10300, loss[loss=0.2779, simple_loss=0.3454, pruned_loss=0.1052, over 21491.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3186, pruned_loss=0.08687, over 4263904.25 frames. ], batch size: 194, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:22:54,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.621e+02 3.179e+02 4.486e+02 7.082e+02, threshold=6.359e+02, percent-clipped=5.0 2023-06-20 20:23:25,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=793734.0, ans=0.07 2023-06-20 20:23:32,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=793794.0, ans=0.2 2023-06-20 20:24:11,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=793854.0, ans=0.0 2023-06-20 20:24:27,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=793914.0, ans=0.125 2023-06-20 20:24:35,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=793974.0, ans=0.125 2023-06-20 20:24:35,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=793974.0, ans=0.07 2023-06-20 20:24:37,239 INFO [train.py:996] (3/4) Epoch 5, batch 10350, loss[loss=0.1962, simple_loss=0.2459, pruned_loss=0.07327, over 21188.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3206, pruned_loss=0.08765, over 4261894.01 frames. ], batch size: 143, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:24:55,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-20 20:24:56,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. 
limit=15.0 2023-06-20 20:24:57,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794034.0, ans=0.125 2023-06-20 20:26:22,033 INFO [train.py:996] (3/4) Epoch 5, batch 10400, loss[loss=0.2718, simple_loss=0.3437, pruned_loss=0.09992, over 21693.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3138, pruned_loss=0.0862, over 4265501.04 frames. ], batch size: 415, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:26:23,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.112e+02 3.947e+02 5.007e+02 1.010e+03, threshold=7.895e+02, percent-clipped=9.0 2023-06-20 20:26:49,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=794334.0, ans=0.0 2023-06-20 20:27:11,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=794394.0, ans=0.04949747468305833 2023-06-20 20:27:13,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=794394.0, ans=0.125 2023-06-20 20:27:48,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=794454.0, ans=0.2 2023-06-20 20:27:52,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-20 20:28:05,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=794514.0, ans=0.015 2023-06-20 20:28:13,675 INFO [train.py:996] (3/4) Epoch 5, batch 10450, loss[loss=0.3215, simple_loss=0.3772, pruned_loss=0.1329, over 21404.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3189, pruned_loss=0.08972, over 4269857.90 frames. ], batch size: 549, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:28:41,254 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:29:51,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=794814.0, ans=0.0 2023-06-20 20:29:57,592 INFO [train.py:996] (3/4) Epoch 5, batch 10500, loss[loss=0.2615, simple_loss=0.3166, pruned_loss=0.1032, over 21471.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3201, pruned_loss=0.08937, over 4262485.73 frames. ], batch size: 441, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:29:59,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.789e+02 3.342e+02 3.915e+02 9.640e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-20 20:30:28,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-20 20:30:47,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=12.0 2023-06-20 20:31:41,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=795174.0, ans=0.125 2023-06-20 20:31:42,454 INFO [train.py:996] (3/4) Epoch 5, batch 10550, loss[loss=0.2531, simple_loss=0.3275, pruned_loss=0.08936, over 20736.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3155, pruned_loss=0.08877, over 4251560.64 frames. 
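Note that tot_loss in these entries is not a single batch's loss: it aggregates over millions of frames (here ~4.27M), whereas each loss[...] figure covers one batch of a few tens of thousands of frames. A minimal sketch of a frame-weighted tracker of this kind, assuming each batch contributes its losses weighted by its frame count; the exact windowing/reset policy is an assumption, since the log only shows the totals:

    class LossTracker:
        """Frame-weighted running average of training losses."""

        def __init__(self):
            self.frames = 0.0
            self.weighted = {"loss": 0.0, "simple_loss": 0.0,
                             "pruned_loss": 0.0}

        def update(self, num_frames: float, **losses: float) -> None:
            self.frames += num_frames
            for name, value in losses.items():
                self.weighted[name] += value * num_frames

        def summary(self) -> str:
            avg = {k: v / self.frames for k, v in self.weighted.items()}
            return (f"tot_loss[loss={avg['loss']:.4g}, "
                    f"simple_loss={avg['simple_loss']:.4g}, "
                    f"pruned_loss={avg['pruned_loss']:.4g}, "
                    f"over {self.frames:.2f} frames. ]")

    tracker = LossTracker()
    # Batch 10400 from the log:
    tracker.update(21693.0, loss=0.2718, simple_loss=0.3437,
                   pruned_loss=0.09992)
    print(tracker.summary())

Weighting by frames rather than by batches keeps the average from being skewed by the highly variable batch sizes (98 to 608 in this section) produced by duration bucketing.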
], batch size: 607, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:31:55,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=795174.0, ans=0.04949747468305833 2023-06-20 20:32:13,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=795234.0, ans=0.125 2023-06-20 20:32:31,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=795294.0, ans=0.125 2023-06-20 20:32:37,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=795294.0, ans=0.125 2023-06-20 20:33:18,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=795414.0, ans=0.0 2023-06-20 20:33:18,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=795414.0, ans=0.0 2023-06-20 20:33:20,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=15.0 2023-06-20 20:33:26,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=795474.0, ans=0.125 2023-06-20 20:33:28,201 INFO [train.py:996] (3/4) Epoch 5, batch 10600, loss[loss=0.2242, simple_loss=0.2999, pruned_loss=0.0743, over 21670.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3114, pruned_loss=0.08787, over 4242148.77 frames. ], batch size: 332, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:33:29,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.099e+02 3.993e+02 4.741e+02 9.586e+02, threshold=7.985e+02, percent-clipped=4.0 2023-06-20 20:33:33,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=795474.0, ans=0.2 2023-06-20 20:33:43,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795474.0, ans=0.1 2023-06-20 20:34:00,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=795534.0, ans=0.07 2023-06-20 20:34:14,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.80 vs. limit=15.0 2023-06-20 20:34:28,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=795594.0, ans=0.2 2023-06-20 20:35:15,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=795714.0, ans=0.0 2023-06-20 20:35:17,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 20:35:23,415 INFO [train.py:996] (3/4) Epoch 5, batch 10650, loss[loss=0.1968, simple_loss=0.2651, pruned_loss=0.06422, over 21406.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3141, pruned_loss=0.08638, over 4249921.27 frames. ], batch size: 211, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:35:51,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-20 20:36:26,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-20 20:36:29,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-20 20:36:56,210 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:37:06,857 INFO [train.py:996] (3/4) Epoch 5, batch 10700, loss[loss=0.3212, simple_loss=0.3767, pruned_loss=0.1329, over 21514.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3126, pruned_loss=0.08675, over 4253463.68 frames. ], batch size: 131, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:37:08,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.769e+02 3.121e+02 4.008e+02 5.487e+02, threshold=6.241e+02, percent-clipped=0.0 2023-06-20 20:37:12,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=796074.0, ans=0.125 2023-06-20 20:37:50,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=796194.0, ans=0.125 2023-06-20 20:37:52,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=796194.0, ans=0.125 2023-06-20 20:38:13,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=796254.0, ans=0.125 2023-06-20 20:38:53,060 INFO [train.py:996] (3/4) Epoch 5, batch 10750, loss[loss=0.2834, simple_loss=0.3632, pruned_loss=0.1019, over 21602.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3226, pruned_loss=0.09115, over 4261458.31 frames. ], batch size: 263, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:39:12,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=796374.0, ans=0.125 2023-06-20 20:39:20,696 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:39:22,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-20 20:40:07,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=796554.0, ans=0.0 2023-06-20 20:40:10,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=796554.0, ans=0.2 2023-06-20 20:40:11,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.97 vs. limit=10.0 2023-06-20 20:40:17,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=796614.0, ans=0.05 2023-06-20 20:40:44,423 INFO [train.py:996] (3/4) Epoch 5, batch 10800, loss[loss=0.3151, simple_loss=0.3723, pruned_loss=0.129, over 21570.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3282, pruned_loss=0.0919, over 4260640.15 frames. 
], batch size: 389, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:40:47,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.089e+02 3.673e+02 4.196e+02 7.308e+02, threshold=7.346e+02, percent-clipped=3.0 2023-06-20 20:40:56,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=796674.0, ans=0.125 2023-06-20 20:42:29,741 INFO [train.py:996] (3/4) Epoch 5, batch 10850, loss[loss=0.2535, simple_loss=0.3057, pruned_loss=0.1006, over 21813.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3296, pruned_loss=0.09231, over 4264436.02 frames. ], batch size: 98, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:42:43,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.85 vs. limit=5.0 2023-06-20 20:43:32,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=797154.0, ans=0.125 2023-06-20 20:43:43,865 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:43:53,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=797214.0, ans=0.09899494936611666 2023-06-20 20:44:11,618 INFO [train.py:996] (3/4) Epoch 5, batch 10900, loss[loss=0.2448, simple_loss=0.344, pruned_loss=0.07278, over 21778.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3231, pruned_loss=0.09055, over 4270578.63 frames. ], batch size: 351, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:44:16,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 2.886e+02 3.394e+02 4.121e+02 7.095e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-20 20:44:20,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-20 20:44:24,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=797274.0, ans=0.0 2023-06-20 20:44:37,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=797334.0, ans=0.05 2023-06-20 20:45:22,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=797454.0, ans=0.125 2023-06-20 20:45:32,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=797514.0, ans=0.125 2023-06-20 20:45:53,498 INFO [train.py:996] (3/4) Epoch 5, batch 10950, loss[loss=0.2206, simple_loss=0.2803, pruned_loss=0.08049, over 21626.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3176, pruned_loss=0.08791, over 4266184.12 frames. ], batch size: 298, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:46:29,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2023-06-20 20:47:11,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=797814.0, ans=0.125 2023-06-20 20:47:35,102 INFO [train.py:996] (3/4) Epoch 5, batch 11000, loss[loss=0.2165, simple_loss=0.2861, pruned_loss=0.07347, over 21709.00 frames. 
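The grad_scale field in the per-batch entries alternates between 16.0 and 32.0 through this section: with fp16 training the loss is multiplied by a dynamic scale before backward so that small gradients stay representable, and the scale is doubled after a run of overflow-free steps and halved when an overflow is detected, which produces exactly this kind of bouncing between adjacent powers of two. PyTorch's stock torch.cuda.amp.GradScaler implements that policy; a minimal usage sketch, where model, optimizer, loss_fn, and batch are placeholders:

    import torch

    scaler = torch.cuda.amp.GradScaler()  # dynamic loss scale

    def train_step(model, optimizer, loss_fn, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # forward in fp16 where safe
            loss = loss_fn(model, batch)
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)             # unscales; skips step on inf/nan
        scaler.update()                    # grow or back off the scale
        return loss.detach(), scaler.get_scale()  # e.g. 16.0 or 32.0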
], tot_loss[loss=0.2464, simple_loss=0.3158, pruned_loss=0.08852, over 4277487.93 frames. ], batch size: 263, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:47:39,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.761e+02 3.277e+02 3.770e+02 5.855e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-20 20:47:40,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=797874.0, ans=0.0 2023-06-20 20:47:57,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=797934.0, ans=0.1 2023-06-20 20:48:03,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=797934.0, ans=0.0 2023-06-20 20:48:40,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=798054.0, ans=0.0 2023-06-20 20:49:17,445 INFO [train.py:996] (3/4) Epoch 5, batch 11050, loss[loss=0.2553, simple_loss=0.3042, pruned_loss=0.1032, over 21841.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3122, pruned_loss=0.0888, over 4270159.65 frames. ], batch size: 98, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:50:22,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=798354.0, ans=0.125 2023-06-20 20:50:22,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=798354.0, ans=0.02 2023-06-20 20:50:59,767 INFO [train.py:996] (3/4) Epoch 5, batch 11100, loss[loss=0.2314, simple_loss=0.2928, pruned_loss=0.08493, over 21817.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3121, pruned_loss=0.08972, over 4262837.77 frames. ], batch size: 98, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:51:04,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.972e+02 3.396e+02 4.031e+02 6.791e+02, threshold=6.791e+02, percent-clipped=1.0 2023-06-20 20:51:09,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=798474.0, ans=0.125 2023-06-20 20:51:23,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798534.0, ans=0.1 2023-06-20 20:51:23,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=798534.0, ans=0.125 2023-06-20 20:51:58,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=798654.0, ans=0.125 2023-06-20 20:52:08,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=798654.0, ans=0.2 2023-06-20 20:52:40,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798714.0, ans=0.1 2023-06-20 20:52:43,895 INFO [train.py:996] (3/4) Epoch 5, batch 11150, loss[loss=0.2495, simple_loss=0.3302, pruned_loss=0.0844, over 21786.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3105, pruned_loss=0.08996, over 4261450.61 frames. 
], batch size: 371, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 20:53:01,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=798834.0, ans=0.2 2023-06-20 20:53:25,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=798894.0, ans=0.125 2023-06-20 20:53:27,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=798894.0, ans=0.0 2023-06-20 20:54:23,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=799014.0, ans=0.2 2023-06-20 20:54:27,924 INFO [train.py:996] (3/4) Epoch 5, batch 11200, loss[loss=0.2338, simple_loss=0.3006, pruned_loss=0.08354, over 21800.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3087, pruned_loss=0.08883, over 4263587.62 frames. ], batch size: 352, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:54:33,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.645e+02 2.968e+02 3.525e+02 6.155e+02, threshold=5.936e+02, percent-clipped=0.0 2023-06-20 20:54:40,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=799074.0, ans=0.125 2023-06-20 20:54:48,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=799134.0, ans=0.125 2023-06-20 20:55:45,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=799254.0, ans=0.0 2023-06-20 20:55:58,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=799314.0, ans=0.0 2023-06-20 20:56:11,040 INFO [train.py:996] (3/4) Epoch 5, batch 11250, loss[loss=0.275, simple_loss=0.3394, pruned_loss=0.1054, over 21438.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3093, pruned_loss=0.0886, over 4267920.24 frames. ], batch size: 473, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:42,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-20 20:57:52,793 INFO [train.py:996] (3/4) Epoch 5, batch 11300, loss[loss=0.2465, simple_loss=0.3226, pruned_loss=0.08519, over 21830.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3128, pruned_loss=0.09032, over 4274079.05 frames. ], batch size: 414, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:57,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.794e+02 3.076e+02 3.460e+02 4.900e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-20 20:58:50,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=799794.0, ans=0.0 2023-06-20 20:59:03,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799854.0, ans=0.1 2023-06-20 20:59:11,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=799854.0, ans=10.0 2023-06-20 20:59:38,213 INFO [train.py:996] (3/4) Epoch 5, batch 11350, loss[loss=0.2326, simple_loss=0.3004, pruned_loss=0.08241, over 21848.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3147, pruned_loss=0.09006, over 4271104.72 frames. 
], batch size: 107, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:00:09,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800034.0, ans=0.1 2023-06-20 21:00:12,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-20 21:00:32,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800094.0, ans=0.0 2023-06-20 21:00:57,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800154.0, ans=0.0 2023-06-20 21:01:00,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=800154.0, ans=0.125 2023-06-20 21:01:09,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=800214.0, ans=0.125 2023-06-20 21:01:09,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=800214.0, ans=0.0 2023-06-20 21:01:21,796 INFO [train.py:996] (3/4) Epoch 5, batch 11400, loss[loss=0.2893, simple_loss=0.3535, pruned_loss=0.1125, over 21353.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3193, pruned_loss=0.09232, over 4274717.19 frames. ], batch size: 131, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:01:26,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.159e+02 3.655e+02 4.619e+02 8.867e+02, threshold=7.309e+02, percent-clipped=8.0 2023-06-20 21:01:27,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=800274.0, ans=0.125 2023-06-20 21:02:12,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=800394.0, ans=0.04949747468305833 2023-06-20 21:02:27,926 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:03:10,007 INFO [train.py:996] (3/4) Epoch 5, batch 11450, loss[loss=0.2466, simple_loss=0.325, pruned_loss=0.0841, over 21862.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3198, pruned_loss=0.09089, over 4273531.43 frames. ], batch size: 371, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:03:29,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-20 21:03:32,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=800634.0, ans=0.0 2023-06-20 21:03:44,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=800634.0, ans=0.0 2023-06-20 21:04:15,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-20 21:04:50,151 INFO [train.py:996] (3/4) Epoch 5, batch 11500, loss[loss=0.2082, simple_loss=0.301, pruned_loss=0.05767, over 21608.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3249, pruned_loss=0.09303, over 4277989.03 frames. 
], batch size: 263, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:04:56,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.797e+02 3.137e+02 3.643e+02 6.427e+02, threshold=6.273e+02, percent-clipped=0.0 2023-06-20 21:06:39,669 INFO [train.py:996] (3/4) Epoch 5, batch 11550, loss[loss=0.3032, simple_loss=0.3942, pruned_loss=0.1061, over 21644.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3296, pruned_loss=0.09242, over 4273252.51 frames. ], batch size: 441, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:06:47,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=801174.0, ans=0.0 2023-06-20 21:07:30,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=801294.0, ans=0.2 2023-06-20 21:08:24,258 INFO [train.py:996] (3/4) Epoch 5, batch 11600, loss[loss=0.2554, simple_loss=0.3468, pruned_loss=0.08195, over 21360.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3457, pruned_loss=0.09467, over 4266558.15 frames. ], batch size: 194, lr: 6.31e-03, grad_scale: 32.0 2023-06-20 21:08:26,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=801474.0, ans=0.125 2023-06-20 21:08:30,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.937e+02 3.426e+02 4.222e+02 6.279e+02, threshold=6.853e+02, percent-clipped=1.0 2023-06-20 21:09:09,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801594.0, ans=0.1 2023-06-20 21:09:38,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-20 21:10:08,654 INFO [train.py:996] (3/4) Epoch 5, batch 11650, loss[loss=0.2562, simple_loss=0.323, pruned_loss=0.09471, over 21158.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3508, pruned_loss=0.0952, over 4258697.42 frames. ], batch size: 143, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:10:26,148 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:10:46,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=801894.0, ans=0.2 2023-06-20 21:11:09,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801954.0, ans=0.1 2023-06-20 21:11:09,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=801954.0, ans=0.125 2023-06-20 21:11:51,298 INFO [train.py:996] (3/4) Epoch 5, batch 11700, loss[loss=0.2672, simple_loss=0.3091, pruned_loss=0.1126, over 21435.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3408, pruned_loss=0.09392, over 4259352.79 frames. ], batch size: 441, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:12:03,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.808e+02 3.203e+02 3.779e+02 6.898e+02, threshold=6.406e+02, percent-clipped=1.0 2023-06-20 21:12:13,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-06-20 21:12:24,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=802134.0, ans=0.05 2023-06-20 21:12:55,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=802254.0, ans=0.125 2023-06-20 21:13:16,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=802314.0, ans=0.1 2023-06-20 21:13:28,453 INFO [train.py:996] (3/4) Epoch 5, batch 11750, loss[loss=0.2372, simple_loss=0.2907, pruned_loss=0.09185, over 21881.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3322, pruned_loss=0.09363, over 4263527.71 frames. ], batch size: 98, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:15:03,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=802614.0, ans=0.1 2023-06-20 21:15:18,409 INFO [train.py:996] (3/4) Epoch 5, batch 11800, loss[loss=0.246, simple_loss=0.3241, pruned_loss=0.08398, over 21591.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3341, pruned_loss=0.09577, over 4267662.45 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:15:26,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.905e+02 3.538e+02 4.499e+02 8.498e+02, threshold=7.075e+02, percent-clipped=3.0 2023-06-20 21:15:57,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-20 21:16:26,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=802854.0, ans=0.125 2023-06-20 21:16:53,712 INFO [train.py:996] (3/4) Epoch 5, batch 11850, loss[loss=0.2395, simple_loss=0.333, pruned_loss=0.07298, over 21808.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3365, pruned_loss=0.09535, over 4275582.21 frames. ], batch size: 282, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:16:59,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.10 vs. limit=22.5 2023-06-20 21:17:21,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=803034.0, ans=0.0 2023-06-20 21:17:26,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=803094.0, ans=0.0 2023-06-20 21:17:50,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=803094.0, ans=0.0 2023-06-20 21:17:57,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=803154.0, ans=0.125 2023-06-20 21:18:40,290 INFO [train.py:996] (3/4) Epoch 5, batch 11900, loss[loss=0.239, simple_loss=0.3304, pruned_loss=0.07378, over 21618.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3346, pruned_loss=0.09221, over 4273924.83 frames. 
], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:18:48,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.635e+02 2.949e+02 3.429e+02 6.903e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-20 21:19:37,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803394.0, ans=0.1 2023-06-20 21:19:39,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=803394.0, ans=0.2 2023-06-20 21:20:04,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=803454.0, ans=0.0 2023-06-20 21:20:04,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=803454.0, ans=15.0 2023-06-20 21:20:08,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=803514.0, ans=0.0 2023-06-20 21:20:19,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=803514.0, ans=0.1 2023-06-20 21:20:25,263 INFO [train.py:996] (3/4) Epoch 5, batch 11950, loss[loss=0.3276, simple_loss=0.463, pruned_loss=0.09609, over 20753.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3361, pruned_loss=0.0884, over 4266550.38 frames. ], batch size: 607, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:20:27,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-20 21:20:33,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=803574.0, ans=0.0 2023-06-20 21:22:08,829 INFO [train.py:996] (3/4) Epoch 5, batch 12000, loss[loss=0.2031, simple_loss=0.2674, pruned_loss=0.06935, over 21497.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3315, pruned_loss=0.08665, over 4265972.75 frames. ], batch size: 195, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:22:08,830 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 21:22:18,045 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.3872, 2.8932, 1.4045, 1.5037], device='cuda:3') 2023-06-20 21:22:26,108 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2641, simple_loss=0.3594, pruned_loss=0.08443, over 1796401.00 frames. 
2023-06-20 21:22:26,109 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 21:22:34,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 3.012e+02 3.779e+02 4.599e+02 7.953e+02, threshold=7.557e+02, percent-clipped=8.0 2023-06-20 21:22:34,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=803874.0, ans=0.0 2023-06-20 21:23:29,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=804054.0, ans=0.0 2023-06-20 21:23:41,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=804054.0, ans=0.125 2023-06-20 21:23:45,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=804054.0, ans=0.2 2023-06-20 21:24:10,190 INFO [train.py:996] (3/4) Epoch 5, batch 12050, loss[loss=0.3202, simple_loss=0.3576, pruned_loss=0.1414, over 21793.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3279, pruned_loss=0.08981, over 4269696.11 frames. ], batch size: 508, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:24:44,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=804234.0, ans=0.0 2023-06-20 21:24:51,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=804234.0, ans=0.125 2023-06-20 21:25:38,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804414.0, ans=0.1 2023-06-20 21:25:55,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=804414.0, ans=0.125 2023-06-20 21:26:00,495 INFO [train.py:996] (3/4) Epoch 5, batch 12100, loss[loss=0.2373, simple_loss=0.2999, pruned_loss=0.08735, over 20650.00 frames. ], tot_loss[loss=0.262, simple_loss=0.333, pruned_loss=0.09552, over 4269959.35 frames. ], batch size: 607, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:26:07,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=804474.0, ans=0.125 2023-06-20 21:26:14,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.887e+02 3.241e+02 3.794e+02 5.961e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 21:26:27,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=804534.0, ans=0.125 2023-06-20 21:26:40,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=804534.0, ans=0.125 2023-06-20 21:26:44,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=804594.0, ans=0.0 2023-06-20 21:27:40,958 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:27:47,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=804714.0, ans=0.0 2023-06-20 21:27:54,329 INFO [train.py:996] (3/4) Epoch 5, batch 12150, loss[loss=0.2636, simple_loss=0.3788, pruned_loss=0.07418, over 19828.00 frames. 
], tot_loss[loss=0.2625, simple_loss=0.3361, pruned_loss=0.09447, over 4266964.44 frames. ], batch size: 702, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:27:59,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=804774.0, ans=0.0 2023-06-20 21:29:21,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=805014.0, ans=0.125 2023-06-20 21:29:37,988 INFO [train.py:996] (3/4) Epoch 5, batch 12200, loss[loss=0.232, simple_loss=0.292, pruned_loss=0.08598, over 21731.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3315, pruned_loss=0.09264, over 4264258.78 frames. ], batch size: 316, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:29:38,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=805074.0, ans=0.035 2023-06-20 21:29:51,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.953e+02 3.466e+02 4.601e+02 9.385e+02, threshold=6.933e+02, percent-clipped=9.0 2023-06-20 21:30:20,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-20 21:31:02,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=805254.0, ans=0.125 2023-06-20 21:31:22,563 INFO [train.py:996] (3/4) Epoch 5, batch 12250, loss[loss=0.2362, simple_loss=0.3128, pruned_loss=0.07982, over 21406.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3222, pruned_loss=0.08841, over 4265358.95 frames. ], batch size: 471, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:32:48,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805614.0, ans=0.1 2023-06-20 21:33:06,440 INFO [train.py:996] (3/4) Epoch 5, batch 12300, loss[loss=0.2649, simple_loss=0.3422, pruned_loss=0.09375, over 21738.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3128, pruned_loss=0.08266, over 4246318.57 frames. ], batch size: 247, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:33:20,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.636e+02 3.177e+02 4.024e+02 7.253e+02, threshold=6.354e+02, percent-clipped=1.0 2023-06-20 21:33:43,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=805734.0, ans=0.0 2023-06-20 21:33:57,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=805794.0, ans=0.2 2023-06-20 21:34:49,560 INFO [train.py:996] (3/4) Epoch 5, batch 12350, loss[loss=0.2383, simple_loss=0.3189, pruned_loss=0.07887, over 21379.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3174, pruned_loss=0.08356, over 4254538.30 frames. ], batch size: 211, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:35:24,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=806034.0, ans=0.125 2023-06-20 21:35:31,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=806034.0, ans=0.125 2023-06-20 21:36:08,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=12.0 2023-06-20 21:36:34,905 INFO [train.py:996] (3/4) Epoch 5, batch 12400, loss[loss=0.3162, simple_loss=0.354, pruned_loss=0.1392, over 21787.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3217, pruned_loss=0.08877, over 4272467.45 frames. ], batch size: 508, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:36:49,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.789e+02 3.195e+02 3.703e+02 5.558e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-20 21:36:59,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=806274.0, ans=0.125 2023-06-20 21:37:29,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-20 21:37:32,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-20 21:38:10,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=806514.0, ans=0.125 2023-06-20 21:38:25,350 INFO [train.py:996] (3/4) Epoch 5, batch 12450, loss[loss=0.2436, simple_loss=0.3649, pruned_loss=0.06113, over 20787.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3258, pruned_loss=0.09182, over 4275527.96 frames. ], batch size: 607, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:39:15,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=806694.0, ans=0.125 2023-06-20 21:39:46,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=806754.0, ans=0.0 2023-06-20 21:40:17,342 INFO [train.py:996] (3/4) Epoch 5, batch 12500, loss[loss=0.2819, simple_loss=0.3673, pruned_loss=0.09818, over 21796.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.336, pruned_loss=0.09446, over 4276196.22 frames. ], batch size: 118, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:40:27,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 2.818e+02 3.353e+02 4.175e+02 5.969e+02, threshold=6.707e+02, percent-clipped=0.0 2023-06-20 21:41:08,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=806994.0, ans=0.0 2023-06-20 21:41:30,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=807054.0, ans=0.0 2023-06-20 21:41:49,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=807114.0, ans=0.04949747468305833 2023-06-20 21:41:55,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=807114.0, ans=0.125 2023-06-20 21:42:04,530 INFO [train.py:996] (3/4) Epoch 5, batch 12550, loss[loss=0.2407, simple_loss=0.2839, pruned_loss=0.09875, over 20120.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3404, pruned_loss=0.09762, over 4277936.74 frames. 
], batch size: 703, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:42:47,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=807294.0, ans=0.125 2023-06-20 21:43:28,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=807354.0, ans=0.1 2023-06-20 21:43:31,492 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:43:36,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=807414.0, ans=0.0 2023-06-20 21:43:53,029 INFO [train.py:996] (3/4) Epoch 5, batch 12600, loss[loss=0.232, simple_loss=0.3204, pruned_loss=0.07175, over 21685.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3394, pruned_loss=0.0954, over 4267858.31 frames. ], batch size: 351, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:43:55,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=807474.0, ans=0.0 2023-06-20 21:44:08,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.858e+02 3.271e+02 3.869e+02 6.376e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-20 21:44:30,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807594.0, ans=0.1 2023-06-20 21:45:35,583 INFO [train.py:996] (3/4) Epoch 5, batch 12650, loss[loss=0.2437, simple_loss=0.3131, pruned_loss=0.08719, over 21892.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3332, pruned_loss=0.09158, over 4272993.64 frames. ], batch size: 124, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:47:20,042 INFO [train.py:996] (3/4) Epoch 5, batch 12700, loss[loss=0.3012, simple_loss=0.3565, pruned_loss=0.1229, over 21234.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3323, pruned_loss=0.09462, over 4281674.94 frames. ], batch size: 143, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:47:27,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=808074.0, ans=0.125 2023-06-20 21:47:35,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.923e+02 3.430e+02 4.123e+02 8.274e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 21:48:01,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=808134.0, ans=0.125 2023-06-20 21:48:31,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-20 21:48:42,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=808254.0, ans=0.1 2023-06-20 21:48:55,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=808314.0, ans=0.125 2023-06-20 21:49:02,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-20 21:49:03,346 INFO [train.py:996] (3/4) Epoch 5, batch 12750, loss[loss=0.2295, simple_loss=0.3082, pruned_loss=0.07544, over 21532.00 frames. 
], tot_loss[loss=0.2619, simple_loss=0.334, pruned_loss=0.09497, over 4278298.68 frames. ], batch size: 212, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:49:38,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=808434.0, ans=0.0 2023-06-20 21:49:42,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:50:07,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=808554.0, ans=0.125 2023-06-20 21:50:32,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=808614.0, ans=0.0 2023-06-20 21:50:52,090 INFO [train.py:996] (3/4) Epoch 5, batch 12800, loss[loss=0.2832, simple_loss=0.3477, pruned_loss=0.1093, over 21240.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3342, pruned_loss=0.09581, over 4279459.33 frames. ], batch size: 143, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:51:03,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.761e+02 3.177e+02 3.732e+02 6.852e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-20 21:51:48,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=808794.0, ans=0.09899494936611666 2023-06-20 21:51:52,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=808854.0, ans=0.0 2023-06-20 21:52:10,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808854.0, ans=0.1 2023-06-20 21:52:12,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=808854.0, ans=0.0 2023-06-20 21:52:35,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=808974.0, ans=0.2 2023-06-20 21:52:37,033 INFO [train.py:996] (3/4) Epoch 5, batch 12850, loss[loss=0.2747, simple_loss=0.3381, pruned_loss=0.1056, over 21103.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3367, pruned_loss=0.09773, over 4285701.44 frames. ], batch size: 143, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:52:39,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=808974.0, ans=0.0 2023-06-20 21:53:21,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=809094.0, ans=0.125 2023-06-20 21:53:23,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=809094.0, ans=0.05 2023-06-20 21:53:23,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-20 21:53:31,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=809094.0, ans=0.0 2023-06-20 21:53:48,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=809154.0, ans=0.125 2023-06-20 21:54:27,282 INFO [train.py:996] (3/4) Epoch 5, batch 12900, loss[loss=0.1967, simple_loss=0.2701, pruned_loss=0.06167, over 21733.00 frames. 
], tot_loss[loss=0.2614, simple_loss=0.3343, pruned_loss=0.09425, over 4284083.54 frames. ], batch size: 124, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:54:45,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.796e+02 3.210e+02 3.770e+02 8.746e+02, threshold=6.419e+02, percent-clipped=1.0 2023-06-20 21:55:15,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=809394.0, ans=0.125 2023-06-20 21:55:22,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809394.0, ans=0.1 2023-06-20 21:56:17,447 INFO [train.py:996] (3/4) Epoch 5, batch 12950, loss[loss=0.2631, simple_loss=0.3377, pruned_loss=0.09428, over 21608.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3321, pruned_loss=0.09179, over 4280206.12 frames. ], batch size: 263, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:56:24,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=809574.0, ans=0.125 2023-06-20 21:56:26,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=809574.0, ans=0.0 2023-06-20 21:56:31,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809574.0, ans=0.125 2023-06-20 21:57:06,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=809694.0, ans=0.125 2023-06-20 21:57:39,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809814.0, ans=0.125 2023-06-20 21:57:50,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=809814.0, ans=0.0 2023-06-20 21:58:00,360 INFO [train.py:996] (3/4) Epoch 5, batch 13000, loss[loss=0.2154, simple_loss=0.2958, pruned_loss=0.06745, over 21705.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3319, pruned_loss=0.09205, over 4282383.42 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:58:15,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.939e+02 3.382e+02 4.149e+02 6.832e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 21:58:18,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=809934.0, ans=0.2 2023-06-20 21:58:38,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-20 21:59:45,268 INFO [train.py:996] (3/4) Epoch 5, batch 13050, loss[loss=0.2254, simple_loss=0.2946, pruned_loss=0.07807, over 21795.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3267, pruned_loss=0.08853, over 4275170.15 frames. 
], batch size: 247, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:00:00,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=810234.0, ans=0.2 2023-06-20 22:00:08,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810234.0, ans=0.1 2023-06-20 22:00:16,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=810234.0, ans=0.125 2023-06-20 22:00:42,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=810294.0, ans=0.2 2023-06-20 22:01:29,082 INFO [train.py:996] (3/4) Epoch 5, batch 13100, loss[loss=0.2716, simple_loss=0.347, pruned_loss=0.09811, over 21735.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3286, pruned_loss=0.08897, over 4272373.83 frames. ], batch size: 332, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:01:39,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=810474.0, ans=0.2 2023-06-20 22:01:49,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.828e+02 3.544e+02 4.567e+02 8.084e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-20 22:02:01,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=810534.0, ans=0.0 2023-06-20 22:02:31,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=810594.0, ans=0.1 2023-06-20 22:02:41,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-20 22:03:01,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-20 22:03:19,519 INFO [train.py:996] (3/4) Epoch 5, batch 13150, loss[loss=0.1956, simple_loss=0.2719, pruned_loss=0.05962, over 21593.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3301, pruned_loss=0.09221, over 4275179.84 frames. ], batch size: 230, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:03:39,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=810834.0, ans=0.2 2023-06-20 22:03:55,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=810834.0, ans=0.125 2023-06-20 22:04:04,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=810894.0, ans=0.2 2023-06-20 22:04:12,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. 
limit=15.0 2023-06-20 22:04:45,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=811014.0, ans=0.125 2023-06-20 22:04:46,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811014.0, ans=0.1 2023-06-20 22:04:46,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=811014.0, ans=0.2 2023-06-20 22:05:03,338 INFO [train.py:996] (3/4) Epoch 5, batch 13200, loss[loss=0.2415, simple_loss=0.3103, pruned_loss=0.08635, over 22010.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3293, pruned_loss=0.09182, over 4275244.62 frames. ], batch size: 317, lr: 6.28e-03, grad_scale: 16.0 2023-06-20 22:05:08,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=811074.0, ans=0.0 2023-06-20 22:05:18,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.901e+02 3.344e+02 4.017e+02 7.205e+02, threshold=6.688e+02, percent-clipped=1.0 2023-06-20 22:05:42,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-20 22:06:11,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=811254.0, ans=0.0 2023-06-20 22:06:32,030 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:06:45,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=811314.0, ans=0.2 2023-06-20 22:06:48,140 INFO [train.py:996] (3/4) Epoch 5, batch 13250, loss[loss=0.3062, simple_loss=0.3672, pruned_loss=0.1226, over 21798.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3305, pruned_loss=0.09449, over 4276486.46 frames. ], batch size: 441, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:06:50,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=811374.0, ans=0.125 2023-06-20 22:07:01,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=811374.0, ans=0.125 2023-06-20 22:07:29,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=811434.0, ans=0.0 2023-06-20 22:07:43,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-20 22:07:45,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=811494.0, ans=0.0 2023-06-20 22:08:15,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=811614.0, ans=0.04949747468305833 2023-06-20 22:08:29,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-20 22:08:39,110 INFO [train.py:996] (3/4) Epoch 5, batch 13300, loss[loss=0.3116, simple_loss=0.38, pruned_loss=0.1216, over 21504.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3331, pruned_loss=0.09425, over 4277401.60 frames. 
], batch size: 471, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:08:59,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.849e+02 3.375e+02 4.056e+02 7.431e+02, threshold=6.749e+02, percent-clipped=1.0 2023-06-20 22:10:19,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=811914.0, ans=0.0 2023-06-20 22:10:28,578 INFO [train.py:996] (3/4) Epoch 5, batch 13350, loss[loss=0.2874, simple_loss=0.364, pruned_loss=0.1055, over 21623.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3374, pruned_loss=0.09695, over 4275111.99 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:10:51,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=812034.0, ans=0.0 2023-06-20 22:11:06,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=812094.0, ans=10.0 2023-06-20 22:11:28,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=812154.0, ans=0.2 2023-06-20 22:11:42,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-20 22:12:08,362 INFO [train.py:996] (3/4) Epoch 5, batch 13400, loss[loss=0.2592, simple_loss=0.308, pruned_loss=0.1052, over 21865.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3373, pruned_loss=0.09817, over 4275425.28 frames. ], batch size: 98, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:12:22,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.994e+02 3.475e+02 4.105e+02 5.675e+02, threshold=6.951e+02, percent-clipped=0.0 2023-06-20 22:12:31,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=812334.0, ans=0.125 2023-06-20 22:12:46,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=812334.0, ans=0.125 2023-06-20 22:12:53,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=812394.0, ans=0.125 2023-06-20 22:13:46,630 INFO [train.py:996] (3/4) Epoch 5, batch 13450, loss[loss=0.3173, simple_loss=0.3686, pruned_loss=0.133, over 21372.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3376, pruned_loss=0.1002, over 4265481.30 frames. ], batch size: 471, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:14:34,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=812694.0, ans=0.0 2023-06-20 22:15:00,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-20 22:15:10,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=812814.0, ans=0.125 2023-06-20 22:15:33,796 INFO [train.py:996] (3/4) Epoch 5, batch 13500, loss[loss=0.2202, simple_loss=0.2745, pruned_loss=0.08295, over 21517.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3288, pruned_loss=0.09685, over 4269294.31 frames. 
], batch size: 195, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:15:53,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.173e+02 3.616e+02 4.505e+02 8.152e+02, threshold=7.232e+02, percent-clipped=1.0 2023-06-20 22:16:16,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=812994.0, ans=0.0 2023-06-20 22:16:19,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-20 22:17:15,883 INFO [train.py:996] (3/4) Epoch 5, batch 13550, loss[loss=0.3056, simple_loss=0.3846, pruned_loss=0.1134, over 21714.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.332, pruned_loss=0.09564, over 4266686.79 frames. ], batch size: 247, lr: 6.27e-03, grad_scale: 8.0 2023-06-20 22:17:57,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=813294.0, ans=0.125 2023-06-20 22:18:34,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813414.0, ans=0.1 2023-06-20 22:18:55,144 INFO [train.py:996] (3/4) Epoch 5, batch 13600, loss[loss=0.2483, simple_loss=0.311, pruned_loss=0.09278, over 21643.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3334, pruned_loss=0.09566, over 4270954.53 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:19:06,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-20 22:19:10,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=813474.0, ans=0.125 2023-06-20 22:19:16,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.895e+02 3.506e+02 4.425e+02 7.285e+02, threshold=7.012e+02, percent-clipped=2.0 2023-06-20 22:19:22,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=813534.0, ans=0.1 2023-06-20 22:19:45,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=813594.0, ans=0.125 2023-06-20 22:20:22,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-20 22:20:39,907 INFO [train.py:996] (3/4) Epoch 5, batch 13650, loss[loss=0.2711, simple_loss=0.3273, pruned_loss=0.1075, over 21529.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3282, pruned_loss=0.09295, over 4272277.06 frames. ], batch size: 414, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:20:57,568 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:21:22,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=813894.0, ans=0.125 2023-06-20 22:22:19,139 INFO [train.py:996] (3/4) Epoch 5, batch 13700, loss[loss=0.2146, simple_loss=0.2694, pruned_loss=0.07988, over 21281.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3245, pruned_loss=0.09235, over 4269974.98 frames. 
], batch size: 144, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:22:28,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=814074.0, ans=0.125 2023-06-20 22:22:30,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-20 22:22:31,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=814074.0, ans=0.2 2023-06-20 22:22:41,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.806e+02 3.342e+02 4.306e+02 8.545e+02, threshold=6.684e+02, percent-clipped=2.0 2023-06-20 22:23:29,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=814254.0, ans=0.2 2023-06-20 22:23:56,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=814314.0, ans=0.0 2023-06-20 22:24:01,291 INFO [train.py:996] (3/4) Epoch 5, batch 13750, loss[loss=0.22, simple_loss=0.287, pruned_loss=0.07657, over 21322.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3188, pruned_loss=0.09088, over 4267517.56 frames. ], batch size: 176, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:25:49,698 INFO [train.py:996] (3/4) Epoch 5, batch 13800, loss[loss=0.367, simple_loss=0.4508, pruned_loss=0.1416, over 21443.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3261, pruned_loss=0.09101, over 4266082.99 frames. ], batch size: 507, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:25:54,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814674.0, ans=0.1 2023-06-20 22:25:58,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=814674.0, ans=0.125 2023-06-20 22:26:06,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.993e+02 3.321e+02 4.024e+02 5.976e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 22:26:23,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-20 22:26:45,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=814854.0, ans=0.0 2023-06-20 22:27:18,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-20 22:27:26,013 INFO [train.py:996] (3/4) Epoch 5, batch 13850, loss[loss=0.2767, simple_loss=0.345, pruned_loss=0.1041, over 21593.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3328, pruned_loss=0.09208, over 4273148.62 frames. ], batch size: 230, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:28:02,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=815094.0, ans=0.05 2023-06-20 22:28:52,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=815214.0, ans=0.125 2023-06-20 22:29:01,708 INFO [train.py:996] (3/4) Epoch 5, batch 13900, loss[loss=0.2546, simple_loss=0.3222, pruned_loss=0.09353, over 21854.00 frames. 
], tot_loss[loss=0.2658, simple_loss=0.3387, pruned_loss=0.09645, over 4276276.11 frames. ], batch size: 332, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:29:27,668 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 2.909e+02 3.378e+02 3.977e+02 7.082e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 22:29:28,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=815334.0, ans=0.125 2023-06-20 22:29:32,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=815334.0, ans=0.0 2023-06-20 22:30:35,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=815514.0, ans=0.125 2023-06-20 22:30:41,761 INFO [train.py:996] (3/4) Epoch 5, batch 13950, loss[loss=0.2044, simple_loss=0.2949, pruned_loss=0.05697, over 20849.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3377, pruned_loss=0.09804, over 4281692.76 frames. ], batch size: 608, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:31:52,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=815754.0, ans=0.125 2023-06-20 22:31:53,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815754.0, ans=0.1 2023-06-20 22:31:53,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=815754.0, ans=0.125 2023-06-20 22:31:54,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.20 vs. limit=22.5 2023-06-20 22:32:06,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=815814.0, ans=0.2 2023-06-20 22:32:06,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=815814.0, ans=0.125 2023-06-20 22:32:18,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.33 vs. limit=5.0 2023-06-20 22:32:24,842 INFO [train.py:996] (3/4) Epoch 5, batch 14000, loss[loss=0.2026, simple_loss=0.2774, pruned_loss=0.06393, over 21176.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3311, pruned_loss=0.09441, over 4273312.49 frames. ], batch size: 143, lr: 6.26e-03, grad_scale: 32.0 2023-06-20 22:32:45,986 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.619e+02 3.139e+02 3.881e+02 6.690e+02, threshold=6.278e+02, percent-clipped=0.0 2023-06-20 22:32:55,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-20 22:33:02,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-20 22:33:21,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-20 22:33:24,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816054.0, ans=0.1 2023-06-20 22:33:40,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=816054.0, ans=0.2 2023-06-20 22:34:05,228 INFO [train.py:996] (3/4) Epoch 5, batch 14050, loss[loss=0.2478, simple_loss=0.3082, pruned_loss=0.09373, over 21671.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3262, pruned_loss=0.09048, over 4272293.04 frames. ], batch size: 282, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:34:31,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-20 22:34:51,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=816294.0, ans=0.125 2023-06-20 22:35:22,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-20 22:35:26,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=816414.0, ans=0.125 2023-06-20 22:35:44,093 INFO [train.py:996] (3/4) Epoch 5, batch 14100, loss[loss=0.2624, simple_loss=0.3263, pruned_loss=0.09929, over 21683.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3212, pruned_loss=0.09021, over 4256153.78 frames. ], batch size: 332, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:36:07,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.771e+02 3.154e+02 4.028e+02 6.108e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-20 22:37:18,621 INFO [train.py:996] (3/4) Epoch 5, batch 14150, loss[loss=0.2332, simple_loss=0.3186, pruned_loss=0.07388, over 21584.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3258, pruned_loss=0.091, over 4251785.08 frames. ], batch size: 230, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:37:37,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=816834.0, ans=0.2 2023-06-20 22:38:27,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=816954.0, ans=0.0 2023-06-20 22:38:32,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=816954.0, ans=0.0 2023-06-20 22:38:45,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=817014.0, ans=0.125 2023-06-20 22:38:57,640 INFO [train.py:996] (3/4) Epoch 5, batch 14200, loss[loss=0.2368, simple_loss=0.3136, pruned_loss=0.08003, over 21823.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.324, pruned_loss=0.08931, over 4249647.60 frames. ], batch size: 282, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:39:19,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.484e+02 2.963e+02 3.702e+02 8.044e+02, threshold=5.927e+02, percent-clipped=3.0 2023-06-20 22:40:19,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=15.0 2023-06-20 22:40:36,146 INFO [train.py:996] (3/4) Epoch 5, batch 14250, loss[loss=0.24, simple_loss=0.2967, pruned_loss=0.09166, over 21840.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3185, pruned_loss=0.08915, over 4252991.88 frames. ], batch size: 107, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:40:51,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-20 22:40:55,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817434.0, ans=0.1 2023-06-20 22:41:50,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=817554.0, ans=0.2 2023-06-20 22:42:02,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-20 22:42:22,699 INFO [train.py:996] (3/4) Epoch 5, batch 14300, loss[loss=0.3091, simple_loss=0.394, pruned_loss=0.1121, over 21743.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3182, pruned_loss=0.08784, over 4244395.93 frames. ], batch size: 351, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:42:30,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=817674.0, ans=0.125 2023-06-20 22:42:31,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=817674.0, ans=0.125 2023-06-20 22:42:46,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 3.147e+02 3.840e+02 5.009e+02 9.347e+02, threshold=7.680e+02, percent-clipped=16.0 2023-06-20 22:42:55,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817734.0, ans=0.1 2023-06-20 22:43:46,231 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:43:46,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=817914.0, ans=0.125 2023-06-20 22:44:02,688 INFO [train.py:996] (3/4) Epoch 5, batch 14350, loss[loss=0.226, simple_loss=0.3005, pruned_loss=0.07574, over 21824.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3259, pruned_loss=0.08961, over 4236953.67 frames. ], batch size: 247, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:44:27,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=818034.0, ans=0.0 2023-06-20 22:45:38,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2023-06-20 22:45:40,619 INFO [train.py:996] (3/4) Epoch 5, batch 14400, loss[loss=0.2972, simple_loss=0.3506, pruned_loss=0.1219, over 21753.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.325, pruned_loss=0.09139, over 4250041.54 frames. ], batch size: 112, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:45:58,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.774e+02 3.108e+02 3.689e+02 4.790e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 22:46:21,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. 
limit=15.0 2023-06-20 22:46:35,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=818394.0, ans=0.2 2023-06-20 22:47:20,542 INFO [train.py:996] (3/4) Epoch 5, batch 14450, loss[loss=0.2205, simple_loss=0.2864, pruned_loss=0.07725, over 21781.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3195, pruned_loss=0.09164, over 4256792.45 frames. ], batch size: 351, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:47:27,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=818574.0, ans=0.2 2023-06-20 22:47:28,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=818574.0, ans=0.0 2023-06-20 22:47:30,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=818574.0, ans=0.0 2023-06-20 22:47:43,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-20 22:48:00,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=818694.0, ans=0.0 2023-06-20 22:48:12,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=818694.0, ans=0.125 2023-06-20 22:48:15,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-20 22:48:17,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=818754.0, ans=0.125 2023-06-20 22:48:36,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=818814.0, ans=0.0 2023-06-20 22:48:58,538 INFO [train.py:996] (3/4) Epoch 5, batch 14500, loss[loss=0.2256, simple_loss=0.3155, pruned_loss=0.06787, over 21777.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3161, pruned_loss=0.09128, over 4251620.98 frames. ], batch size: 371, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:49:13,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=818934.0, ans=0.125 2023-06-20 22:49:16,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.793e+02 3.259e+02 3.991e+02 5.427e+02, threshold=6.518e+02, percent-clipped=0.0 2023-06-20 22:49:45,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-20 22:50:00,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=819054.0, ans=0.125 2023-06-20 22:50:12,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=819114.0, ans=0.125 2023-06-20 22:50:40,031 INFO [train.py:996] (3/4) Epoch 5, batch 14550, loss[loss=0.2223, simple_loss=0.2914, pruned_loss=0.07659, over 20048.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.321, pruned_loss=0.09287, over 4256153.33 frames. 
], batch size: 703, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:51:05,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-20 22:51:20,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819294.0, ans=0.1 2023-06-20 22:51:33,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=819294.0, ans=0.02 2023-06-20 22:52:20,164 INFO [train.py:996] (3/4) Epoch 5, batch 14600, loss[loss=0.2805, simple_loss=0.3597, pruned_loss=0.1006, over 21413.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3324, pruned_loss=0.09762, over 4266540.50 frames. ], batch size: 211, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:52:44,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.146e+02 3.577e+02 4.655e+02 8.854e+02, threshold=7.154e+02, percent-clipped=8.0 2023-06-20 22:53:03,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=819594.0, ans=0.0 2023-06-20 22:54:00,086 INFO [train.py:996] (3/4) Epoch 5, batch 14650, loss[loss=0.1963, simple_loss=0.2655, pruned_loss=0.06357, over 21263.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3333, pruned_loss=0.09626, over 4258484.43 frames. ], batch size: 144, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:54:03,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=819774.0, ans=0.04949747468305833 2023-06-20 22:54:23,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-20 22:54:25,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=819834.0, ans=0.125 2023-06-20 22:54:41,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=819894.0, ans=0.0 2023-06-20 22:54:41,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-20 22:55:41,427 INFO [train.py:996] (3/4) Epoch 5, batch 14700, loss[loss=0.2098, simple_loss=0.2871, pruned_loss=0.06621, over 21784.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3256, pruned_loss=0.0894, over 4264760.49 frames. ], batch size: 124, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:56:11,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.525e+02 2.974e+02 3.979e+02 6.680e+02, threshold=5.948e+02, percent-clipped=0.0 2023-06-20 22:57:29,217 INFO [train.py:996] (3/4) Epoch 5, batch 14750, loss[loss=0.2991, simple_loss=0.3741, pruned_loss=0.1121, over 21740.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3312, pruned_loss=0.09245, over 4271540.42 frames. 
], batch size: 247, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:57:44,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=820374.0, ans=0.125 2023-06-20 22:57:51,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=820434.0, ans=0.125 2023-06-20 22:57:55,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-20 22:57:59,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=820434.0, ans=0.125 2023-06-20 22:58:42,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=820554.0, ans=0.125 2023-06-20 22:59:11,129 INFO [train.py:996] (3/4) Epoch 5, batch 14800, loss[loss=0.2457, simple_loss=0.3046, pruned_loss=0.0934, over 21127.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3434, pruned_loss=0.09934, over 4274064.19 frames. ], batch size: 176, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:59:13,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=820674.0, ans=0.1 2023-06-20 22:59:30,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.152e+02 3.633e+02 4.425e+02 1.058e+03, threshold=7.266e+02, percent-clipped=3.0 2023-06-20 23:00:55,535 INFO [train.py:996] (3/4) Epoch 5, batch 14850, loss[loss=0.2657, simple_loss=0.3201, pruned_loss=0.1056, over 21534.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3389, pruned_loss=0.0989, over 4268640.43 frames. ], batch size: 230, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:01:36,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=821094.0, ans=0.2 2023-06-20 23:02:01,997 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:02:09,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=821154.0, ans=0.125 2023-06-20 23:02:37,256 INFO [train.py:996] (3/4) Epoch 5, batch 14900, loss[loss=0.3086, simple_loss=0.3683, pruned_loss=0.1245, over 21468.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3411, pruned_loss=0.1004, over 4270237.74 frames. ], batch size: 471, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:02:55,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=821274.0, ans=0.125 2023-06-20 23:03:00,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=821274.0, ans=0.125 2023-06-20 23:03:01,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. 
limit=10.0 2023-06-20 23:03:08,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.108e+02 3.722e+02 4.348e+02 7.688e+02, threshold=7.444e+02, percent-clipped=1.0 2023-06-20 23:03:12,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821334.0, ans=0.1 2023-06-20 23:03:32,736 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:03:32,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=821394.0, ans=0.0 2023-06-20 23:03:45,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=821454.0, ans=0.125 2023-06-20 23:03:51,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.48 vs. limit=10.0 2023-06-20 23:04:07,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=821514.0, ans=0.125 2023-06-20 23:04:29,709 INFO [train.py:996] (3/4) Epoch 5, batch 14950, loss[loss=0.2614, simple_loss=0.3409, pruned_loss=0.09096, over 21587.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3394, pruned_loss=0.0988, over 4263785.77 frames. ], batch size: 389, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:04:41,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=821574.0, ans=0.0 2023-06-20 23:04:53,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=821634.0, ans=0.125 2023-06-20 23:05:27,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=821754.0, ans=15.0 2023-06-20 23:05:48,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-20 23:06:10,023 INFO [train.py:996] (3/4) Epoch 5, batch 15000, loss[loss=0.2588, simple_loss=0.3339, pruned_loss=0.09184, over 20683.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.341, pruned_loss=0.1002, over 4263754.02 frames. ], batch size: 607, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:06:10,023 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-20 23:06:26,232 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2595, simple_loss=0.3578, pruned_loss=0.08055, over 1796401.00 frames. 2023-06-20 23:06:26,233 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-20 23:07:00,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.991e+02 3.617e+02 4.837e+02 7.610e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-20 23:08:12,412 INFO [train.py:996] (3/4) Epoch 5, batch 15050, loss[loss=0.2383, simple_loss=0.3138, pruned_loss=0.08138, over 21442.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3415, pruned_loss=0.1004, over 4262392.00 frames. ], batch size: 194, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:08:26,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=12.0 2023-06-20 23:08:59,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=822294.0, ans=0.125 2023-06-20 23:09:02,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=822294.0, ans=0.09899494936611666 2023-06-20 23:09:39,600 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.665e-03 2023-06-20 23:09:58,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-20 23:09:59,482 INFO [train.py:996] (3/4) Epoch 5, batch 15100, loss[loss=0.2703, simple_loss=0.3342, pruned_loss=0.1032, over 21833.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3444, pruned_loss=0.09964, over 4263382.15 frames. ], batch size: 247, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:10:25,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.218e+02 4.050e+02 5.256e+02 8.500e+02, threshold=8.100e+02, percent-clipped=5.0 2023-06-20 23:10:59,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=822594.0, ans=0.0 2023-06-20 23:11:22,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=822714.0, ans=0.2 2023-06-20 23:11:44,943 INFO [train.py:996] (3/4) Epoch 5, batch 15150, loss[loss=0.2693, simple_loss=0.3147, pruned_loss=0.112, over 21180.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.34, pruned_loss=0.1002, over 4263480.34 frames. ], batch size: 143, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:12:06,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822834.0, ans=0.1 2023-06-20 23:12:06,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=822834.0, ans=0.0 2023-06-20 23:12:09,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=822834.0, ans=0.2 2023-06-20 23:12:49,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822954.0, ans=0.1 2023-06-20 23:13:21,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=823014.0, ans=0.125 2023-06-20 23:13:24,109 INFO [train.py:996] (3/4) Epoch 5, batch 15200, loss[loss=0.2026, simple_loss=0.2666, pruned_loss=0.06927, over 21824.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3302, pruned_loss=0.09606, over 4263835.22 frames. 
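
Every `loss[...]` / `tot_loss[...]` triplet printed in this section satisfies `loss = 0.5 * simple_loss + pruned_loss` up to rounding (for example 0.2612 = 0.5 * 0.3302 + 0.09606 at batch 15200 just below): the smoothed transducer loss enters at half weight on top of the pruned RNN-T loss. A sketch of the combination, ignoring the warm-up scaling the recipe applies in the earliest steps:

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    # A scale of 0.5 reproduces every triplet printed in this log section.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combined_loss(0.3302, 0.09606) - 0.2612) < 5e-4  # batch 15200
```
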
], batch size: 118, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:13:42,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=823134.0, ans=0.125 2023-06-20 23:13:45,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.736e+02 3.206e+02 4.003e+02 7.087e+02, threshold=6.412e+02, percent-clipped=0.0 2023-06-20 23:13:47,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=823134.0, ans=0.125 2023-06-20 23:13:52,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=12.0 2023-06-20 23:14:33,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=823254.0, ans=0.2 2023-06-20 23:14:56,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=823314.0, ans=0.125 2023-06-20 23:15:01,153 INFO [train.py:996] (3/4) Epoch 5, batch 15250, loss[loss=0.2822, simple_loss=0.3358, pruned_loss=0.1143, over 21807.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3231, pruned_loss=0.09389, over 4266586.28 frames. ], batch size: 118, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:15:08,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=823374.0, ans=0.125 2023-06-20 23:15:37,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=823494.0, ans=0.0 2023-06-20 23:15:47,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823494.0, ans=0.1 2023-06-20 23:15:55,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=823494.0, ans=0.0 2023-06-20 23:15:57,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=823494.0, ans=0.2 2023-06-20 23:16:33,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823614.0, ans=0.1 2023-06-20 23:16:42,961 INFO [train.py:996] (3/4) Epoch 5, batch 15300, loss[loss=0.3015, simple_loss=0.3559, pruned_loss=0.1235, over 21368.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3272, pruned_loss=0.09709, over 4269189.29 frames. 
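
The `tot_loss[... over N frames]` values behave as frame-weighted running averages: each batch contributes its loss times its frame count, and the printed number is the accumulated loss sum divided by accumulated frames. A minimal sketch of that tracker (an assumption about the reporting only; the recipe's tracker keeps many more fields):

```python
class RunningLoss:
    # Sketch of the frame-weighted average behind "tot_loss[... over N frames]".
    def __init__(self):
        self.loss_sum = 0.0
        self.num_frames = 0.0

    def update(self, loss: float, frames: float) -> None:
        self.loss_sum += loss * frames
        self.num_frames += frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.num_frames, 1.0)

tracker = RunningLoss()
tracker.update(0.2822, 21807.0)  # per-batch loss and frames, batch 15250 above
print(f"tot_loss over {tracker.num_frames:.2f} frames: {tracker.value:.4f}")
```
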
], batch size: 176, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:17:01,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=823734.0, ans=0.1 2023-06-20 23:17:04,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.998e+02 3.594e+02 4.256e+02 7.669e+02, threshold=7.187e+02, percent-clipped=3.0 2023-06-20 23:17:26,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=823794.0, ans=0.125 2023-06-20 23:17:38,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=823794.0, ans=0.0 2023-06-20 23:18:14,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=823914.0, ans=0.05 2023-06-20 23:18:23,669 INFO [train.py:996] (3/4) Epoch 5, batch 15350, loss[loss=0.2483, simple_loss=0.3365, pruned_loss=0.07999, over 21467.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.334, pruned_loss=0.09923, over 4268402.91 frames. ], batch size: 194, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:19:23,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=824154.0, ans=0.2 2023-06-20 23:19:50,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=824214.0, ans=0.0 2023-06-20 23:20:03,470 INFO [train.py:996] (3/4) Epoch 5, batch 15400, loss[loss=0.2468, simple_loss=0.3134, pruned_loss=0.09014, over 21673.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3358, pruned_loss=0.09779, over 4274005.39 frames. ], batch size: 230, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:20:15,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=824274.0, ans=0.1 2023-06-20 23:20:24,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=824334.0, ans=0.2 2023-06-20 23:20:25,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.899e+02 3.241e+02 4.047e+02 6.361e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 23:21:17,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=15.0 2023-06-20 23:21:31,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=824514.0, ans=0.2 2023-06-20 23:21:39,514 INFO [train.py:996] (3/4) Epoch 5, batch 15450, loss[loss=0.2287, simple_loss=0.289, pruned_loss=0.08421, over 21596.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3339, pruned_loss=0.09697, over 4266085.93 frames. 
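
The `Whitening` lines fire when a module's feature covariance drifts too far from isotropic: the metric is 1.0 when the per-group channel covariance is a multiple of the identity and grows as the eigenvalue spread widens, and an entry is logged whenever it crosses the configured limit. A sketch of one such metric for a single group; treat the exact normalization as an assumption rather than a copy of `scaling.py`:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels), one whitening group.
    # Equals 1.0 iff cov(x) is a multiple of the identity; larger otherwise.
    x = x - x.mean(dim=0)
    cov = x.t() @ x / x.shape[0]
    num_channels = x.shape[1]
    return (cov ** 2).sum() / (cov.diagonal().mean() ** 2 * num_channels)

white = torch.randn(10000, 256)
print(float(whitening_metric(white)))  # close to 1.0 for isotropic features
print(float(whitening_metric(white * torch.linspace(0.1, 3.0, 256))))  # well above 1.0
```
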
], batch size: 212, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:21:49,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=824574.0, ans=0.125 2023-06-20 23:21:52,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=824574.0, ans=0.125 2023-06-20 23:22:33,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=824694.0, ans=0.125 2023-06-20 23:22:44,749 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:23:01,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.95 vs. limit=12.0 2023-06-20 23:23:20,708 INFO [train.py:996] (3/4) Epoch 5, batch 15500, loss[loss=0.3384, simple_loss=0.3919, pruned_loss=0.1425, over 21405.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3359, pruned_loss=0.09714, over 4260769.41 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:23:52,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=824934.0, ans=0.125 2023-06-20 23:23:54,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.818e+02 3.290e+02 3.883e+02 6.635e+02, threshold=6.579e+02, percent-clipped=1.0 2023-06-20 23:24:22,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=824994.0, ans=0.125 2023-06-20 23:24:45,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-20 23:24:57,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=825114.0, ans=0.2 2023-06-20 23:24:58,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-20 23:25:02,094 INFO [train.py:996] (3/4) Epoch 5, batch 15550, loss[loss=0.2154, simple_loss=0.2982, pruned_loss=0.06626, over 21720.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3325, pruned_loss=0.09479, over 4259635.79 frames. ], batch size: 298, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:25:34,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=825234.0, ans=0.2 2023-06-20 23:25:45,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=825294.0, ans=0.0 2023-06-20 23:25:50,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=825294.0, ans=0.2 2023-06-20 23:26:18,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=825354.0, ans=0.0 2023-06-20 23:26:21,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=825354.0, ans=0.0 2023-06-20 23:26:27,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=825414.0, ans=0.0 2023-06-20 23:26:42,173 INFO [train.py:996] (3/4) Epoch 5, batch 15600, loss[loss=0.2449, simple_loss=0.3279, pruned_loss=0.0809, over 21492.00 frames. 
], tot_loss[loss=0.2553, simple_loss=0.3255, pruned_loss=0.09257, over 4267176.41 frames. ], batch size: 389, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:27:09,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.848e+02 3.319e+02 3.887e+02 5.745e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-20 23:27:55,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=825654.0, ans=0.125 2023-06-20 23:28:17,191 INFO [train.py:996] (3/4) Epoch 5, batch 15650, loss[loss=0.2615, simple_loss=0.3216, pruned_loss=0.1007, over 21416.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3232, pruned_loss=0.09131, over 4267287.46 frames. ], batch size: 389, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:29:16,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=825894.0, ans=0.0 2023-06-20 23:29:39,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=826014.0, ans=0.0 2023-06-20 23:29:53,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=826014.0, ans=0.0 2023-06-20 23:30:01,338 INFO [train.py:996] (3/4) Epoch 5, batch 15700, loss[loss=0.2387, simple_loss=0.3178, pruned_loss=0.07984, over 21772.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3188, pruned_loss=0.09053, over 4265565.10 frames. ], batch size: 371, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:30:29,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.764e+02 3.253e+02 4.322e+02 6.346e+02, threshold=6.507e+02, percent-clipped=0.0 2023-06-20 23:30:46,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=826194.0, ans=0.0 2023-06-20 23:31:13,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. limit=10.0 2023-06-20 23:31:25,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0 2023-06-20 23:31:25,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=826314.0, ans=0.07 2023-06-20 23:31:41,389 INFO [train.py:996] (3/4) Epoch 5, batch 15750, loss[loss=0.1953, simple_loss=0.2579, pruned_loss=0.06636, over 21603.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3154, pruned_loss=0.09087, over 4255221.73 frames. ], batch size: 247, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:31:55,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=826374.0, ans=0.035 2023-06-20 23:32:42,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-20 23:33:21,174 INFO [train.py:996] (3/4) Epoch 5, batch 15800, loss[loss=0.2364, simple_loss=0.2893, pruned_loss=0.09177, over 21759.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3106, pruned_loss=0.08979, over 4251312.94 frames. 
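
The `grad_scale` field is the fp16 loss-scaling factor: it is halved when a scaled gradient overflows and grown back after clean steps, which is why it oscillates between 16.0 and 32.0 across these batches. A sketch using PyTorch's stock `GradScaler`; icefall wires its own scaler up inside `train.py`, and the constructor values below are chosen for illustration, not taken from the recipe:

```python
import torch

# Illustrative configuration; icefall manages its own scaler in train.py.
scaler = torch.cuda.amp.GradScaler(
    init_scale=32.0,      # matches the grad_scale printed around here
    backoff_factor=0.5,   # halve on overflow: 32.0 -> 16.0
    growth_factor=2.0,    # double again after growth_interval clean steps
    growth_interval=2000,
)

# One training step (model, optimizer, compute_loss, batch assumed defined):
#   with torch.cuda.amp.autocast():
#       loss = compute_loss(model, batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)   # skipped if infs/NaNs are found in the grads
#   scaler.update()          # adjusts the scale, i.e. grad_scale in this log
```
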
], batch size: 112, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:33:21,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=826674.0, ans=0.125 2023-06-20 23:33:50,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.927e+02 3.607e+02 4.746e+02 7.598e+02, threshold=7.214e+02, percent-clipped=2.0 2023-06-20 23:34:35,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=826854.0, ans=0.125 2023-06-20 23:35:01,893 INFO [train.py:996] (3/4) Epoch 5, batch 15850, loss[loss=0.2854, simple_loss=0.3396, pruned_loss=0.1156, over 21335.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3154, pruned_loss=0.09225, over 4251700.38 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:36:01,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=827094.0, ans=0.07 2023-06-20 23:36:41,678 INFO [train.py:996] (3/4) Epoch 5, batch 15900, loss[loss=0.2517, simple_loss=0.2971, pruned_loss=0.1032, over 21457.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3148, pruned_loss=0.09275, over 4251957.42 frames. ], batch size: 212, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:36:43,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=827274.0, ans=0.0 2023-06-20 23:37:11,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.862e+02 3.189e+02 4.240e+02 8.969e+02, threshold=6.379e+02, percent-clipped=1.0 2023-06-20 23:37:12,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=827334.0, ans=0.0 2023-06-20 23:38:09,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=827514.0, ans=0.125 2023-06-20 23:38:09,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=827514.0, ans=0.125 2023-06-20 23:38:22,868 INFO [train.py:996] (3/4) Epoch 5, batch 15950, loss[loss=0.2398, simple_loss=0.3187, pruned_loss=0.08043, over 21736.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3142, pruned_loss=0.09048, over 4241403.37 frames. ], batch size: 282, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:38:23,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.23 vs. 
limit=15.0 2023-06-20 23:38:37,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=827574.0, ans=0.125 2023-06-20 23:38:55,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=827634.0, ans=0.125 2023-06-20 23:39:03,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=827694.0, ans=0.125 2023-06-20 23:39:03,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=827694.0, ans=0.07 2023-06-20 23:39:30,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=827754.0, ans=0.125 2023-06-20 23:39:57,217 INFO [train.py:996] (3/4) Epoch 5, batch 16000, loss[loss=0.2254, simple_loss=0.2796, pruned_loss=0.08561, over 20705.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3147, pruned_loss=0.08809, over 4249491.44 frames. ], batch size: 608, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:39:57,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=827874.0, ans=0.2 2023-06-20 23:40:20,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=827874.0, ans=0.125 2023-06-20 23:40:30,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.522e+02 3.012e+02 3.700e+02 7.317e+02, threshold=6.025e+02, percent-clipped=2.0 2023-06-20 23:40:58,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=827994.0, ans=0.125 2023-06-20 23:41:18,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=828054.0, ans=0.125 2023-06-20 23:41:43,628 INFO [train.py:996] (3/4) Epoch 5, batch 16050, loss[loss=0.2461, simple_loss=0.337, pruned_loss=0.07761, over 21298.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3171, pruned_loss=0.08608, over 4254589.58 frames. ], batch size: 159, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:42:21,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=828294.0, ans=0.0 2023-06-20 23:42:58,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=828354.0, ans=0.07 2023-06-20 23:43:08,266 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:43:22,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=828474.0, ans=0.125 2023-06-20 23:43:23,960 INFO [train.py:996] (3/4) Epoch 5, batch 16100, loss[loss=0.2787, simple_loss=0.3343, pruned_loss=0.1115, over 21767.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3214, pruned_loss=0.089, over 4265728.90 frames. 
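
The balancer entries (`...balancer1.prob`, `...balancer2.min_positive`, `...min_abs`) refer to per-channel activation constraints: with probability `prob` (itself one of the scheduled values above), the module nudges gradients so that each channel's fraction of positive activations and mean absolute value stay inside configured bounds. A sketch of the statistics being constrained, as a diagnostic only; the gradient-side correction in `scaling.py` is not reproduced here:

```python
import torch

def balancer_stats(x: torch.Tensor):
    # x: (num_frames, num_channels) activations of one module.
    frac_positive = (x > 0).float().mean(dim=0)  # kept in [min_positive, max_positive]
    mean_abs = x.abs().mean(dim=0)               # kept in [min_abs, max_abs]
    return frac_positive, mean_abs

frac, mag = balancer_stats(torch.randn(1000, 256))
print(float(frac.mean()), float(mag.mean()))  # ~0.5 and ~0.80 for unit Gaussians
```
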
], batch size: 441, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:43:35,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=828474.0, ans=0.0 2023-06-20 23:43:46,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828534.0, ans=0.1 2023-06-20 23:43:52,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.759e+02 3.248e+02 4.030e+02 6.532e+02, threshold=6.496e+02, percent-clipped=1.0 2023-06-20 23:44:36,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=828654.0, ans=0.0 2023-06-20 23:44:45,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=828714.0, ans=0.2 2023-06-20 23:44:49,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=828714.0, ans=0.125 2023-06-20 23:44:57,667 INFO [train.py:996] (3/4) Epoch 5, batch 16150, loss[loss=0.2462, simple_loss=0.3341, pruned_loss=0.07917, over 21808.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3225, pruned_loss=0.09236, over 4275537.09 frames. ], batch size: 282, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:46:19,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 23:46:32,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=829014.0, ans=0.125 2023-06-20 23:46:40,194 INFO [train.py:996] (3/4) Epoch 5, batch 16200, loss[loss=0.302, simple_loss=0.3651, pruned_loss=0.1194, over 21804.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3277, pruned_loss=0.09409, over 4273120.96 frames. ], batch size: 441, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:46:59,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=829134.0, ans=0.125 2023-06-20 23:47:00,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-20 23:47:09,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.854e+02 3.310e+02 3.979e+02 8.024e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-20 23:47:11,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=829134.0, ans=0.09899494936611666 2023-06-20 23:48:04,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-20 23:48:19,422 INFO [train.py:996] (3/4) Epoch 5, batch 16250, loss[loss=0.1939, simple_loss=0.2814, pruned_loss=0.0532, over 21635.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3252, pruned_loss=0.09144, over 4274111.66 frames. ], batch size: 263, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:48:34,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=829374.0, ans=0.125 2023-06-20 23:48:38,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.05 vs. 
limit=22.5 2023-06-20 23:48:52,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=829434.0, ans=0.0 2023-06-20 23:49:05,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=829494.0, ans=15.0 2023-06-20 23:49:27,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=829554.0, ans=0.0 2023-06-20 23:49:28,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=829554.0, ans=0.0 2023-06-20 23:50:03,638 INFO [train.py:996] (3/4) Epoch 5, batch 16300, loss[loss=0.2294, simple_loss=0.3009, pruned_loss=0.07893, over 21697.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3196, pruned_loss=0.08694, over 4268373.56 frames. ], batch size: 332, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:50:12,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=829674.0, ans=0.125 2023-06-20 23:50:27,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-20 23:50:29,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.495e+02 2.799e+02 3.333e+02 5.849e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 23:51:44,330 INFO [train.py:996] (3/4) Epoch 5, batch 16350, loss[loss=0.3155, simple_loss=0.3793, pruned_loss=0.1258, over 21783.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3215, pruned_loss=0.08886, over 4256892.09 frames. ], batch size: 441, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:53:23,349 INFO [train.py:996] (3/4) Epoch 5, batch 16400, loss[loss=0.2704, simple_loss=0.3258, pruned_loss=0.1076, over 21878.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.326, pruned_loss=0.09127, over 4261238.68 frames. ], batch size: 107, lr: 6.20e-03, grad_scale: 32.0 2023-06-20 23:53:42,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=830334.0, ans=0.0 2023-06-20 23:53:52,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.889e+02 3.302e+02 3.961e+02 7.962e+02, threshold=6.603e+02, percent-clipped=4.0 2023-06-20 23:54:06,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=830394.0, ans=0.125 2023-06-20 23:54:10,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-20 23:54:53,039 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:54:53,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=830514.0, ans=0.125 2023-06-20 23:55:02,577 INFO [train.py:996] (3/4) Epoch 5, batch 16450, loss[loss=0.2192, simple_loss=0.2889, pruned_loss=0.07475, over 21777.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3262, pruned_loss=0.09295, over 4272140.21 frames. ], batch size: 247, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:56:16,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. 
limit=12.0 2023-06-20 23:56:21,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=830814.0, ans=0.0 2023-06-20 23:56:21,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=830814.0, ans=0.125 2023-06-20 23:56:41,414 INFO [train.py:996] (3/4) Epoch 5, batch 16500, loss[loss=0.2214, simple_loss=0.2807, pruned_loss=0.0811, over 21423.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3258, pruned_loss=0.09346, over 4277216.26 frames. ], batch size: 211, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:57:19,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.018e+02 3.661e+02 4.243e+02 1.006e+03, threshold=7.323e+02, percent-clipped=9.0 2023-06-20 23:58:21,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=831174.0, ans=0.2 2023-06-20 23:58:23,176 INFO [train.py:996] (3/4) Epoch 5, batch 16550, loss[loss=0.281, simple_loss=0.3355, pruned_loss=0.1133, over 21320.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3257, pruned_loss=0.09104, over 4272400.12 frames. ], batch size: 159, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:58:56,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=831234.0, ans=0.125 2023-06-20 23:58:56,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=831234.0, ans=0.0 2023-06-20 23:59:03,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=831234.0, ans=0.0 2023-06-20 23:59:11,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-20 23:59:13,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=831294.0, ans=0.125 2023-06-20 23:59:21,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=831294.0, ans=0.125 2023-06-21 00:00:14,131 INFO [train.py:996] (3/4) Epoch 5, batch 16600, loss[loss=0.2797, simple_loss=0.3841, pruned_loss=0.08767, over 21826.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3327, pruned_loss=0.09379, over 4263357.45 frames. ], batch size: 316, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:00:28,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. 
limit=15.0 2023-06-21 00:00:42,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.262e+02 3.858e+02 4.542e+02 8.769e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-21 00:00:50,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=831594.0, ans=0.1 2023-06-21 00:00:52,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=831594.0, ans=0.125 2023-06-21 00:01:06,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=831594.0, ans=0.125 2023-06-21 00:01:31,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=831654.0, ans=0.2 2023-06-21 00:01:44,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=831714.0, ans=0.125 2023-06-21 00:01:49,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=831714.0, ans=0.125 2023-06-21 00:01:56,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=831774.0, ans=0.07 2023-06-21 00:01:57,512 INFO [train.py:996] (3/4) Epoch 5, batch 16650, loss[loss=0.3296, simple_loss=0.391, pruned_loss=0.1341, over 21783.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3408, pruned_loss=0.09685, over 4265720.82 frames. ], batch size: 441, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:02:06,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=831774.0, ans=0.2 2023-06-21 00:03:32,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=832014.0, ans=0.125 2023-06-21 00:03:40,741 INFO [train.py:996] (3/4) Epoch 5, batch 16700, loss[loss=0.2744, simple_loss=0.3542, pruned_loss=0.09734, over 21625.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3409, pruned_loss=0.09734, over 4271866.97 frames. ], batch size: 414, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:03:41,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-21 00:04:14,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=832134.0, ans=0.0 2023-06-21 00:04:18,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 2.932e+02 3.507e+02 4.315e+02 8.242e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-21 00:04:39,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=832194.0, ans=0.0 2023-06-21 00:04:43,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=832194.0, ans=0.0 2023-06-21 00:04:43,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-21 00:05:30,376 INFO [train.py:996] (3/4) Epoch 5, batch 16750, loss[loss=0.2425, simple_loss=0.3262, pruned_loss=0.07943, over 20738.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3437, pruned_loss=0.09959, over 4259146.61 frames. 
], batch size: 607, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:05:32,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=832374.0, ans=0.0 2023-06-21 00:06:08,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=832434.0, ans=0.125 2023-06-21 00:07:01,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=12.0 2023-06-21 00:07:16,824 INFO [train.py:996] (3/4) Epoch 5, batch 16800, loss[loss=0.2592, simple_loss=0.3256, pruned_loss=0.09644, over 21501.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3465, pruned_loss=0.09858, over 4255625.04 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:07:48,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.479e+02 3.935e+02 4.857e+02 8.503e+02, threshold=7.870e+02, percent-clipped=2.0 2023-06-21 00:07:50,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=832734.0, ans=0.125 2023-06-21 00:08:41,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=832914.0, ans=0.0 2023-06-21 00:08:55,653 INFO [train.py:996] (3/4) Epoch 5, batch 16850, loss[loss=0.2728, simple_loss=0.3306, pruned_loss=0.1075, over 21771.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3415, pruned_loss=0.09821, over 4264990.31 frames. ], batch size: 441, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:09:16,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=833034.0, ans=0.125 2023-06-21 00:09:42,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=833094.0, ans=0.05 2023-06-21 00:10:17,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=833214.0, ans=0.125 2023-06-21 00:10:23,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=833214.0, ans=0.0 2023-06-21 00:10:35,986 INFO [train.py:996] (3/4) Epoch 5, batch 16900, loss[loss=0.211, simple_loss=0.2775, pruned_loss=0.07223, over 21550.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3361, pruned_loss=0.09623, over 4268864.49 frames. ], batch size: 230, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:11:07,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.951e+02 3.440e+02 4.010e+02 6.855e+02, threshold=6.879e+02, percent-clipped=0.0 2023-06-21 00:11:22,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=833394.0, ans=0.125 2023-06-21 00:12:07,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-21 00:12:09,925 INFO [train.py:996] (3/4) Epoch 5, batch 16950, loss[loss=0.2778, simple_loss=0.3325, pruned_loss=0.1115, over 21842.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3282, pruned_loss=0.0938, over 4276418.65 frames. 
], batch size: 107, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:12:18,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=833574.0, ans=0.0 2023-06-21 00:12:45,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833634.0, ans=0.1 2023-06-21 00:13:12,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=833754.0, ans=0.125 2023-06-21 00:13:24,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=833754.0, ans=0.04949747468305833 2023-06-21 00:13:27,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=833754.0, ans=0.125 2023-06-21 00:13:30,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=833754.0, ans=0.0 2023-06-21 00:13:38,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=833814.0, ans=0.125 2023-06-21 00:13:59,415 INFO [train.py:996] (3/4) Epoch 5, batch 17000, loss[loss=0.272, simple_loss=0.3283, pruned_loss=0.1078, over 21258.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3283, pruned_loss=0.09525, over 4273662.81 frames. ], batch size: 176, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:14:02,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=833874.0, ans=0.125 2023-06-21 00:14:12,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833874.0, ans=0.1 2023-06-21 00:14:27,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 2.869e+02 3.423e+02 4.013e+02 9.065e+02, threshold=6.846e+02, percent-clipped=1.0 2023-06-21 00:14:45,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=833994.0, ans=0.125 2023-06-21 00:14:58,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=834054.0, ans=0.125 2023-06-21 00:15:05,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=834054.0, ans=0.125 2023-06-21 00:15:36,739 INFO [train.py:996] (3/4) Epoch 5, batch 17050, loss[loss=0.2627, simple_loss=0.3566, pruned_loss=0.08436, over 21626.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3343, pruned_loss=0.09713, over 4274313.43 frames. 
], batch size: 230, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:15:37,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=834174.0, ans=0.125 2023-06-21 00:15:48,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=834174.0, ans=0.125 2023-06-21 00:16:21,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=834294.0, ans=0.0 2023-06-21 00:16:51,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=834414.0, ans=0.5 2023-06-21 00:17:14,707 INFO [train.py:996] (3/4) Epoch 5, batch 17100, loss[loss=0.2388, simple_loss=0.3025, pruned_loss=0.08756, over 21863.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3329, pruned_loss=0.09782, over 4279897.91 frames. ], batch size: 247, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:17:31,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=834534.0, ans=0.0 2023-06-21 00:17:43,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.091e+02 3.634e+02 4.796e+02 1.009e+03, threshold=7.268e+02, percent-clipped=8.0 2023-06-21 00:17:49,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=834594.0, ans=0.0 2023-06-21 00:18:21,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=834654.0, ans=0.2 2023-06-21 00:18:40,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=834714.0, ans=0.125 2023-06-21 00:18:53,470 INFO [train.py:996] (3/4) Epoch 5, batch 17150, loss[loss=0.1871, simple_loss=0.2613, pruned_loss=0.05652, over 21456.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3287, pruned_loss=0.09725, over 4291489.51 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:19:50,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=834954.0, ans=0.125 2023-06-21 00:20:33,446 INFO [train.py:996] (3/4) Epoch 5, batch 17200, loss[loss=0.2536, simple_loss=0.3197, pruned_loss=0.09379, over 21463.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3285, pruned_loss=0.09732, over 4289885.00 frames. ], batch size: 211, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:20:56,567 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:21:12,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 2.764e+02 3.023e+02 3.387e+02 5.035e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-21 00:22:03,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835314.0, ans=0.1 2023-06-21 00:22:19,266 INFO [train.py:996] (3/4) Epoch 5, batch 17250, loss[loss=0.289, simple_loss=0.358, pruned_loss=0.11, over 21685.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3319, pruned_loss=0.09817, over 4282135.51 frames. 
], batch size: 298, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:23:57,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:24:02,202 INFO [train.py:996] (3/4) Epoch 5, batch 17300, loss[loss=0.2933, simple_loss=0.3744, pruned_loss=0.1061, over 17450.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3393, pruned_loss=0.1025, over 4276071.85 frames. ], batch size: 60, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:24:04,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835674.0, ans=0.1 2023-06-21 00:24:41,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.738e+02 3.630e+02 4.657e+02 6.212e+02 1.066e+03, threshold=9.314e+02, percent-clipped=26.0 2023-06-21 00:25:26,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-21 00:25:40,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=835914.0, ans=0.125 2023-06-21 00:25:48,508 INFO [train.py:996] (3/4) Epoch 5, batch 17350, loss[loss=0.2648, simple_loss=0.3335, pruned_loss=0.09801, over 20732.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3409, pruned_loss=0.1023, over 4281052.64 frames. ], batch size: 607, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:26:11,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=836034.0, ans=0.125 2023-06-21 00:26:29,334 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:26:30,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=836094.0, ans=0.125 2023-06-21 00:26:46,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=836154.0, ans=0.0 2023-06-21 00:27:01,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=836154.0, ans=0.125 2023-06-21 00:27:24,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=836214.0, ans=0.0 2023-06-21 00:27:29,133 INFO [train.py:996] (3/4) Epoch 5, batch 17400, loss[loss=0.253, simple_loss=0.3274, pruned_loss=0.08925, over 21788.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.338, pruned_loss=0.09849, over 4279256.03 frames. 
], batch size: 316, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:27:47,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=836274.0, ans=0.125 2023-06-21 00:28:02,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836334.0, ans=0.1 2023-06-21 00:28:06,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=836334.0, ans=0.0 2023-06-21 00:28:10,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.783e+02 3.227e+02 3.615e+02 5.491e+02, threshold=6.454e+02, percent-clipped=0.0 2023-06-21 00:28:33,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=836454.0, ans=0.125 2023-06-21 00:28:53,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=836514.0, ans=0.0 2023-06-21 00:29:16,078 INFO [train.py:996] (3/4) Epoch 5, batch 17450, loss[loss=0.2672, simple_loss=0.3413, pruned_loss=0.09658, over 20635.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3326, pruned_loss=0.09479, over 4268217.64 frames. ], batch size: 607, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:29:56,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=836694.0, ans=0.05 2023-06-21 00:30:42,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=836814.0, ans=0.07 2023-06-21 00:31:00,460 INFO [train.py:996] (3/4) Epoch 5, batch 17500, loss[loss=0.2969, simple_loss=0.3582, pruned_loss=0.1178, over 21839.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3283, pruned_loss=0.09206, over 4266271.48 frames. ], batch size: 107, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:31:28,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836934.0, ans=0.1 2023-06-21 00:31:34,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.759e+02 3.126e+02 4.015e+02 6.726e+02, threshold=6.252e+02, percent-clipped=1.0 2023-06-21 00:32:32,756 INFO [train.py:996] (3/4) Epoch 5, batch 17550, loss[loss=0.2381, simple_loss=0.3194, pruned_loss=0.07842, over 21363.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3279, pruned_loss=0.09101, over 4266474.98 frames. ], batch size: 176, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:32:50,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=837174.0, ans=0.125 2023-06-21 00:33:00,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=837234.0, ans=0.0 2023-06-21 00:33:02,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-21 00:33:13,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. 
limit=22.5 2023-06-21 00:33:14,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=837294.0, ans=0.125 2023-06-21 00:33:32,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-21 00:33:43,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=837354.0, ans=0.125 2023-06-21 00:34:02,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=837414.0, ans=0.07 2023-06-21 00:34:18,778 INFO [train.py:996] (3/4) Epoch 5, batch 17600, loss[loss=0.2894, simple_loss=0.3565, pruned_loss=0.1111, over 21822.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3301, pruned_loss=0.09184, over 4266378.26 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:34:29,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=837474.0, ans=0.95 2023-06-21 00:34:46,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=837534.0, ans=0.1 2023-06-21 00:34:53,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.862e+02 3.527e+02 4.406e+02 6.176e+02, threshold=7.053e+02, percent-clipped=0.0 2023-06-21 00:35:01,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-21 00:35:22,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=837654.0, ans=0.125 2023-06-21 00:35:28,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=837654.0, ans=0.0 2023-06-21 00:35:44,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=837714.0, ans=0.125 2023-06-21 00:35:47,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=837714.0, ans=0.1 2023-06-21 00:35:59,199 INFO [train.py:996] (3/4) Epoch 5, batch 17650, loss[loss=0.242, simple_loss=0.3197, pruned_loss=0.08215, over 21584.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3285, pruned_loss=0.09146, over 4263099.92 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:36:02,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=837774.0, ans=0.125 2023-06-21 00:36:06,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=837774.0, ans=0.125 2023-06-21 00:36:28,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=837834.0, ans=0.07 2023-06-21 00:37:19,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=15.0 2023-06-21 00:37:25,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=838014.0, ans=0.0 2023-06-21 00:37:27,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=838014.0, ans=0.125 2023-06-21 00:37:41,998 INFO [train.py:996] (3/4) Epoch 5, batch 17700, loss[loss=0.2877, simple_loss=0.356, pruned_loss=0.1097, over 21435.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3241, pruned_loss=0.08937, over 4259903.73 frames. ], batch size: 131, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:37:50,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=838074.0, ans=0.125 2023-06-21 00:38:03,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=838134.0, ans=0.2 2023-06-21 00:38:16,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-06-21 00:38:17,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.950e+02 3.482e+02 4.668e+02 9.100e+02, threshold=6.963e+02, percent-clipped=4.0 2023-06-21 00:38:42,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.86 vs. limit=10.0 2023-06-21 00:38:45,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2023-06-21 00:38:49,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=838254.0, ans=0.05 2023-06-21 00:38:55,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=838254.0, ans=0.125 2023-06-21 00:38:57,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=838254.0, ans=0.1 2023-06-21 00:39:08,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=838314.0, ans=0.2 2023-06-21 00:39:13,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=838314.0, ans=0.1 2023-06-21 00:39:21,470 INFO [train.py:996] (3/4) Epoch 5, batch 17750, loss[loss=0.2483, simple_loss=0.3229, pruned_loss=0.08688, over 21798.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3307, pruned_loss=0.09205, over 4261083.64 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:39:44,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=838434.0, ans=0.125 2023-06-21 00:39:44,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=838434.0, ans=0.125 2023-06-21 00:40:41,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-21 00:41:07,658 INFO [train.py:996] (3/4) Epoch 5, batch 17800, loss[loss=0.2511, simple_loss=0.3188, pruned_loss=0.09172, over 21593.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.331, pruned_loss=0.09208, over 4264065.22 frames. 
], batch size: 230, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:41:18,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=838674.0, ans=0.125 2023-06-21 00:41:28,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=838734.0, ans=0.035 2023-06-21 00:41:49,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.927e+02 3.424e+02 3.955e+02 9.585e+02, threshold=6.848e+02, percent-clipped=3.0 2023-06-21 00:41:51,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=838794.0, ans=0.125 2023-06-21 00:41:58,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=838794.0, ans=0.05 2023-06-21 00:42:49,084 INFO [train.py:996] (3/4) Epoch 5, batch 17850, loss[loss=0.3432, simple_loss=0.4002, pruned_loss=0.1431, over 21791.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3306, pruned_loss=0.09276, over 4264223.20 frames. ], batch size: 441, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:43:51,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=839154.0, ans=0.05 2023-06-21 00:44:03,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-21 00:44:29,931 INFO [train.py:996] (3/4) Epoch 5, batch 17900, loss[loss=0.343, simple_loss=0.42, pruned_loss=0.133, over 21471.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3354, pruned_loss=0.09432, over 4264379.83 frames. ], batch size: 471, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:45:19,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.900e+02 3.378e+02 3.906e+02 6.654e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-21 00:46:22,383 INFO [train.py:996] (3/4) Epoch 5, batch 17950, loss[loss=0.2559, simple_loss=0.3463, pruned_loss=0.08277, over 21486.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3342, pruned_loss=0.0902, over 4265957.61 frames. ], batch size: 507, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:46:22,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=839574.0, ans=0.0 2023-06-21 00:46:58,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=839694.0, ans=0.125 2023-06-21 00:47:04,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=839694.0, ans=10.0 2023-06-21 00:48:01,508 INFO [train.py:996] (3/4) Epoch 5, batch 18000, loss[loss=0.2074, simple_loss=0.2801, pruned_loss=0.06735, over 21597.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3285, pruned_loss=0.08931, over 4264920.62 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:48:01,508 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 00:48:17,785 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2664, simple_loss=0.3658, pruned_loss=0.08353, over 1796401.00 frames. 
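The many "ScheduledFloat: name=..., batch_count=..., ans=..." lines from scaling.py:182 report module hyper-parameters whose values change over training: "ans" is the schedule's value at the current batch_count. Below is a minimal, hypothetical sketch of a piecewise-linear schedule that would produce such values; the class name, breakpoints, and printed name are made up for illustration and are not taken from the recipe's actual scaling.py.

class PiecewiseLinearFloat:
    """A float scheduled by batch_count via linear interpolation (sketch)."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; assumed convention, not icefall's API.
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        x0, y0 = self.points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in self.points[1:]:
            if batch_count <= x1:
                # Linear interpolation between the two bracketing breakpoints.
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # constant after the last breakpoint


# Illustrative use: a skip-rate that decays to 0.0 early in training and then
# stays there, consistent with the many "...skip_rate, ..., ans=0.0" lines.
skip_rate = PiecewiseLinearFloat((0.0, 0.2), (4000.0, 0.0))
print(f"ScheduledFloat: name=example.ff3_skip_rate, "
      f"batch_count=836334.0, ans={skip_rate.value_at(836334.0)}")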
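The optim.py:471 lines report gradient-clipping statistics once per logging interval. In every such line here the threshold equals 2.0 times the median quartile (e.g. threshold=6.454e+02 = 2.0 x 3.227e+02, and 6.252e+02 = 2.0 x 3.126e+02), which suggests the clipper scales the median of recently observed gradient norms by Clipping_scale. A minimal, hypothetical sketch of such a clipper follows; the class name, window size, and percent-clipped bookkeeping are illustrative assumptions, not icefall's actual optim.py.

from collections import deque

import torch


class MedianGradNormClipper:
    """Clip gradients against clipping_scale * median of recent grad norms.

    Hypothetical sketch; the real implementation in optim.py may differ in
    windowing and bookkeeping details.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent total gradient norms
        self.num_seen = 0
        self.num_clipped = 0

    def clip_(self, parameters) -> float:
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([g.detach().norm() for g in grads])
        ).item()
        self.norms.append(total_norm)
        self.num_seen += 1

        ranked = sorted(self.norms)
        # min / 25% / median / 75% / max, in the order the log prints them.
        quartiles = [ranked[int(r * (len(ranked) - 1))]
                     for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]

        if total_norm > threshold:
            self.num_clipped += 1
            for g in grads:
                g.mul_(threshold / total_norm)

        # percent-clipped is assumed cumulative here; the real counter may
        # reset every reporting interval.
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{q:.3e}" for q in quartiles)
            + f", threshold={threshold:.3e}, "
            + f"percent-clipped={100.0 * self.num_clipped / self.num_seen:.1f}"
        )
        return total_norm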
2023-06-21 00:48:17,785 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 00:48:18,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=839874.0, ans=0.1 2023-06-21 00:48:37,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=839934.0, ans=0.2 2023-06-21 00:49:02,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.602e+02 3.109e+02 3.503e+02 6.028e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-21 00:49:27,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-21 00:49:30,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=840054.0, ans=0.0 2023-06-21 00:49:30,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0 2023-06-21 00:49:51,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-21 00:49:58,588 INFO [train.py:996] (3/4) Epoch 5, batch 18050, loss[loss=0.2391, simple_loss=0.3049, pruned_loss=0.08662, over 21652.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3235, pruned_loss=0.08893, over 4262131.07 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:50:15,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=840174.0, ans=0.125 2023-06-21 00:51:17,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-21 00:51:19,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-21 00:51:39,018 INFO [train.py:996] (3/4) Epoch 5, batch 18100, loss[loss=0.2644, simple_loss=0.32, pruned_loss=0.1045, over 21155.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3301, pruned_loss=0.09219, over 4264152.30 frames. ], batch size: 143, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:51:57,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=840474.0, ans=0.2 2023-06-21 00:52:27,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.902e+02 3.495e+02 4.106e+02 8.308e+02, threshold=6.990e+02, percent-clipped=1.0 2023-06-21 00:52:29,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.27 vs. 
limit=15.0 2023-06-21 00:52:30,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=840594.0, ans=0.125 2023-06-21 00:52:35,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=840594.0, ans=0.125 2023-06-21 00:53:10,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=840714.0, ans=0.125 2023-06-21 00:53:22,797 INFO [train.py:996] (3/4) Epoch 5, batch 18150, loss[loss=0.2221, simple_loss=0.3005, pruned_loss=0.07192, over 21574.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3285, pruned_loss=0.09145, over 4268819.00 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:53:41,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=840834.0, ans=0.125 2023-06-21 00:53:47,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=840834.0, ans=0.125 2023-06-21 00:54:10,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=840894.0, ans=0.2 2023-06-21 00:54:13,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=840894.0, ans=0.125 2023-06-21 00:54:54,848 INFO [train.py:996] (3/4) Epoch 5, batch 18200, loss[loss=0.2002, simple_loss=0.2733, pruned_loss=0.06351, over 21774.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3209, pruned_loss=0.09056, over 4266540.52 frames. ], batch size: 124, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:55:28,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=841134.0, ans=0.125 2023-06-21 00:55:37,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.776e+02 3.291e+02 4.569e+02 1.152e+03, threshold=6.583e+02, percent-clipped=3.0 2023-06-21 00:56:09,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=841254.0, ans=0.0 2023-06-21 00:56:32,308 INFO [train.py:996] (3/4) Epoch 5, batch 18250, loss[loss=0.3044, simple_loss=0.3498, pruned_loss=0.1295, over 21746.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3157, pruned_loss=0.08862, over 4249221.96 frames. ], batch size: 508, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:56:40,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=841374.0, ans=0.0 2023-06-21 00:56:56,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=841434.0, ans=0.125 2023-06-21 00:57:03,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841434.0, ans=0.1 2023-06-21 00:57:05,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=841434.0, ans=0.0 2023-06-21 00:57:13,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=841494.0, ans=0.125 2023-06-21 00:58:11,253 INFO [train.py:996] (3/4) Epoch 5, batch 18300, loss[loss=0.2317, simple_loss=0.3028, pruned_loss=0.0803, over 21872.00 frames. 
], tot_loss[loss=0.2462, simple_loss=0.3159, pruned_loss=0.08831, over 4240979.20 frames. ], batch size: 118, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:58:54,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.809e+02 3.144e+02 3.817e+02 6.593e+02, threshold=6.288e+02, percent-clipped=1.0 2023-06-21 00:58:55,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=841794.0, ans=0.1 2023-06-21 00:59:14,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=841854.0, ans=0.125 2023-06-21 00:59:36,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-21 00:59:39,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=841914.0, ans=0.125 2023-06-21 00:59:49,938 INFO [train.py:996] (3/4) Epoch 5, batch 18350, loss[loss=0.2893, simple_loss=0.4234, pruned_loss=0.07758, over 19700.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3203, pruned_loss=0.08895, over 4237494.07 frames. ], batch size: 702, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 01:00:21,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-06-21 01:01:12,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=842214.0, ans=0.0 2023-06-21 01:01:30,162 INFO [train.py:996] (3/4) Epoch 5, batch 18400, loss[loss=0.2406, simple_loss=0.3279, pruned_loss=0.07664, over 21874.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3169, pruned_loss=0.08771, over 4249060.78 frames. ], batch size: 373, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:01:40,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=842274.0, ans=0.1 2023-06-21 01:02:14,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.972e+02 3.476e+02 4.424e+02 9.442e+02, threshold=6.951e+02, percent-clipped=6.0 2023-06-21 01:02:15,016 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:02:51,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=842454.0, ans=10.0 2023-06-21 01:02:57,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=842514.0, ans=0.0 2023-06-21 01:03:03,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=842514.0, ans=0.95 2023-06-21 01:03:09,997 INFO [train.py:996] (3/4) Epoch 5, batch 18450, loss[loss=0.192, simple_loss=0.281, pruned_loss=0.05147, over 21728.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3129, pruned_loss=0.08325, over 4251382.53 frames. 
], batch size: 298, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:03:25,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=842574.0, ans=0.0 2023-06-21 01:03:35,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=842634.0, ans=0.2 2023-06-21 01:03:43,491 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:04:36,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=842814.0, ans=0.025 2023-06-21 01:04:42,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=842814.0, ans=0.2 2023-06-21 01:04:47,242 INFO [train.py:996] (3/4) Epoch 5, batch 18500, loss[loss=0.2442, simple_loss=0.2984, pruned_loss=0.09502, over 21595.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3076, pruned_loss=0.08254, over 4244002.04 frames. ], batch size: 263, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:05:30,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.509e+02 2.864e+02 3.266e+02 4.867e+02, threshold=5.728e+02, percent-clipped=0.0 2023-06-21 01:05:34,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:06:02,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=843054.0, ans=0.125 2023-06-21 01:06:26,366 INFO [train.py:996] (3/4) Epoch 5, batch 18550, loss[loss=0.2297, simple_loss=0.2935, pruned_loss=0.08299, over 21519.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3039, pruned_loss=0.08165, over 4236920.29 frames. ], batch size: 391, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:06:35,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-21 01:06:51,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=843234.0, ans=0.125 2023-06-21 01:08:06,330 INFO [train.py:996] (3/4) Epoch 5, batch 18600, loss[loss=0.2651, simple_loss=0.3504, pruned_loss=0.08988, over 21607.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3021, pruned_loss=0.08181, over 4230719.58 frames. ], batch size: 442, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:08:21,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=843474.0, ans=0.0 2023-06-21 01:08:32,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-21 01:08:49,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 2.765e+02 3.271e+02 3.896e+02 6.265e+02, threshold=6.542e+02, percent-clipped=2.0 2023-06-21 01:09:40,826 INFO [train.py:996] (3/4) Epoch 5, batch 18650, loss[loss=0.2463, simple_loss=0.3033, pruned_loss=0.09462, over 21318.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3033, pruned_loss=0.08319, over 4215893.27 frames. 
], batch size: 551, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:09:47,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=843774.0, ans=0.125 2023-06-21 01:10:22,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=6.0 2023-06-21 01:10:28,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=843894.0, ans=0.125 2023-06-21 01:11:13,508 INFO [train.py:996] (3/4) Epoch 5, batch 18700, loss[loss=0.2426, simple_loss=0.303, pruned_loss=0.09113, over 22006.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.302, pruned_loss=0.08408, over 4230629.18 frames. ], batch size: 300, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:11:18,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=844074.0, ans=0.125 2023-06-21 01:11:56,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.719e+02 3.161e+02 4.088e+02 6.146e+02, threshold=6.321e+02, percent-clipped=0.0 2023-06-21 01:12:52,643 INFO [train.py:996] (3/4) Epoch 5, batch 18750, loss[loss=0.269, simple_loss=0.3392, pruned_loss=0.09939, over 21755.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3043, pruned_loss=0.08691, over 4250116.43 frames. ], batch size: 332, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:13:03,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=844374.0, ans=0.04949747468305833 2023-06-21 01:13:24,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=844434.0, ans=0.0 2023-06-21 01:14:26,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=844614.0, ans=0.1 2023-06-21 01:14:32,806 INFO [train.py:996] (3/4) Epoch 5, batch 18800, loss[loss=0.3452, simple_loss=0.4129, pruned_loss=0.1387, over 21540.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3111, pruned_loss=0.08917, over 4246557.75 frames. ], batch size: 471, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:15:11,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.115e+02 3.803e+02 4.953e+02 7.292e+02, threshold=7.607e+02, percent-clipped=7.0 2023-06-21 01:15:18,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=844794.0, ans=0.125 2023-06-21 01:16:07,886 INFO [train.py:996] (3/4) Epoch 5, batch 18850, loss[loss=0.2, simple_loss=0.2631, pruned_loss=0.06843, over 21185.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3066, pruned_loss=0.08355, over 4253265.84 frames. ], batch size: 159, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:16:24,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-21 01:17:11,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=845154.0, ans=0.1 2023-06-21 01:17:25,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=22.5 2023-06-21 01:17:46,565 INFO [train.py:996] (3/4) Epoch 5, batch 18900, loss[loss=0.2263, simple_loss=0.2854, pruned_loss=0.08362, over 21676.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3045, pruned_loss=0.08381, over 4262042.71 frames. ], batch size: 247, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:17:52,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-21 01:18:31,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.559e+02 2.920e+02 3.727e+02 6.054e+02, threshold=5.840e+02, percent-clipped=0.0 2023-06-21 01:18:32,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=845394.0, ans=0.04949747468305833 2023-06-21 01:18:36,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=845394.0, ans=0.125 2023-06-21 01:18:38,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=845394.0, ans=0.0 2023-06-21 01:18:40,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=845394.0, ans=0.125 2023-06-21 01:19:03,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=845454.0, ans=0.125 2023-06-21 01:19:27,655 INFO [train.py:996] (3/4) Epoch 5, batch 18950, loss[loss=0.2035, simple_loss=0.2567, pruned_loss=0.07512, over 21044.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3051, pruned_loss=0.08613, over 4276653.63 frames. ], batch size: 608, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:19:34,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=845574.0, ans=0.125 2023-06-21 01:20:38,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.26 vs. limit=10.0 2023-06-21 01:20:40,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=845754.0, ans=0.125 2023-06-21 01:21:08,459 INFO [train.py:996] (3/4) Epoch 5, batch 19000, loss[loss=0.3154, simple_loss=0.3776, pruned_loss=0.1266, over 21831.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3155, pruned_loss=0.08753, over 4285503.00 frames. ], batch size: 282, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:21:29,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=845874.0, ans=0.2 2023-06-21 01:21:44,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=845934.0, ans=0.125 2023-06-21 01:21:46,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. 
limit=15.0 2023-06-21 01:21:53,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.876e+02 3.456e+02 4.188e+02 7.110e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-21 01:21:55,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=845994.0, ans=0.015 2023-06-21 01:22:03,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=845994.0, ans=0.0 2023-06-21 01:22:47,089 INFO [train.py:996] (3/4) Epoch 5, batch 19050, loss[loss=0.3125, simple_loss=0.3621, pruned_loss=0.1315, over 21728.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3218, pruned_loss=0.09241, over 4287830.52 frames. ], batch size: 389, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:22:55,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=846174.0, ans=0.0 2023-06-21 01:23:45,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=846294.0, ans=0.04949747468305833 2023-06-21 01:23:53,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=846354.0, ans=0.2 2023-06-21 01:23:56,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=846354.0, ans=0.5 2023-06-21 01:23:59,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=846354.0, ans=0.04949747468305833 2023-06-21 01:23:59,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=846354.0, ans=15.0 2023-06-21 01:24:31,384 INFO [train.py:996] (3/4) Epoch 5, batch 19100, loss[loss=0.2168, simple_loss=0.2778, pruned_loss=0.07791, over 21686.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3196, pruned_loss=0.09305, over 4291094.99 frames. ], batch size: 316, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:24:53,073 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:24:53,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 01:25:22,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.859e+02 3.382e+02 4.111e+02 6.618e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-21 01:26:17,771 INFO [train.py:996] (3/4) Epoch 5, batch 19150, loss[loss=0.2809, simple_loss=0.3693, pruned_loss=0.09628, over 21702.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3232, pruned_loss=0.09453, over 4283418.19 frames. ], batch size: 298, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:26:18,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=846774.0, ans=0.125 2023-06-21 01:26:20,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-21 01:27:22,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=846954.0, ans=0.125 2023-06-21 01:28:00,623 INFO [train.py:996] (3/4) Epoch 5, batch 19200, loss[loss=0.3231, simple_loss=0.4119, pruned_loss=0.1172, over 21740.00 frames. 
], tot_loss[loss=0.2617, simple_loss=0.3339, pruned_loss=0.09479, over 4279623.93 frames. ], batch size: 351, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 01:28:01,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=847074.0, ans=0.2 2023-06-21 01:28:12,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=847074.0, ans=0.125 2023-06-21 01:28:19,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=847074.0, ans=0.125 2023-06-21 01:28:33,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=847134.0, ans=0.125 2023-06-21 01:28:35,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=847134.0, ans=0.0 2023-06-21 01:28:47,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.827e+02 3.204e+02 4.140e+02 7.071e+02, threshold=6.408e+02, percent-clipped=1.0 2023-06-21 01:28:54,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847194.0, ans=0.1 2023-06-21 01:28:57,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=847254.0, ans=0.125 2023-06-21 01:29:26,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847314.0, ans=0.1 2023-06-21 01:29:41,523 INFO [train.py:996] (3/4) Epoch 5, batch 19250, loss[loss=0.2177, simple_loss=0.2959, pruned_loss=0.06976, over 21605.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3363, pruned_loss=0.09141, over 4277969.18 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:29:48,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=847374.0, ans=0.0 2023-06-21 01:30:17,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=847434.0, ans=0.1 2023-06-21 01:30:25,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=847494.0, ans=0.0 2023-06-21 01:30:30,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-21 01:31:11,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=847614.0, ans=0.2 2023-06-21 01:31:20,349 INFO [train.py:996] (3/4) Epoch 5, batch 19300, loss[loss=0.2239, simple_loss=0.3023, pruned_loss=0.07269, over 21279.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3313, pruned_loss=0.0889, over 4281782.86 frames. 
], batch size: 176, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:31:24,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=847674.0, ans=0.125 2023-06-21 01:31:32,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=847674.0, ans=0.0 2023-06-21 01:32:07,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.704e+02 3.211e+02 3.924e+02 6.818e+02, threshold=6.422e+02, percent-clipped=2.0 2023-06-21 01:32:11,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=847794.0, ans=0.125 2023-06-21 01:32:14,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=847794.0, ans=0.2 2023-06-21 01:32:34,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-21 01:32:53,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=847914.0, ans=0.125 2023-06-21 01:33:01,213 INFO [train.py:996] (3/4) Epoch 5, batch 19350, loss[loss=0.3101, simple_loss=0.3756, pruned_loss=0.1223, over 21555.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3269, pruned_loss=0.08548, over 4272283.82 frames. ], batch size: 509, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:33:09,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=847974.0, ans=0.125 2023-06-21 01:33:11,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=847974.0, ans=0.5 2023-06-21 01:33:57,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=848154.0, ans=0.125 2023-06-21 01:34:35,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=848214.0, ans=0.0 2023-06-21 01:34:35,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-06-21 01:34:37,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-21 01:34:39,455 INFO [train.py:996] (3/4) Epoch 5, batch 19400, loss[loss=0.2567, simple_loss=0.3259, pruned_loss=0.09378, over 21078.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.323, pruned_loss=0.08414, over 4276951.55 frames. 
], batch size: 608, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:34:49,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=848274.0, ans=0.1 2023-06-21 01:35:10,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=848334.0, ans=0.1 2023-06-21 01:35:25,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.765e+02 3.064e+02 3.598e+02 5.687e+02, threshold=6.129e+02, percent-clipped=0.0 2023-06-21 01:35:38,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=848394.0, ans=0.125 2023-06-21 01:35:56,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=848454.0, ans=0.125 2023-06-21 01:36:22,713 INFO [train.py:996] (3/4) Epoch 5, batch 19450, loss[loss=0.2474, simple_loss=0.3134, pruned_loss=0.09066, over 21802.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3195, pruned_loss=0.08649, over 4275215.83 frames. ], batch size: 118, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:36:30,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=848574.0, ans=0.0 2023-06-21 01:36:50,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=848634.0, ans=0.125 2023-06-21 01:37:58,156 INFO [train.py:996] (3/4) Epoch 5, batch 19500, loss[loss=0.2393, simple_loss=0.3038, pruned_loss=0.08735, over 21658.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3147, pruned_loss=0.08794, over 4263338.13 frames. ], batch size: 298, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:38:44,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 2.810e+02 3.330e+02 3.940e+02 7.380e+02, threshold=6.661e+02, percent-clipped=6.0 2023-06-21 01:39:36,258 INFO [train.py:996] (3/4) Epoch 5, batch 19550, loss[loss=0.2146, simple_loss=0.3133, pruned_loss=0.058, over 21832.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3097, pruned_loss=0.08493, over 4266152.61 frames. ], batch size: 371, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 01:39:50,660 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:39:57,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-21 01:40:13,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=849234.0, ans=0.2 2023-06-21 01:40:21,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=849294.0, ans=0.07 2023-06-21 01:40:44,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=849354.0, ans=0.125 2023-06-21 01:41:19,065 INFO [train.py:996] (3/4) Epoch 5, batch 19600, loss[loss=0.2585, simple_loss=0.3195, pruned_loss=0.09873, over 21910.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3129, pruned_loss=0.08678, over 4275250.63 frames. 
], batch size: 316, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:41:55,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=849594.0, ans=0.125 2023-06-21 01:41:59,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=849594.0, ans=0.1 2023-06-21 01:42:00,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.061e+02 3.495e+02 4.046e+02 6.477e+02, threshold=6.990e+02, percent-clipped=0.0 2023-06-21 01:42:04,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-21 01:42:23,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=849654.0, ans=0.125 2023-06-21 01:42:24,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-06-21 01:42:44,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=849714.0, ans=0.2 2023-06-21 01:42:53,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=849714.0, ans=0.125 2023-06-21 01:42:57,938 INFO [train.py:996] (3/4) Epoch 5, batch 19650, loss[loss=0.2607, simple_loss=0.3356, pruned_loss=0.09295, over 20014.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3177, pruned_loss=0.09063, over 4273543.68 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:43:50,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=849894.0, ans=0.025 2023-06-21 01:44:46,866 INFO [train.py:996] (3/4) Epoch 5, batch 19700, loss[loss=0.2311, simple_loss=0.3203, pruned_loss=0.07095, over 21742.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3193, pruned_loss=0.0908, over 4273657.69 frames. ], batch size: 332, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:45:12,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-21 01:45:34,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.950e+02 3.404e+02 4.157e+02 1.102e+03, threshold=6.808e+02, percent-clipped=4.0 2023-06-21 01:45:55,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=850254.0, ans=0.1 2023-06-21 01:46:27,922 INFO [train.py:996] (3/4) Epoch 5, batch 19750, loss[loss=0.2631, simple_loss=0.3474, pruned_loss=0.08942, over 21833.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3287, pruned_loss=0.092, over 4277588.53 frames. 
], batch size: 298, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:46:31,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=850374.0, ans=0.125 2023-06-21 01:46:54,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=850434.0, ans=0.0 2023-06-21 01:47:39,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=850554.0, ans=0.125 2023-06-21 01:48:02,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=850614.0, ans=0.05 2023-06-21 01:48:04,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=850614.0, ans=0.0 2023-06-21 01:48:06,769 INFO [train.py:996] (3/4) Epoch 5, batch 19800, loss[loss=0.2842, simple_loss=0.3386, pruned_loss=0.1149, over 21934.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3294, pruned_loss=0.09252, over 4279193.19 frames. ], batch size: 351, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:48:10,504 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:48:26,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=850734.0, ans=0.125 2023-06-21 01:48:35,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-21 01:48:54,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.157e+02 4.058e+02 5.975e+02 1.111e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 01:49:52,583 INFO [train.py:996] (3/4) Epoch 5, batch 19850, loss[loss=0.2004, simple_loss=0.2786, pruned_loss=0.06109, over 21605.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3216, pruned_loss=0.08768, over 4278925.48 frames. ], batch size: 263, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:50:18,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=851034.0, ans=0.2 2023-06-21 01:50:51,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=851094.0, ans=0.125 2023-06-21 01:51:08,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=851154.0, ans=0.0 2023-06-21 01:51:13,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851214.0, ans=0.125 2023-06-21 01:51:16,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=851214.0, ans=0.1 2023-06-21 01:51:32,064 INFO [train.py:996] (3/4) Epoch 5, batch 19900, loss[loss=0.2288, simple_loss=0.3213, pruned_loss=0.06811, over 21768.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.322, pruned_loss=0.08482, over 4276046.44 frames. 
], batch size: 351, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:52:01,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=851334.0, ans=0.2 2023-06-21 01:52:19,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.633e+02 2.856e+02 3.289e+02 5.435e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-21 01:53:08,174 INFO [train.py:996] (3/4) Epoch 5, batch 19950, loss[loss=0.2391, simple_loss=0.2911, pruned_loss=0.09355, over 21630.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3158, pruned_loss=0.08445, over 4271670.56 frames. ], batch size: 282, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:54:10,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.25 vs. limit=15.0 2023-06-21 01:54:15,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=22.5 2023-06-21 01:54:18,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-21 01:54:21,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-21 01:54:26,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=851814.0, ans=0.125 2023-06-21 01:54:30,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=851814.0, ans=0.0 2023-06-21 01:54:44,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-21 01:54:46,759 INFO [train.py:996] (3/4) Epoch 5, batch 20000, loss[loss=0.2635, simple_loss=0.3301, pruned_loss=0.09843, over 21418.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3181, pruned_loss=0.08512, over 4259045.94 frames. ], batch size: 159, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:54:59,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=851874.0, ans=0.125 2023-06-21 01:55:38,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.174e+02 3.672e+02 4.869e+02 7.405e+02, threshold=7.343e+02, percent-clipped=12.0 2023-06-21 01:56:02,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852054.0, ans=0.0 2023-06-21 01:56:25,457 INFO [train.py:996] (3/4) Epoch 5, batch 20050, loss[loss=0.2756, simple_loss=0.3403, pruned_loss=0.1054, over 21851.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3213, pruned_loss=0.08881, over 4272279.40 frames. 
], batch size: 414, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:56:45,327 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:56:50,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=852234.0, ans=0.0 2023-06-21 01:57:01,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=852234.0, ans=0.125 2023-06-21 01:58:11,252 INFO [train.py:996] (3/4) Epoch 5, batch 20100, loss[loss=0.2927, simple_loss=0.3564, pruned_loss=0.1145, over 21603.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.323, pruned_loss=0.09122, over 4276050.46 frames. ], batch size: 471, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:14,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=852474.0, ans=0.125 2023-06-21 01:58:18,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=852474.0, ans=0.125 2023-06-21 01:58:34,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-06-21 01:58:48,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=852534.0, ans=0.0 2023-06-21 01:58:58,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.825e+02 3.167e+02 3.914e+02 6.858e+02, threshold=6.334e+02, percent-clipped=0.0 2023-06-21 01:59:39,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=852714.0, ans=0.125 2023-06-21 01:59:51,867 INFO [train.py:996] (3/4) Epoch 5, batch 20150, loss[loss=0.2988, simple_loss=0.3595, pruned_loss=0.119, over 21945.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3332, pruned_loss=0.0951, over 4271112.91 frames. ], batch size: 316, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:00:23,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=852834.0, ans=15.0 2023-06-21 02:00:32,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=852834.0, ans=10.0 2023-06-21 02:00:37,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=852894.0, ans=0.0 2023-06-21 02:00:57,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-21 02:01:29,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=853014.0, ans=0.0 2023-06-21 02:01:44,364 INFO [train.py:996] (3/4) Epoch 5, batch 20200, loss[loss=0.3134, simple_loss=0.4105, pruned_loss=0.1082, over 21245.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3388, pruned_loss=0.0983, over 4268715.53 frames. 
], batch size: 548, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:02:09,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=853134.0, ans=0.0 2023-06-21 02:02:28,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.103e+02 3.828e+02 4.837e+02 8.948e+02, threshold=7.656e+02, percent-clipped=6.0 2023-06-21 02:03:24,893 INFO [train.py:996] (3/4) Epoch 5, batch 20250, loss[loss=0.2285, simple_loss=0.2986, pruned_loss=0.07916, over 21256.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3393, pruned_loss=0.09636, over 4272227.64 frames. ], batch size: 143, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:03:54,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=853434.0, ans=0.0 2023-06-21 02:04:41,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=853614.0, ans=0.1 2023-06-21 02:04:56,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=853614.0, ans=0.125 2023-06-21 02:05:03,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=853674.0, ans=0.0 2023-06-21 02:05:04,407 INFO [train.py:996] (3/4) Epoch 5, batch 20300, loss[loss=0.2641, simple_loss=0.3302, pruned_loss=0.09897, over 21078.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3362, pruned_loss=0.09345, over 4268846.65 frames. ], batch size: 143, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:05:04,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=853674.0, ans=0.2 2023-06-21 02:05:32,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=853734.0, ans=0.1 2023-06-21 02:05:37,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=853734.0, ans=0.1 2023-06-21 02:05:51,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.734e+02 3.068e+02 3.713e+02 6.256e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-21 02:05:52,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-21 02:06:06,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=853854.0, ans=0.125 2023-06-21 02:06:21,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=853914.0, ans=0.125 2023-06-21 02:06:36,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-21 02:06:41,901 INFO [train.py:996] (3/4) Epoch 5, batch 20350, loss[loss=0.3248, simple_loss=0.3715, pruned_loss=0.139, over 21548.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3365, pruned_loss=0.09436, over 4260006.10 frames. 
], batch size: 507, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:08:09,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=854214.0, ans=0.1 2023-06-21 02:08:17,291 INFO [train.py:996] (3/4) Epoch 5, batch 20400, loss[loss=0.2509, simple_loss=0.3237, pruned_loss=0.08899, over 21227.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3384, pruned_loss=0.09654, over 4259583.40 frames. ], batch size: 176, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:08:44,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-21 02:08:48,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=854334.0, ans=0.0 2023-06-21 02:09:05,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.145e+02 3.695e+02 4.616e+02 6.973e+02, threshold=7.390e+02, percent-clipped=6.0 2023-06-21 02:09:38,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=854454.0, ans=0.125 2023-06-21 02:09:50,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=14.76 vs. limit=15.0 2023-06-21 02:09:56,706 INFO [train.py:996] (3/4) Epoch 5, batch 20450, loss[loss=0.2521, simple_loss=0.3095, pruned_loss=0.09732, over 21433.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3386, pruned_loss=0.09883, over 4256263.46 frames. ], batch size: 194, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:10:00,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=854574.0, ans=0.125 2023-06-21 02:11:03,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-21 02:11:34,421 INFO [train.py:996] (3/4) Epoch 5, batch 20500, loss[loss=0.2387, simple_loss=0.2981, pruned_loss=0.08969, over 21761.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3343, pruned_loss=0.09941, over 4259780.91 frames. ], batch size: 316, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:11:38,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-21 02:12:17,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.857e+02 3.263e+02 3.907e+02 6.416e+02, threshold=6.525e+02, percent-clipped=0.0 2023-06-21 02:12:37,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=855054.0, ans=0.07 2023-06-21 02:12:42,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855054.0, ans=0.0 2023-06-21 02:12:55,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=855114.0, ans=0.0 2023-06-21 02:13:09,819 INFO [train.py:996] (3/4) Epoch 5, batch 20550, loss[loss=0.2904, simple_loss=0.3663, pruned_loss=0.1073, over 21461.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3264, pruned_loss=0.09712, over 4254217.43 frames. 
], batch size: 473, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:13:37,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=8.0 2023-06-21 02:13:40,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=855234.0, ans=0.125 2023-06-21 02:13:45,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=855294.0, ans=0.1 2023-06-21 02:14:32,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=855414.0, ans=0.125 2023-06-21 02:14:46,284 INFO [train.py:996] (3/4) Epoch 5, batch 20600, loss[loss=0.2395, simple_loss=0.3125, pruned_loss=0.08322, over 21818.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3294, pruned_loss=0.0951, over 4265267.95 frames. ], batch size: 298, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:14:46,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=855474.0, ans=0.125 2023-06-21 02:14:59,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855474.0, ans=0.125 2023-06-21 02:15:15,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-21 02:15:28,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=855594.0, ans=0.1 2023-06-21 02:15:35,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 2.804e+02 3.280e+02 3.757e+02 7.089e+02, threshold=6.559e+02, percent-clipped=1.0 2023-06-21 02:16:08,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=855714.0, ans=0.1 2023-06-21 02:16:13,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=855714.0, ans=0.125 2023-06-21 02:16:26,172 INFO [train.py:996] (3/4) Epoch 5, batch 20650, loss[loss=0.2592, simple_loss=0.3182, pruned_loss=0.1001, over 21798.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3256, pruned_loss=0.09582, over 4257023.69 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:16:53,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855834.0, ans=0.1 2023-06-21 02:16:58,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=855834.0, ans=0.125 2023-06-21 02:17:06,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=855894.0, ans=0.0 2023-06-21 02:18:06,936 INFO [train.py:996] (3/4) Epoch 5, batch 20700, loss[loss=0.1935, simple_loss=0.2596, pruned_loss=0.06369, over 21754.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3185, pruned_loss=0.09179, over 4264651.12 frames. 
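
The Whitening lines (scaling.py:962) fire when a module's activations drift away from an isotropic covariance: the metric is about 1.0 for "white" features and the constraint only engages once it crosses the printed limit (e.g. metric=5.47 vs. limit=8.0 for encoder_embed.out_whiten above). The following is a hedged paraphrase of such a metric, not the exact icefall code.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
        """Mean squared eigenvalue of the per-group covariance divided by the
        squared mean eigenvalue; equals 1.0 for perfectly white features."""
        num_frames, num_channels = x.shape
        cpg = num_channels // num_groups                       # channels per group
        g = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
        covar = torch.matmul(g.transpose(1, 2), g) / num_frames
        mean_eig = covar.diagonal(dim1=1, dim2=2).mean()       # mean eigenvalue = mean diag
        mean_sq_eig = (covar ** 2).sum() / (num_groups * cpg)  # squared Frobenius / dim
        return mean_sq_eig / (mean_eig ** 2 + 1e-20)

    x = torch.randn(2000, 256)
    print(float(whitening_metric(x, num_groups=1)))            # close to 1.0 for white noise
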
], batch size: 124, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:18:47,539 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:18:57,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.666e+02 3.123e+02 3.714e+02 6.425e+02, threshold=6.247e+02, percent-clipped=0.0 2023-06-21 02:19:40,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=856314.0, ans=0.0 2023-06-21 02:19:53,126 INFO [train.py:996] (3/4) Epoch 5, batch 20750, loss[loss=0.3239, simple_loss=0.4152, pruned_loss=0.1163, over 21681.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3176, pruned_loss=0.08907, over 4256831.01 frames. ], batch size: 389, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:20:01,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=856374.0, ans=0.0 2023-06-21 02:20:21,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-21 02:20:41,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=856494.0, ans=0.125 2023-06-21 02:21:03,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856554.0, ans=0.1 2023-06-21 02:21:08,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=856614.0, ans=0.0 2023-06-21 02:21:29,579 INFO [train.py:996] (3/4) Epoch 5, batch 20800, loss[loss=0.2982, simple_loss=0.3461, pruned_loss=0.1251, over 21340.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3228, pruned_loss=0.09128, over 4254844.01 frames. ], batch size: 507, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:21:48,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=856674.0, ans=0.0 2023-06-21 02:21:52,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.91 vs. limit=6.0 2023-06-21 02:21:53,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-21 02:22:22,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-21 02:22:25,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.108e+02 3.901e+02 5.599e+02 9.709e+02, threshold=7.803e+02, percent-clipped=19.0 2023-06-21 02:22:26,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.39 vs. 
limit=15.0 2023-06-21 02:22:27,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=856794.0, ans=0.125 2023-06-21 02:22:47,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=856914.0, ans=0.2 2023-06-21 02:23:08,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=856974.0, ans=0.125 2023-06-21 02:23:14,829 INFO [train.py:996] (3/4) Epoch 5, batch 20850, loss[loss=0.2039, simple_loss=0.2791, pruned_loss=0.06431, over 21859.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3144, pruned_loss=0.08881, over 4251417.36 frames. ], batch size: 333, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:23:15,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=856974.0, ans=0.0 2023-06-21 02:23:39,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=857034.0, ans=0.125 2023-06-21 02:24:55,582 INFO [train.py:996] (3/4) Epoch 5, batch 20900, loss[loss=0.2474, simple_loss=0.3274, pruned_loss=0.0837, over 21767.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3169, pruned_loss=0.09006, over 4255838.00 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:25:00,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=857274.0, ans=0.125 2023-06-21 02:25:02,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=857274.0, ans=0.0 2023-06-21 02:25:15,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-21 02:25:29,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=857334.0, ans=0.2 2023-06-21 02:25:44,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.849e+02 3.550e+02 4.829e+02 8.716e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 02:25:51,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=857454.0, ans=0.125 2023-06-21 02:25:57,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=857454.0, ans=0.125 2023-06-21 02:25:58,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-21 02:26:23,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-21 02:26:24,221 INFO [train.py:996] (3/4) Epoch 5, batch 20950, loss[loss=0.2069, simple_loss=0.2809, pruned_loss=0.06644, over 21725.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3139, pruned_loss=0.086, over 4264547.70 frames. 
], batch size: 282, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:26:24,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=857574.0, ans=0.125 2023-06-21 02:26:45,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=857574.0, ans=0.0 2023-06-21 02:27:11,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.74 vs. limit=10.0 2023-06-21 02:27:36,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=857754.0, ans=0.125 2023-06-21 02:27:51,961 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:28:01,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857874.0, ans=0.1 2023-06-21 02:28:02,747 INFO [train.py:996] (3/4) Epoch 5, batch 21000, loss[loss=0.2591, simple_loss=0.3216, pruned_loss=0.09831, over 21759.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3122, pruned_loss=0.08603, over 4264122.02 frames. ], batch size: 389, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:28:02,747 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 02:28:23,290 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2707, simple_loss=0.3706, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-21 02:28:23,291 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 02:28:25,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=857874.0, ans=0.95 2023-06-21 02:29:01,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857994.0, ans=0.1 2023-06-21 02:29:08,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.446e+02 2.941e+02 3.372e+02 5.847e+02, threshold=5.881e+02, percent-clipped=0.0 2023-06-21 02:29:52,530 INFO [train.py:996] (3/4) Epoch 5, batch 21050, loss[loss=0.215, simple_loss=0.2753, pruned_loss=0.07738, over 21282.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3098, pruned_loss=0.08677, over 4264896.44 frames. ], batch size: 159, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:29:53,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.64 vs. limit=15.0 2023-06-21 02:30:08,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=858234.0, ans=0.035 2023-06-21 02:30:16,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=858234.0, ans=0.95 2023-06-21 02:30:58,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=12.0 2023-06-21 02:30:58,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=858354.0, ans=0.0 2023-06-21 02:31:07,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. 
limit=6.0 2023-06-21 02:31:13,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=858414.0, ans=0.0 2023-06-21 02:31:31,896 INFO [train.py:996] (3/4) Epoch 5, batch 21100, loss[loss=0.2474, simple_loss=0.2952, pruned_loss=0.09979, over 21316.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3059, pruned_loss=0.0861, over 4262791.26 frames. ], batch size: 160, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:31:36,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-21 02:31:44,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=858474.0, ans=0.0 2023-06-21 02:32:23,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.646e+02 3.104e+02 3.741e+02 7.727e+02, threshold=6.208e+02, percent-clipped=4.0 2023-06-21 02:32:32,583 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:32:50,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=858714.0, ans=0.0 2023-06-21 02:33:10,727 INFO [train.py:996] (3/4) Epoch 5, batch 21150, loss[loss=0.2132, simple_loss=0.2558, pruned_loss=0.08529, over 20816.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3032, pruned_loss=0.08709, over 4259633.13 frames. ], batch size: 609, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:33:26,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=858834.0, ans=0.2 2023-06-21 02:34:24,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=859014.0, ans=0.125 2023-06-21 02:34:28,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=859014.0, ans=10.0 2023-06-21 02:34:40,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=859014.0, ans=0.0 2023-06-21 02:34:49,178 INFO [train.py:996] (3/4) Epoch 5, batch 21200, loss[loss=0.2279, simple_loss=0.2907, pruned_loss=0.08257, over 21765.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.2994, pruned_loss=0.08639, over 4265934.84 frames. ], batch size: 371, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:35:03,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=859134.0, ans=0.0 2023-06-21 02:35:42,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.556e+02 2.983e+02 3.477e+02 7.677e+02, threshold=5.965e+02, percent-clipped=1.0 2023-06-21 02:35:47,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=859254.0, ans=0.125 2023-06-21 02:35:56,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=859254.0, ans=0.0 2023-06-21 02:36:02,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=15.0 2023-06-21 02:36:04,677 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:36:20,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-21 02:36:30,217 INFO [train.py:996] (3/4) Epoch 5, batch 21250, loss[loss=0.2366, simple_loss=0.3021, pruned_loss=0.08551, over 21326.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.2969, pruned_loss=0.08605, over 4267192.65 frames. ], batch size: 131, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:36:47,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=859434.0, ans=0.1 2023-06-21 02:36:58,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=859434.0, ans=0.0 2023-06-21 02:37:05,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=859494.0, ans=0.125 2023-06-21 02:37:17,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-21 02:37:25,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=859494.0, ans=0.0 2023-06-21 02:37:52,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=859614.0, ans=0.0 2023-06-21 02:38:09,622 INFO [train.py:996] (3/4) Epoch 5, batch 21300, loss[loss=0.2977, simple_loss=0.3462, pruned_loss=0.1246, over 21949.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3034, pruned_loss=0.0882, over 4264824.48 frames. ], batch size: 113, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:38:47,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=859794.0, ans=0.125 2023-06-21 02:38:55,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=859794.0, ans=0.2 2023-06-21 02:39:02,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 2.969e+02 3.329e+02 4.486e+02 8.975e+02, threshold=6.657e+02, percent-clipped=6.0 2023-06-21 02:39:16,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=859854.0, ans=0.0 2023-06-21 02:39:26,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=859914.0, ans=0.1 2023-06-21 02:39:37,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=859914.0, ans=0.0 2023-06-21 02:39:50,017 INFO [train.py:996] (3/4) Epoch 5, batch 21350, loss[loss=0.2156, simple_loss=0.3007, pruned_loss=0.06528, over 21756.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3074, pruned_loss=0.08827, over 4265757.97 frames. ], batch size: 282, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:39:54,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=15.0 2023-06-21 02:40:03,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-21 02:40:33,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=860094.0, ans=0.125 2023-06-21 02:40:57,934 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:41:11,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=860214.0, ans=0.035 2023-06-21 02:41:24,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-21 02:41:29,860 INFO [train.py:996] (3/4) Epoch 5, batch 21400, loss[loss=0.2511, simple_loss=0.335, pruned_loss=0.08361, over 21617.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3119, pruned_loss=0.08856, over 4265089.82 frames. ], batch size: 414, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:42:16,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=860394.0, ans=0.0 2023-06-21 02:42:19,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=860394.0, ans=0.1 2023-06-21 02:42:22,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.768e+02 3.163e+02 3.686e+02 6.049e+02, threshold=6.326e+02, percent-clipped=0.0 2023-06-21 02:42:37,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-21 02:43:09,463 INFO [train.py:996] (3/4) Epoch 5, batch 21450, loss[loss=0.2517, simple_loss=0.3209, pruned_loss=0.09126, over 21878.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3156, pruned_loss=0.09007, over 4275830.51 frames. ], batch size: 414, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:43:12,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.14 vs. limit=15.0 2023-06-21 02:43:26,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-21 02:43:44,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=860634.0, ans=0.125 2023-06-21 02:43:54,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=860694.0, ans=0.125 2023-06-21 02:44:00,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=860694.0, ans=0.0 2023-06-21 02:44:01,969 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:44:48,127 INFO [train.py:996] (3/4) Epoch 5, batch 21500, loss[loss=0.2395, simple_loss=0.2962, pruned_loss=0.09137, over 21630.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.313, pruned_loss=0.09044, over 4271896.30 frames. 
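
Every valid_interval batches, train.py pauses training for a validation pass, as in the batch-21000 entries above ("Computing validation loss" followed by "validation: loss=0.2707 ... over 1796401.00 frames." and the peak-memory report). In outline it looks like the sketch below; compute_loss and valid_dl are hypothetical stand-ins for the real train.py helpers.

    import torch

    def maybe_validate(model, valid_dl, batch_idx_train, valid_interval=3000):
        if batch_idx_train % valid_interval != 0:
            return
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)  # hypothetical helper
                tot_loss += float(loss) * num_frames
                tot_frames += num_frames
        model.train()
        print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
        peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {peak_mb}MB")
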
], batch size: 298, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:45:30,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=860994.0, ans=0.02 2023-06-21 02:45:35,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-21 02:45:39,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=860994.0, ans=0.025 2023-06-21 02:45:40,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 3.006e+02 3.483e+02 4.242e+02 6.315e+02, threshold=6.966e+02, percent-clipped=0.0 2023-06-21 02:46:07,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=861054.0, ans=0.2 2023-06-21 02:46:26,521 INFO [train.py:996] (3/4) Epoch 5, batch 21550, loss[loss=0.2541, simple_loss=0.3152, pruned_loss=0.09651, over 21838.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3069, pruned_loss=0.0879, over 4275054.57 frames. ], batch size: 107, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:47:11,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=861294.0, ans=0.2 2023-06-21 02:47:23,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=861354.0, ans=0.0 2023-06-21 02:47:34,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=861354.0, ans=0.125 2023-06-21 02:47:43,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-21 02:48:06,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=861474.0, ans=0.0 2023-06-21 02:48:07,445 INFO [train.py:996] (3/4) Epoch 5, batch 21600, loss[loss=0.1862, simple_loss=0.2529, pruned_loss=0.05978, over 20713.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3047, pruned_loss=0.08672, over 4272768.75 frames. ], batch size: 607, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:49:04,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=861594.0, ans=0.125 2023-06-21 02:49:05,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.906e+02 3.367e+02 4.103e+02 7.141e+02, threshold=6.734e+02, percent-clipped=1.0 2023-06-21 02:49:25,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=861654.0, ans=0.1 2023-06-21 02:49:28,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-21 02:49:45,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=861774.0, ans=0.0 2023-06-21 02:49:46,656 INFO [train.py:996] (3/4) Epoch 5, batch 21650, loss[loss=0.1743, simple_loss=0.2436, pruned_loss=0.05246, over 21466.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3087, pruned_loss=0.08567, over 4275480.72 frames. 
], batch size: 212, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:50:12,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=861834.0, ans=0.125 2023-06-21 02:50:23,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=861834.0, ans=0.125 2023-06-21 02:50:27,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:50:42,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=861894.0, ans=0.035 2023-06-21 02:50:58,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=861954.0, ans=0.0 2023-06-21 02:51:00,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-21 02:51:01,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=861954.0, ans=0.125 2023-06-21 02:51:26,175 INFO [train.py:996] (3/4) Epoch 5, batch 21700, loss[loss=0.2346, simple_loss=0.2863, pruned_loss=0.09148, over 20742.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3081, pruned_loss=0.08401, over 4271473.85 frames. ], batch size: 608, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:51:52,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=862134.0, ans=0.125 2023-06-21 02:52:02,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862134.0, ans=0.1 2023-06-21 02:52:18,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-21 02:52:22,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.607e+02 2.964e+02 3.424e+02 5.516e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-21 02:53:00,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-21 02:53:10,465 INFO [train.py:996] (3/4) Epoch 5, batch 21750, loss[loss=0.259, simple_loss=0.3137, pruned_loss=0.1022, over 21845.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3043, pruned_loss=0.08415, over 4264459.91 frames. 
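
With use_fp16 enabled, the grad_scale column is the dynamic loss scale, which is why it moves in powers of two across these entries (16.0 -> 32.0 -> 16.0 -> 32.0): back off when overflows or instability appear, try doubling again after a stretch of clean steps. A toy version of that update rule; the growth interval is an assumption, and this is not the exact icefall or torch.cuda.amp behavior.

    def update_grad_scale(scale, found_inf, steps_since_growth, growth_interval=1000):
        """Toy dynamic loss-scale update for fp16 training."""
        if found_inf:
            return scale * 0.5, 0           # overflow: halve and restart the counter
        steps_since_growth += 1
        if steps_since_growth >= growth_interval:
            return scale * 2.0, 0           # long stable run: try a larger scale
        return scale, steps_since_growth
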
], batch size: 107, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:53:28,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=862374.0, ans=0.2 2023-06-21 02:53:59,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=862494.0, ans=0.0 2023-06-21 02:54:16,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=862554.0, ans=0.125 2023-06-21 02:54:16,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=862554.0, ans=0.2 2023-06-21 02:54:19,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862554.0, ans=0.125 2023-06-21 02:54:49,584 INFO [train.py:996] (3/4) Epoch 5, batch 21800, loss[loss=0.2092, simple_loss=0.2646, pruned_loss=0.07694, over 21581.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.303, pruned_loss=0.08539, over 4254881.77 frames. ], batch size: 230, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:55:15,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=862734.0, ans=0.07 2023-06-21 02:55:18,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=862734.0, ans=0.125 2023-06-21 02:55:43,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.757e+02 3.112e+02 3.604e+02 5.308e+02, threshold=6.224e+02, percent-clipped=0.0 2023-06-21 02:55:48,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862854.0, ans=0.1 2023-06-21 02:56:29,371 INFO [train.py:996] (3/4) Epoch 5, batch 21850, loss[loss=0.2686, simple_loss=0.3579, pruned_loss=0.08963, over 19744.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3075, pruned_loss=0.08559, over 4229575.89 frames. ], batch size: 702, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:56:49,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=863034.0, ans=0.125 2023-06-21 02:57:03,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-21 02:57:05,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863034.0, ans=0.1 2023-06-21 02:57:34,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=863154.0, ans=0.0 2023-06-21 02:58:08,164 INFO [train.py:996] (3/4) Epoch 5, batch 21900, loss[loss=0.213, simple_loss=0.2707, pruned_loss=0.07761, over 21227.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3077, pruned_loss=0.08699, over 4242687.82 frames. ], batch size: 176, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:58:27,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-21 02:58:34,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-06-21 02:58:54,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-21 02:59:00,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=863394.0, ans=0.125 2023-06-21 02:59:06,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.823e+02 3.229e+02 3.710e+02 5.018e+02, threshold=6.457e+02, percent-clipped=0.0 2023-06-21 02:59:10,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=863454.0, ans=0.125 2023-06-21 02:59:15,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-21 02:59:21,896 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:59:46,630 INFO [train.py:996] (3/4) Epoch 5, batch 21950, loss[loss=0.2126, simple_loss=0.2909, pruned_loss=0.06712, over 21510.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3029, pruned_loss=0.08572, over 4255287.68 frames. ], batch size: 441, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:59:48,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=863574.0, ans=0.0 2023-06-21 03:00:04,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=863574.0, ans=0.125 2023-06-21 03:01:27,626 INFO [train.py:996] (3/4) Epoch 5, batch 22000, loss[loss=0.2275, simple_loss=0.2884, pruned_loss=0.08329, over 21544.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2982, pruned_loss=0.08317, over 4249889.72 frames. ], batch size: 212, lr: 6.08e-03, grad_scale: 32.0 2023-06-21 03:01:44,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=863874.0, ans=0.0 2023-06-21 03:01:47,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=863934.0, ans=0.125 2023-06-21 03:01:55,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=863934.0, ans=0.0 2023-06-21 03:02:06,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=863934.0, ans=0.2 2023-06-21 03:02:09,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=863994.0, ans=0.125 2023-06-21 03:02:29,958 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.512e+02 2.928e+02 3.420e+02 5.826e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-21 03:03:08,612 INFO [train.py:996] (3/4) Epoch 5, batch 22050, loss[loss=0.2353, simple_loss=0.306, pruned_loss=0.08227, over 21381.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3029, pruned_loss=0.08398, over 4244264.38 frames. 
], batch size: 176, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:03:42,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=864234.0, ans=0.125 2023-06-21 03:03:45,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=864234.0, ans=0.1 2023-06-21 03:04:52,756 INFO [train.py:996] (3/4) Epoch 5, batch 22100, loss[loss=0.3257, simple_loss=0.3973, pruned_loss=0.127, over 21279.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3149, pruned_loss=0.08924, over 4242713.26 frames. ], batch size: 549, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:04:57,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=864474.0, ans=0.0 2023-06-21 03:05:12,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=864534.0, ans=0.2 2023-06-21 03:05:26,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=864534.0, ans=0.1 2023-06-21 03:05:30,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-21 03:05:48,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.305e+02 3.693e+02 4.258e+02 6.395e+02, threshold=7.386e+02, percent-clipped=3.0 2023-06-21 03:06:03,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=864654.0, ans=0.125 2023-06-21 03:06:06,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=864714.0, ans=0.125 2023-06-21 03:06:31,556 INFO [train.py:996] (3/4) Epoch 5, batch 22150, loss[loss=0.2749, simple_loss=0.3357, pruned_loss=0.1071, over 21892.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3205, pruned_loss=0.09238, over 4254250.14 frames. ], batch size: 124, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:06:49,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-21 03:07:28,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=864954.0, ans=0.2 2023-06-21 03:07:58,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=865014.0, ans=0.125 2023-06-21 03:08:10,572 INFO [train.py:996] (3/4) Epoch 5, batch 22200, loss[loss=0.2355, simple_loss=0.3111, pruned_loss=0.07992, over 21890.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3232, pruned_loss=0.09304, over 4260206.56 frames. 
], batch size: 351, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:08:11,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865074.0, ans=0.1 2023-06-21 03:08:53,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=865194.0, ans=0.0 2023-06-21 03:09:09,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 3.019e+02 3.347e+02 3.956e+02 6.093e+02, threshold=6.693e+02, percent-clipped=0.0 2023-06-21 03:09:31,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=865314.0, ans=0.0 2023-06-21 03:09:42,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=865314.0, ans=0.0 2023-06-21 03:09:57,875 INFO [train.py:996] (3/4) Epoch 5, batch 22250, loss[loss=0.2771, simple_loss=0.3526, pruned_loss=0.1008, over 21841.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3281, pruned_loss=0.09445, over 4265657.51 frames. ], batch size: 107, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:10:01,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 03:10:22,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=865434.0, ans=0.125 2023-06-21 03:11:26,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=865614.0, ans=0.0 2023-06-21 03:11:32,222 INFO [train.py:996] (3/4) Epoch 5, batch 22300, loss[loss=0.2773, simple_loss=0.3376, pruned_loss=0.1084, over 21762.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3301, pruned_loss=0.09665, over 4277654.37 frames. ], batch size: 441, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:11:46,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=865674.0, ans=0.0 2023-06-21 03:12:19,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=865794.0, ans=0.0 2023-06-21 03:12:20,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=865794.0, ans=0.95 2023-06-21 03:12:22,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=865794.0, ans=0.125 2023-06-21 03:12:27,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 3.190e+02 3.753e+02 5.122e+02 1.002e+03, threshold=7.506e+02, percent-clipped=11.0 2023-06-21 03:12:38,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=865854.0, ans=0.2 2023-06-21 03:13:14,445 INFO [train.py:996] (3/4) Epoch 5, batch 22350, loss[loss=0.2535, simple_loss=0.3304, pruned_loss=0.08826, over 16859.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3281, pruned_loss=0.09745, over 4280489.14 frames. 
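
The batch size printed with each summary swings widely (from around 100 up to 702 in this stretch) because the DynamicBucketingSampler fills each batch up to a fixed total audio duration (max_duration: 900 seconds in this run) rather than to a fixed utterance count, so buckets of short cuts pack far more utterances per batch. A toy duration-capped batcher over (cut_id, duration) pairs, ignoring the bucketing itself:

    def duration_batches(cuts, max_duration=900.0):
        """Yield batches of cut ids whose total duration stays under max_duration."""
        batch, total = [], 0.0
        for cut_id, dur in cuts:        # assume cuts arrive bucketed by similar length
            if batch and total + dur > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut_id)
            total += dur
        if batch:
            yield batch
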
], batch size: 60, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:13:16,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=865974.0, ans=0.015 2023-06-21 03:13:26,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=865974.0, ans=0.0 2023-06-21 03:13:45,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=866034.0, ans=0.1 2023-06-21 03:13:54,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=866094.0, ans=0.025 2023-06-21 03:14:09,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=866154.0, ans=0.125 2023-06-21 03:14:16,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=866154.0, ans=0.125 2023-06-21 03:14:21,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=866154.0, ans=0.0 2023-06-21 03:14:53,822 INFO [train.py:996] (3/4) Epoch 5, batch 22400, loss[loss=0.1989, simple_loss=0.2814, pruned_loss=0.05818, over 21737.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3241, pruned_loss=0.09405, over 4283189.36 frames. ], batch size: 282, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:15:31,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-21 03:15:44,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=866394.0, ans=22.5 2023-06-21 03:15:45,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.706e+02 3.089e+02 3.768e+02 7.797e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-21 03:15:46,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-21 03:16:22,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=866514.0, ans=0.0 2023-06-21 03:16:32,977 INFO [train.py:996] (3/4) Epoch 5, batch 22450, loss[loss=0.2508, simple_loss=0.3059, pruned_loss=0.09779, over 21794.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3171, pruned_loss=0.09186, over 4279984.73 frames. ], batch size: 112, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:16:46,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=866574.0, ans=0.125 2023-06-21 03:16:49,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=866634.0, ans=0.125 2023-06-21 03:18:14,411 INFO [train.py:996] (3/4) Epoch 5, batch 22500, loss[loss=0.2311, simple_loss=0.3065, pruned_loss=0.0779, over 21380.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3111, pruned_loss=0.09054, over 4278604.09 frames. 
], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:19:05,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.870e+02 3.254e+02 4.012e+02 8.224e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-21 03:19:12,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=867054.0, ans=0.125 2023-06-21 03:19:25,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=867054.0, ans=0.125 2023-06-21 03:19:37,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-21 03:19:53,757 INFO [train.py:996] (3/4) Epoch 5, batch 22550, loss[loss=0.2358, simple_loss=0.3027, pruned_loss=0.0844, over 21855.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3167, pruned_loss=0.0912, over 4287906.54 frames. ], batch size: 282, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:20:11,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=867174.0, ans=0.07 2023-06-21 03:20:12,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-21 03:20:34,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-21 03:20:46,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=867294.0, ans=0.1 2023-06-21 03:21:03,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=867354.0, ans=0.125 2023-06-21 03:21:40,200 INFO [train.py:996] (3/4) Epoch 5, batch 22600, loss[loss=0.2584, simple_loss=0.333, pruned_loss=0.0919, over 21647.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3191, pruned_loss=0.09118, over 4291429.44 frames. ], batch size: 389, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:21:56,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=867534.0, ans=0.125 2023-06-21 03:21:56,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=867534.0, ans=0.2 2023-06-21 03:22:01,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=867534.0, ans=0.0 2023-06-21 03:22:07,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=867534.0, ans=0.125 2023-06-21 03:22:40,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.062e+02 3.535e+02 4.633e+02 8.415e+02, threshold=7.070e+02, percent-clipped=6.0 2023-06-21 03:23:18,518 INFO [train.py:996] (3/4) Epoch 5, batch 22650, loss[loss=0.253, simple_loss=0.2938, pruned_loss=0.1061, over 21324.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3166, pruned_loss=0.09099, over 4288729.86 frames. ], batch size: 507, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:23:36,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-21 03:23:39,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=867834.0, ans=0.125 2023-06-21 03:23:56,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=867894.0, ans=0.125 2023-06-21 03:24:43,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=868014.0, ans=0.1 2023-06-21 03:24:50,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=868074.0, ans=0.125 2023-06-21 03:24:57,499 INFO [train.py:996] (3/4) Epoch 5, batch 22700, loss[loss=0.2231, simple_loss=0.2846, pruned_loss=0.08083, over 21898.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3108, pruned_loss=0.09016, over 4283945.59 frames. ], batch size: 107, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:25:10,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=868074.0, ans=0.1 2023-06-21 03:25:15,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=868134.0, ans=0.125 2023-06-21 03:25:26,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=868134.0, ans=0.2 2023-06-21 03:25:53,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=868194.0, ans=0.125 2023-06-21 03:25:59,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.710e+02 3.113e+02 3.866e+02 5.786e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-21 03:26:23,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-21 03:26:30,975 INFO [train.py:996] (3/4) Epoch 5, batch 22750, loss[loss=0.2965, simple_loss=0.3494, pruned_loss=0.1218, over 21173.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3118, pruned_loss=0.09231, over 4270367.32 frames. ], batch size: 143, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:27:00,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=868434.0, ans=0.125 2023-06-21 03:27:37,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=868554.0, ans=0.0 2023-06-21 03:27:50,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=868554.0, ans=0.125 2023-06-21 03:27:55,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=868614.0, ans=0.0 2023-06-21 03:28:09,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=868674.0, ans=0.0 2023-06-21 03:28:15,790 INFO [train.py:996] (3/4) Epoch 5, batch 22800, loss[loss=0.2489, simple_loss=0.313, pruned_loss=0.09242, over 21361.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3165, pruned_loss=0.0957, over 4272794.84 frames. 
], batch size: 143, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:29:08,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=868794.0, ans=0.2 2023-06-21 03:29:17,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 2.823e+02 3.345e+02 3.974e+02 6.068e+02, threshold=6.691e+02, percent-clipped=0.0 2023-06-21 03:29:24,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=868854.0, ans=0.125 2023-06-21 03:29:49,106 INFO [train.py:996] (3/4) Epoch 5, batch 22850, loss[loss=0.221, simple_loss=0.2788, pruned_loss=0.08157, over 21559.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3127, pruned_loss=0.09416, over 4279619.87 frames. ], batch size: 263, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:30:01,278 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:31:12,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. limit=10.0 2023-06-21 03:31:35,923 INFO [train.py:996] (3/4) Epoch 5, batch 22900, loss[loss=0.2509, simple_loss=0.3268, pruned_loss=0.08751, over 21747.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3163, pruned_loss=0.09359, over 4279014.37 frames. ], batch size: 351, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:31:59,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=869334.0, ans=0.0 2023-06-21 03:32:37,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=869454.0, ans=0.1 2023-06-21 03:32:39,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.436e+02 3.293e+02 3.917e+02 5.124e+02 7.831e+02, threshold=7.834e+02, percent-clipped=10.0 2023-06-21 03:33:06,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=869514.0, ans=0.125 2023-06-21 03:33:15,482 INFO [train.py:996] (3/4) Epoch 5, batch 22950, loss[loss=0.2342, simple_loss=0.2889, pruned_loss=0.08976, over 19910.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3311, pruned_loss=0.09137, over 4278019.15 frames. ], batch size: 702, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:33:28,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=869574.0, ans=0.125 2023-06-21 03:34:43,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869814.0, ans=0.1 2023-06-21 03:34:52,756 INFO [train.py:996] (3/4) Epoch 5, batch 23000, loss[loss=0.2178, simple_loss=0.284, pruned_loss=0.07584, over 21536.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3283, pruned_loss=0.08887, over 4278645.60 frames. 
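
The scaling.py:1052 WithLoss entries track auxiliary penalties attached directly to intermediate tensors (here the attention weights); loss-sum=0.000e+00 means the penalty is currently inactive. One plausible way to wire such a penalty in, passing the tensor through unchanged while injecting the penalty's gradient as if it had been added to the final loss; this is a sketch of the general pattern, not the icefall code.

    import torch

    class AttachLoss(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, aux_loss):
            ctx.aux_shape = aux_loss.shape
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            # gradient 1.0 w.r.t. aux_loss: equivalent to adding it to the loss
            return grad_out, torch.ones(ctx.aux_shape, device=grad_out.device)

    x = torch.randn(4, 8, requires_grad=True)
    penalty = (x.abs() - 5.0).clamp(min=0.0).sum()  # zero while |x| stays below 5
    y = AttachLoss.apply(x, penalty)
    y.sum().backward()                              # x.grad now includes the penalty term
    print(f"loss-sum={float(penalty):.3e}")         # 0.000e+00, like the entries above
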
], batch size: 195, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:35:25,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=869934.0, ans=0.125 2023-06-21 03:35:39,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=869994.0, ans=0.125 2023-06-21 03:35:56,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.793e+02 3.379e+02 3.965e+02 7.564e+02, threshold=6.759e+02, percent-clipped=0.0 2023-06-21 03:35:56,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=870054.0, ans=0.0 2023-06-21 03:36:01,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=870054.0, ans=0.125 2023-06-21 03:36:12,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=870054.0, ans=0.0 2023-06-21 03:36:43,309 INFO [train.py:996] (3/4) Epoch 5, batch 23050, loss[loss=0.2691, simple_loss=0.344, pruned_loss=0.09713, over 21906.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.33, pruned_loss=0.09151, over 4286152.83 frames. ], batch size: 316, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:36:47,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-21 03:37:02,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-21 03:37:07,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-21 03:37:08,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=870234.0, ans=0.125 2023-06-21 03:38:22,814 INFO [train.py:996] (3/4) Epoch 5, batch 23100, loss[loss=0.2084, simple_loss=0.2701, pruned_loss=0.07335, over 21844.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3248, pruned_loss=0.0918, over 4280384.46 frames. ], batch size: 317, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:38:35,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=15.0 2023-06-21 03:38:41,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=870534.0, ans=0.125 2023-06-21 03:38:53,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=870534.0, ans=0.125 2023-06-21 03:38:54,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=870534.0, ans=0.125 2023-06-21 03:39:20,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.864e+02 3.390e+02 4.261e+02 7.523e+02, threshold=6.780e+02, percent-clipped=3.0 2023-06-21 03:39:40,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=870714.0, ans=0.125 2023-06-21 03:39:50,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870714.0, ans=0.1 2023-06-21 03:40:00,802 INFO [train.py:996] (3/4) Epoch 5, batch 23150, loss[loss=0.2575, simple_loss=0.3146, pruned_loss=0.1002, over 21932.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3181, pruned_loss=0.09059, over 4280298.41 frames. ], batch size: 316, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:40:33,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=870834.0, ans=0.125 2023-06-21 03:41:10,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=870954.0, ans=0.125 2023-06-21 03:41:28,439 INFO [train.py:996] (3/4) Epoch 5, batch 23200, loss[loss=0.2881, simple_loss=0.3335, pruned_loss=0.1213, over 21803.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3169, pruned_loss=0.09146, over 4290599.16 frames. ], batch size: 508, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:41:58,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=871134.0, ans=0.125 2023-06-21 03:42:29,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.864e+02 3.235e+02 3.730e+02 5.431e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-21 03:42:30,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=871254.0, ans=0.125 2023-06-21 03:43:11,485 INFO [train.py:996] (3/4) Epoch 5, batch 23250, loss[loss=0.2391, simple_loss=0.3027, pruned_loss=0.08773, over 21475.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3178, pruned_loss=0.09335, over 4293028.09 frames. ], batch size: 211, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:44:25,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-21 03:44:56,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=871674.0, ans=0.125 2023-06-21 03:44:57,859 INFO [train.py:996] (3/4) Epoch 5, batch 23300, loss[loss=0.263, simple_loss=0.3543, pruned_loss=0.08589, over 20719.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3271, pruned_loss=0.09606, over 4290558.00 frames. 
], batch size: 607, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:45:18,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2023-06-21 03:45:49,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=871854.0, ans=0.0 2023-06-21 03:45:52,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.979e+02 3.447e+02 3.938e+02 6.103e+02, threshold=6.894e+02, percent-clipped=0.0 2023-06-21 03:46:10,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=871854.0, ans=0.125 2023-06-21 03:46:33,339 INFO [train.py:996] (3/4) Epoch 5, batch 23350, loss[loss=0.1939, simple_loss=0.2636, pruned_loss=0.06215, over 21077.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3299, pruned_loss=0.09438, over 4290720.02 frames. ], batch size: 143, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:46:37,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=871974.0, ans=0.125 2023-06-21 03:47:22,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=872094.0, ans=0.0 2023-06-21 03:47:28,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=872154.0, ans=0.025 2023-06-21 03:48:11,189 INFO [train.py:996] (3/4) Epoch 5, batch 23400, loss[loss=0.2463, simple_loss=0.3026, pruned_loss=0.095, over 21447.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3229, pruned_loss=0.09061, over 4288260.32 frames. ], batch size: 176, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:48:21,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=872274.0, ans=0.0 2023-06-21 03:48:21,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=872274.0, ans=0.125 2023-06-21 03:48:48,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=872394.0, ans=0.125 2023-06-21 03:48:58,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=872394.0, ans=0.2 2023-06-21 03:49:17,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.664e+02 3.181e+02 4.182e+02 6.937e+02, threshold=6.362e+02, percent-clipped=1.0 2023-06-21 03:49:52,793 INFO [train.py:996] (3/4) Epoch 5, batch 23450, loss[loss=0.3222, simple_loss=0.3733, pruned_loss=0.1356, over 21934.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3243, pruned_loss=0.09384, over 4275191.01 frames. ], batch size: 372, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:50:24,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872634.0, ans=0.1 2023-06-21 03:51:21,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=872814.0, ans=0.0 2023-06-21 03:51:27,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.20 vs. 
limit=15.0 2023-06-21 03:51:30,830 INFO [train.py:996] (3/4) Epoch 5, batch 23500, loss[loss=0.221, simple_loss=0.2796, pruned_loss=0.08121, over 21137.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3252, pruned_loss=0.09617, over 4282157.81 frames. ], batch size: 608, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:52:02,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=872934.0, ans=0.2 2023-06-21 03:52:04,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=872934.0, ans=0.125 2023-06-21 03:52:38,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 3.046e+02 3.693e+02 4.776e+02 9.117e+02, threshold=7.385e+02, percent-clipped=5.0 2023-06-21 03:52:40,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-21 03:53:08,201 INFO [train.py:996] (3/4) Epoch 5, batch 23550, loss[loss=0.2253, simple_loss=0.2939, pruned_loss=0.07833, over 21429.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3215, pruned_loss=0.0954, over 4271387.45 frames. ], batch size: 389, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:53:27,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=873174.0, ans=6.0 2023-06-21 03:54:46,867 INFO [train.py:996] (3/4) Epoch 5, batch 23600, loss[loss=0.243, simple_loss=0.2809, pruned_loss=0.1026, over 20024.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3198, pruned_loss=0.095, over 4263624.16 frames. ], batch size: 702, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:55:10,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=873534.0, ans=0.125 2023-06-21 03:55:31,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=873594.0, ans=0.125 2023-06-21 03:55:52,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=873654.0, ans=0.0 2023-06-21 03:55:55,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.629e+02 3.088e+02 3.713e+02 7.100e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 03:56:00,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=873654.0, ans=0.2 2023-06-21 03:56:16,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-21 03:56:32,162 INFO [train.py:996] (3/4) Epoch 5, batch 23650, loss[loss=0.3081, simple_loss=0.3737, pruned_loss=0.1212, over 21407.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3213, pruned_loss=0.09421, over 4270226.02 frames. 
], batch size: 507, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:56:58,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=873834.0, ans=0.0 2023-06-21 03:58:07,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=874014.0, ans=0.125 2023-06-21 03:58:13,463 INFO [train.py:996] (3/4) Epoch 5, batch 23700, loss[loss=0.2647, simple_loss=0.3512, pruned_loss=0.08912, over 21284.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3245, pruned_loss=0.09325, over 4271689.74 frames. ], batch size: 549, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:58:27,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=874074.0, ans=0.125 2023-06-21 03:58:34,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874134.0, ans=0.1 2023-06-21 03:58:48,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=874134.0, ans=0.2 2023-06-21 03:58:57,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=22.5 2023-06-21 03:59:14,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 03:59:17,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.032e+02 3.536e+02 4.190e+02 7.050e+02, threshold=7.071e+02, percent-clipped=3.0 2023-06-21 03:59:34,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=874254.0, ans=0.0 2023-06-21 03:59:35,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=874314.0, ans=0.125 2023-06-21 03:59:53,568 INFO [train.py:996] (3/4) Epoch 5, batch 23750, loss[loss=0.2314, simple_loss=0.3168, pruned_loss=0.07298, over 21444.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3272, pruned_loss=0.09396, over 4269199.37 frames. ], batch size: 194, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:59:55,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=874374.0, ans=0.125 2023-06-21 04:00:15,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=874374.0, ans=0.125 2023-06-21 04:00:26,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=874434.0, ans=0.0 2023-06-21 04:00:50,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=874494.0, ans=0.025 2023-06-21 04:00:51,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=874494.0, ans=0.0 2023-06-21 04:01:37,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=874674.0, ans=0.2 2023-06-21 04:01:38,903 INFO [train.py:996] (3/4) Epoch 5, batch 23800, loss[loss=0.2714, simple_loss=0.3626, pruned_loss=0.09009, over 21654.00 frames. 
], tot_loss[loss=0.2515, simple_loss=0.3229, pruned_loss=0.09003, over 4270199.56 frames. ], batch size: 389, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:02:18,290 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:02:25,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-21 04:02:43,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.704e+02 3.208e+02 4.045e+02 9.409e+02, threshold=6.416e+02, percent-clipped=3.0 2023-06-21 04:03:29,785 INFO [train.py:996] (3/4) Epoch 5, batch 23850, loss[loss=0.181, simple_loss=0.2252, pruned_loss=0.0684, over 16192.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3321, pruned_loss=0.09285, over 4267786.75 frames. ], batch size: 61, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:03:52,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=875034.0, ans=0.2 2023-06-21 04:03:58,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-21 04:04:46,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-21 04:05:04,015 INFO [train.py:996] (3/4) Epoch 5, batch 23900, loss[loss=0.2191, simple_loss=0.2817, pruned_loss=0.07824, over 21088.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3404, pruned_loss=0.09559, over 4258642.51 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:05:16,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=875274.0, ans=0.0 2023-06-21 04:06:02,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.113e+02 3.559e+02 4.462e+02 8.067e+02, threshold=7.118e+02, percent-clipped=8.0 2023-06-21 04:06:38,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-21 04:06:41,970 INFO [train.py:996] (3/4) Epoch 5, batch 23950, loss[loss=0.2847, simple_loss=0.3401, pruned_loss=0.1147, over 21623.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3327, pruned_loss=0.09485, over 4262470.56 frames. ], batch size: 441, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:08:21,270 INFO [train.py:996] (3/4) Epoch 5, batch 24000, loss[loss=0.2804, simple_loss=0.3471, pruned_loss=0.1068, over 21473.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3337, pruned_loss=0.09747, over 4270378.57 frames. ], batch size: 211, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:08:21,270 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 04:08:38,093 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2683, simple_loss=0.3693, pruned_loss=0.08367, over 1796401.00 frames. 
2023-06-21 04:08:38,094 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 04:08:51,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=875874.0, ans=0.125 2023-06-21 04:08:58,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=875934.0, ans=0.125 2023-06-21 04:09:19,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=875994.0, ans=0.125 2023-06-21 04:09:35,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=875994.0, ans=0.0 2023-06-21 04:09:37,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=875994.0, ans=0.125 2023-06-21 04:09:43,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.178e+02 3.721e+02 4.593e+02 6.442e+02, threshold=7.441e+02, percent-clipped=0.0 2023-06-21 04:10:18,505 INFO [train.py:996] (3/4) Epoch 5, batch 24050, loss[loss=0.2077, simple_loss=0.3158, pruned_loss=0.04985, over 20868.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.336, pruned_loss=0.09766, over 4273786.21 frames. ], batch size: 608, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:10:28,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=876174.0, ans=0.125 2023-06-21 04:10:32,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-21 04:10:34,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=876234.0, ans=0.035 2023-06-21 04:10:36,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=876234.0, ans=0.125 2023-06-21 04:11:43,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876414.0, ans=0.1 2023-06-21 04:11:56,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=876474.0, ans=0.0 2023-06-21 04:11:57,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-21 04:11:57,806 INFO [train.py:996] (3/4) Epoch 5, batch 24100, loss[loss=0.2564, simple_loss=0.3278, pruned_loss=0.09246, over 21168.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3358, pruned_loss=0.09568, over 4268107.63 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:12:46,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=12.0 2023-06-21 04:12:57,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.915e+02 3.291e+02 4.001e+02 6.877e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 04:13:07,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876654.0, ans=0.1 2023-06-21 04:13:18,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=876714.0, ans=0.125 2023-06-21 04:13:28,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=876714.0, ans=0.0 2023-06-21 04:13:31,043 INFO [train.py:996] (3/4) Epoch 5, batch 24150, loss[loss=0.2811, simple_loss=0.3464, pruned_loss=0.108, over 21815.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3346, pruned_loss=0.0972, over 4278927.44 frames. ], batch size: 124, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:13:33,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-21 04:13:40,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-21 04:14:35,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=876954.0, ans=0.125 2023-06-21 04:15:11,005 INFO [train.py:996] (3/4) Epoch 5, batch 24200, loss[loss=0.2798, simple_loss=0.35, pruned_loss=0.1048, over 21651.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3372, pruned_loss=0.09921, over 4285457.88 frames. ], batch size: 263, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:15:24,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0 2023-06-21 04:15:41,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=877134.0, ans=0.125 2023-06-21 04:15:59,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=877194.0, ans=0.0 2023-06-21 04:16:17,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.992e+02 3.434e+02 4.148e+02 5.774e+02, threshold=6.868e+02, percent-clipped=0.0 2023-06-21 04:16:42,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=877314.0, ans=0.125 2023-06-21 04:16:58,520 INFO [train.py:996] (3/4) Epoch 5, batch 24250, loss[loss=0.2039, simple_loss=0.3077, pruned_loss=0.05005, over 21656.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3343, pruned_loss=0.09241, over 4288031.45 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:17:48,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=877494.0, ans=0.2 2023-06-21 04:17:50,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. 
limit=12.0 2023-06-21 04:18:16,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=877614.0, ans=0.125 2023-06-21 04:18:33,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877614.0, ans=0.1 2023-06-21 04:18:33,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-21 04:18:37,941 INFO [train.py:996] (3/4) Epoch 5, batch 24300, loss[loss=0.2014, simple_loss=0.2805, pruned_loss=0.06111, over 21822.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3256, pruned_loss=0.0864, over 4287976.67 frames. ], batch size: 316, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:18:55,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=877674.0, ans=0.125 2023-06-21 04:19:00,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=877674.0, ans=0.125 2023-06-21 04:19:42,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.429e+02 3.041e+02 4.140e+02 6.830e+02, threshold=6.081e+02, percent-clipped=0.0 2023-06-21 04:20:10,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877914.0, ans=0.1 2023-06-21 04:20:20,904 INFO [train.py:996] (3/4) Epoch 5, batch 24350, loss[loss=0.2252, simple_loss=0.2924, pruned_loss=0.07904, over 21673.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3213, pruned_loss=0.08587, over 4292589.17 frames. ], batch size: 263, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:20:45,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=878034.0, ans=0.2 2023-06-21 04:21:17,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878154.0, ans=0.125 2023-06-21 04:21:35,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=878154.0, ans=0.125 2023-06-21 04:21:55,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=878214.0, ans=0.125 2023-06-21 04:22:02,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-21 04:22:04,665 INFO [train.py:996] (3/4) Epoch 5, batch 24400, loss[loss=0.2195, simple_loss=0.301, pruned_loss=0.06899, over 20736.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3269, pruned_loss=0.09, over 4288603.86 frames. ], batch size: 608, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:22:11,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=878274.0, ans=0.07 2023-06-21 04:22:16,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=878274.0, ans=0.0 2023-06-21 04:22:43,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. 
limit=10.0 2023-06-21 04:23:06,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.330e+02 3.732e+02 4.584e+02 7.697e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-21 04:23:36,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878514.0, ans=0.1 2023-06-21 04:23:44,675 INFO [train.py:996] (3/4) Epoch 5, batch 24450, loss[loss=0.2884, simple_loss=0.3731, pruned_loss=0.1018, over 21676.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3303, pruned_loss=0.09252, over 4286013.95 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:24:24,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=878694.0, ans=0.09899494936611666 2023-06-21 04:25:06,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=878814.0, ans=0.0 2023-06-21 04:25:07,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=878814.0, ans=0.2 2023-06-21 04:25:23,337 INFO [train.py:996] (3/4) Epoch 5, batch 24500, loss[loss=0.2187, simple_loss=0.2829, pruned_loss=0.07718, over 17232.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3312, pruned_loss=0.09253, over 4288335.60 frames. ], batch size: 65, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:25:36,697 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:25:38,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2023-06-21 04:26:31,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.806e+02 3.370e+02 4.048e+02 6.223e+02, threshold=6.740e+02, percent-clipped=0.0 2023-06-21 04:27:02,339 INFO [train.py:996] (3/4) Epoch 5, batch 24550, loss[loss=0.3172, simple_loss=0.3875, pruned_loss=0.1235, over 21558.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3337, pruned_loss=0.09519, over 4288694.48 frames. ], batch size: 414, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:27:06,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=879174.0, ans=0.125 2023-06-21 04:28:42,255 INFO [train.py:996] (3/4) Epoch 5, batch 24600, loss[loss=0.2477, simple_loss=0.3106, pruned_loss=0.0924, over 21737.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3285, pruned_loss=0.09541, over 4279903.75 frames. 
], batch size: 316, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:29:03,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=879534.0, ans=0.0 2023-06-21 04:29:04,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=879534.0, ans=0.015 2023-06-21 04:29:14,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=879534.0, ans=0.0 2023-06-21 04:29:53,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.125e+02 3.627e+02 4.480e+02 7.581e+02, threshold=7.254e+02, percent-clipped=2.0 2023-06-21 04:30:13,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=879714.0, ans=0.125 2023-06-21 04:30:21,426 INFO [train.py:996] (3/4) Epoch 5, batch 24650, loss[loss=0.2126, simple_loss=0.2675, pruned_loss=0.07881, over 21452.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3214, pruned_loss=0.09375, over 4267841.96 frames. ], batch size: 212, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:30:32,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=879774.0, ans=0.125 2023-06-21 04:30:51,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=879834.0, ans=0.125 2023-06-21 04:31:34,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=879954.0, ans=0.05 2023-06-21 04:31:47,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-21 04:31:52,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880014.0, ans=0.1 2023-06-21 04:32:07,113 INFO [train.py:996] (3/4) Epoch 5, batch 24700, loss[loss=0.2305, simple_loss=0.2942, pruned_loss=0.08339, over 15325.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3214, pruned_loss=0.09177, over 4258123.32 frames. ], batch size: 61, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:32:23,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=880134.0, ans=0.125 2023-06-21 04:32:26,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=880134.0, ans=0.125 2023-06-21 04:32:53,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=880194.0, ans=0.125 2023-06-21 04:33:13,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.830e+02 3.084e+02 3.762e+02 5.962e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-21 04:33:13,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=880254.0, ans=0.0 2023-06-21 04:33:39,757 INFO [train.py:996] (3/4) Epoch 5, batch 24750, loss[loss=0.2018, simple_loss=0.2778, pruned_loss=0.06295, over 21496.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3138, pruned_loss=0.08892, over 4259970.66 frames. 
], batch size: 389, lr: 6.02e-03, grad_scale: 8.0 2023-06-21 04:33:40,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5 2023-06-21 04:34:01,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=880434.0, ans=0.125 2023-06-21 04:34:37,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-21 04:34:50,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-21 04:35:00,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-21 04:35:18,240 INFO [train.py:996] (3/4) Epoch 5, batch 24800, loss[loss=0.2279, simple_loss=0.2884, pruned_loss=0.08366, over 21549.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3086, pruned_loss=0.08918, over 4263778.87 frames. ], batch size: 391, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:35:49,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880734.0, ans=0.1 2023-06-21 04:36:03,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-21 04:36:31,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.672e+02 2.950e+02 3.460e+02 6.225e+02, threshold=5.900e+02, percent-clipped=1.0 2023-06-21 04:36:57,239 INFO [train.py:996] (3/4) Epoch 5, batch 24850, loss[loss=0.245, simple_loss=0.2994, pruned_loss=0.09527, over 21192.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.309, pruned_loss=0.09076, over 4275370.05 frames. ], batch size: 608, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:37:13,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-21 04:37:17,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=881034.0, ans=0.125 2023-06-21 04:37:41,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-21 04:38:36,637 INFO [train.py:996] (3/4) Epoch 5, batch 24900, loss[loss=0.277, simple_loss=0.3679, pruned_loss=0.09304, over 21286.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3133, pruned_loss=0.09197, over 4276966.73 frames. ], batch size: 548, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:39:51,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.066e+02 3.454e+02 4.012e+02 6.143e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-21 04:40:22,241 INFO [train.py:996] (3/4) Epoch 5, batch 24950, loss[loss=0.2969, simple_loss=0.3708, pruned_loss=0.1115, over 21760.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3211, pruned_loss=0.0957, over 4276100.62 frames. 
], batch size: 124, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:40:40,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=881574.0, ans=0.125 2023-06-21 04:41:36,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-21 04:42:02,091 INFO [train.py:996] (3/4) Epoch 5, batch 25000, loss[loss=0.2513, simple_loss=0.3187, pruned_loss=0.09201, over 21797.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.328, pruned_loss=0.09747, over 4283403.85 frames. ], batch size: 107, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:42:28,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=881934.0, ans=0.2 2023-06-21 04:43:10,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.862e+02 3.406e+02 4.060e+02 6.504e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 04:43:20,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=882054.0, ans=0.125 2023-06-21 04:43:46,031 INFO [train.py:996] (3/4) Epoch 5, batch 25050, loss[loss=0.2461, simple_loss=0.2946, pruned_loss=0.09881, over 21488.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3199, pruned_loss=0.09541, over 4277606.48 frames. ], batch size: 212, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:43:46,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882174.0, ans=0.1 2023-06-21 04:44:02,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=882174.0, ans=6.0 2023-06-21 04:44:10,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=882234.0, ans=0.125 2023-06-21 04:45:00,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=882354.0, ans=0.95 2023-06-21 04:45:05,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=882414.0, ans=0.2 2023-06-21 04:45:06,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=882414.0, ans=0.125 2023-06-21 04:45:13,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882414.0, ans=0.1 2023-06-21 04:45:20,793 INFO [train.py:996] (3/4) Epoch 5, batch 25100, loss[loss=0.2531, simple_loss=0.3087, pruned_loss=0.09873, over 21328.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3131, pruned_loss=0.09335, over 4283633.03 frames. ], batch size: 144, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:45:55,293 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:45:56,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=882534.0, ans=0.1 2023-06-21 04:46:20,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. 
limit=15.0 2023-06-21 04:46:21,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882654.0, ans=0.1 2023-06-21 04:46:29,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.734e+02 3.137e+02 3.918e+02 6.199e+02, threshold=6.274e+02, percent-clipped=0.0 2023-06-21 04:46:40,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=882654.0, ans=0.1 2023-06-21 04:46:44,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=882714.0, ans=0.035 2023-06-21 04:46:55,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=882714.0, ans=0.125 2023-06-21 04:46:59,074 INFO [train.py:996] (3/4) Epoch 5, batch 25150, loss[loss=0.2288, simple_loss=0.3153, pruned_loss=0.07118, over 21829.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3172, pruned_loss=0.09109, over 4265650.60 frames. ], batch size: 351, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:47:01,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=882774.0, ans=0.125 2023-06-21 04:47:07,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=882774.0, ans=0.0 2023-06-21 04:47:33,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-21 04:48:17,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=883014.0, ans=0.125 2023-06-21 04:48:28,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=883014.0, ans=0.1 2023-06-21 04:48:37,286 INFO [train.py:996] (3/4) Epoch 5, batch 25200, loss[loss=0.2191, simple_loss=0.3009, pruned_loss=0.06864, over 21702.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3167, pruned_loss=0.0891, over 4263811.49 frames. ], batch size: 247, lr: 6.02e-03, grad_scale: 32.0 2023-06-21 04:49:31,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.83 vs. limit=22.5 2023-06-21 04:49:34,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=883194.0, ans=0.0 2023-06-21 04:49:42,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=883254.0, ans=0.0 2023-06-21 04:49:46,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.669e+02 3.257e+02 4.012e+02 7.318e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-21 04:50:17,407 INFO [train.py:996] (3/4) Epoch 5, batch 25250, loss[loss=0.2801, simple_loss=0.3184, pruned_loss=0.1209, over 21363.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3136, pruned_loss=0.0874, over 4267607.33 frames. 
], batch size: 508, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:50:48,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=883434.0, ans=0.125 2023-06-21 04:50:57,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=883494.0, ans=0.125 2023-06-21 04:51:57,236 INFO [train.py:996] (3/4) Epoch 5, batch 25300, loss[loss=0.2315, simple_loss=0.3067, pruned_loss=0.07814, over 21707.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3114, pruned_loss=0.08729, over 4271767.52 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:52:06,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=883674.0, ans=0.125 2023-06-21 04:52:27,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=883734.0, ans=0.2 2023-06-21 04:53:02,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.770e+02 3.140e+02 3.813e+02 4.907e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-21 04:53:18,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-21 04:53:32,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=883974.0, ans=0.2 2023-06-21 04:53:33,694 INFO [train.py:996] (3/4) Epoch 5, batch 25350, loss[loss=0.1943, simple_loss=0.2802, pruned_loss=0.05418, over 21612.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3124, pruned_loss=0.0863, over 4264441.90 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:53:48,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=883974.0, ans=0.0 2023-06-21 04:54:00,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=884034.0, ans=0.125 2023-06-21 04:54:24,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.18 vs. limit=10.0 2023-06-21 04:55:03,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884214.0, ans=0.1 2023-06-21 04:55:07,843 INFO [train.py:996] (3/4) Epoch 5, batch 25400, loss[loss=0.2202, simple_loss=0.2795, pruned_loss=0.0804, over 21377.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3076, pruned_loss=0.08559, over 4259571.78 frames. ], batch size: 194, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:55:56,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=884394.0, ans=0.125 2023-06-21 04:56:15,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.763e+02 3.058e+02 3.669e+02 6.374e+02, threshold=6.116e+02, percent-clipped=1.0 2023-06-21 04:56:46,796 INFO [train.py:996] (3/4) Epoch 5, batch 25450, loss[loss=0.2415, simple_loss=0.3376, pruned_loss=0.07272, over 21631.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3082, pruned_loss=0.08646, over 4260872.72 frames. 
], batch size: 263, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:56:58,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=15.0 2023-06-21 04:58:15,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884814.0, ans=0.1 2023-06-21 04:58:31,838 INFO [train.py:996] (3/4) Epoch 5, batch 25500, loss[loss=0.2739, simple_loss=0.3542, pruned_loss=0.09677, over 21576.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3078, pruned_loss=0.08248, over 4246085.92 frames. ], batch size: 414, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 04:59:20,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-21 04:59:44,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.815e+02 3.207e+02 3.771e+02 6.756e+02, threshold=6.413e+02, percent-clipped=1.0 2023-06-21 04:59:53,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=885114.0, ans=0.125 2023-06-21 04:59:59,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=885114.0, ans=0.125 2023-06-21 05:00:13,093 INFO [train.py:996] (3/4) Epoch 5, batch 25550, loss[loss=0.3243, simple_loss=0.4017, pruned_loss=0.1234, over 21426.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3157, pruned_loss=0.08333, over 4250375.23 frames. ], batch size: 507, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 05:00:22,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-21 05:01:21,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885354.0, ans=0.1 2023-06-21 05:01:31,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.53 vs. limit=15.0 2023-06-21 05:01:32,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=885414.0, ans=0.0 2023-06-21 05:01:54,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=885414.0, ans=0.02 2023-06-21 05:02:02,399 INFO [train.py:996] (3/4) Epoch 5, batch 25600, loss[loss=0.2935, simple_loss=0.3523, pruned_loss=0.1174, over 21859.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3201, pruned_loss=0.08482, over 4254917.74 frames. ], batch size: 371, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:02:59,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=885654.0, ans=0.125 2023-06-21 05:03:03,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.775e+02 3.286e+02 3.783e+02 5.833e+02, threshold=6.573e+02, percent-clipped=0.0 2023-06-21 05:03:22,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.13 vs. limit=15.0 2023-06-21 05:03:41,862 INFO [train.py:996] (3/4) Epoch 5, batch 25650, loss[loss=0.2369, simple_loss=0.2942, pruned_loss=0.08984, over 21637.00 frames. 
], tot_loss[loss=0.2495, simple_loss=0.3215, pruned_loss=0.08869, over 4252315.62 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:03:48,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=885774.0, ans=0.125 2023-06-21 05:04:46,742 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:04:47,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-21 05:04:57,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886014.0, ans=0.1 2023-06-21 05:05:21,298 INFO [train.py:996] (3/4) Epoch 5, batch 25700, loss[loss=0.2454, simple_loss=0.3226, pruned_loss=0.08411, over 21634.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3178, pruned_loss=0.0899, over 4250595.69 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:05:34,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=886074.0, ans=0.125 2023-06-21 05:05:54,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=886134.0, ans=0.0 2023-06-21 05:06:15,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=886254.0, ans=0.125 2023-06-21 05:06:23,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.865e+02 3.376e+02 4.055e+02 7.604e+02, threshold=6.752e+02, percent-clipped=2.0 2023-06-21 05:06:46,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-21 05:06:58,882 INFO [train.py:996] (3/4) Epoch 5, batch 25750, loss[loss=0.2762, simple_loss=0.3395, pruned_loss=0.1065, over 21198.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3238, pruned_loss=0.09234, over 4249276.90 frames. ], batch size: 143, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:07:06,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=886374.0, ans=0.0 2023-06-21 05:07:50,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=886494.0, ans=0.125 2023-06-21 05:08:14,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=886554.0, ans=0.2 2023-06-21 05:08:14,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=886554.0, ans=0.2 2023-06-21 05:08:17,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=886554.0, ans=0.2 2023-06-21 05:08:22,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-21 05:08:46,846 INFO [train.py:996] (3/4) Epoch 5, batch 25800, loss[loss=0.2683, simple_loss=0.3454, pruned_loss=0.09561, over 21451.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3384, pruned_loss=0.09802, over 4252866.95 frames. 
], batch size: 194, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:09:22,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=886734.0, ans=0.125
2023-06-21 05:09:25,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=886734.0, ans=0.0
2023-06-21 05:09:34,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0
2023-06-21 05:09:59,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.944e+02 3.581e+02 4.306e+02 8.254e+02, threshold=7.162e+02, percent-clipped=3.0
2023-06-21 05:10:17,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=886914.0, ans=0.025
2023-06-21 05:10:26,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5
2023-06-21 05:10:26,641 INFO [train.py:996] (3/4) Epoch 5, batch 25850, loss[loss=0.2717, simple_loss=0.3325, pruned_loss=0.1055, over 21838.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3382, pruned_loss=0.09612, over 4250347.15 frames. ], batch size: 124, lr: 6.00e-03, grad_scale: 16.0
2023-06-21 05:10:58,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=22.5
2023-06-21 05:11:25,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0
2023-06-21 05:11:56,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=887214.0, ans=0.125
2023-06-21 05:12:01,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=887214.0, ans=0.125
2023-06-21 05:12:07,862 INFO [train.py:996] (3/4) Epoch 5, batch 25900, loss[loss=0.3444, simple_loss=0.4421, pruned_loss=0.1234, over 20894.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.34, pruned_loss=0.0972, over 4259938.29 frames. ], batch size: 607, lr: 6.00e-03, grad_scale: 16.0
2023-06-21 05:12:14,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0
2023-06-21 05:12:35,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887334.0, ans=0.125
2023-06-21 05:12:54,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=887394.0, ans=0.0
2023-06-21 05:13:26,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.088e+02 3.549e+02 4.240e+02 5.933e+02, threshold=7.098e+02, percent-clipped=0.0
2023-06-21 05:13:58,657 INFO [train.py:996] (3/4) Epoch 5, batch 25950, loss[loss=0.2858, simple_loss=0.3548, pruned_loss=0.1084, over 21394.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3467, pruned_loss=0.1004, over 4260959.02 frames. ], batch size: 194, lr: 6.00e-03, grad_scale: 16.0
2023-06-21 05:14:40,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=887694.0, ans=0.125
2023-06-21 05:15:40,348 INFO [train.py:996] (3/4) Epoch 5, batch 26000, loss[loss=0.2267, simple_loss=0.3146, pruned_loss=0.06944, over 21701.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3464, pruned_loss=0.09998, over 4266964.98 frames. ], batch size: 298, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:15:51,041 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:16:08,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0
2023-06-21 05:16:19,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=887934.0, ans=0.125
2023-06-21 05:16:52,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.994e+02 3.502e+02 4.127e+02 6.076e+02, threshold=7.004e+02, percent-clipped=0.0
2023-06-21 05:16:53,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0
2023-06-21 05:17:07,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0
2023-06-21 05:17:19,588 INFO [train.py:996] (3/4) Epoch 5, batch 26050, loss[loss=0.2468, simple_loss=0.3085, pruned_loss=0.09255, over 21600.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3453, pruned_loss=0.1007, over 4277795.60 frames. ], batch size: 548, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:17:21,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0
2023-06-21 05:18:09,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=22.5
2023-06-21 05:18:58,126 INFO [train.py:996] (3/4) Epoch 5, batch 26100, loss[loss=0.2111, simple_loss=0.2787, pruned_loss=0.07178, over 21466.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3386, pruned_loss=0.09989, over 4278018.62 frames. ], batch size: 194, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:19:47,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=888594.0, ans=0.125
2023-06-21 05:19:55,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=888654.0, ans=0.125
2023-06-21 05:19:57,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0
2023-06-21 05:20:05,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 2.983e+02 3.615e+02 4.836e+02 1.225e+03, threshold=7.230e+02, percent-clipped=7.0
2023-06-21 05:20:39,119 INFO [train.py:996] (3/4) Epoch 5, batch 26150, loss[loss=0.2529, simple_loss=0.3223, pruned_loss=0.09176, over 21625.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3377, pruned_loss=0.1003, over 4282488.43 frames. ], batch size: 230, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:21:37,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=888954.0, ans=0.05
2023-06-21 05:21:59,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=889014.0, ans=0.1
2023-06-21 05:22:00,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=889014.0, ans=0.2
2023-06-21 05:22:20,380 INFO [train.py:996] (3/4) Epoch 5, batch 26200, loss[loss=0.288, simple_loss=0.3812, pruned_loss=0.09742, over 21719.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3383, pruned_loss=0.09804, over 4283437.45 frames. ], batch size: 351, lr: 6.00e-03, grad_scale: 32.0
2023-06-21 05:22:25,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=889074.0, ans=0.5
2023-06-21 05:22:27,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=889074.0, ans=10.0
2023-06-21 05:22:29,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=889074.0, ans=0.0
2023-06-21 05:22:49,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=889134.0, ans=0.0
2023-06-21 05:23:02,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=889194.0, ans=0.125
2023-06-21 05:23:07,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889194.0, ans=0.1
2023-06-21 05:23:33,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.909e+02 3.359e+02 4.257e+02 6.778e+02, threshold=6.718e+02, percent-clipped=0.0
2023-06-21 05:23:47,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0
2023-06-21 05:24:01,201 INFO [train.py:996] (3/4) Epoch 5, batch 26250, loss[loss=0.2615, simple_loss=0.3303, pruned_loss=0.09633, over 21179.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3422, pruned_loss=0.09719, over 4278670.53 frames. ], batch size: 608, lr: 5.99e-03, grad_scale: 32.0
2023-06-21 05:24:02,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5
2023-06-21 05:24:17,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=889374.0, ans=0.0
2023-06-21 05:24:25,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=889434.0, ans=0.125
2023-06-21 05:24:30,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=889434.0, ans=0.125
2023-06-21 05:24:38,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=889494.0, ans=0.0
2023-06-21 05:25:06,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=889554.0, ans=0.2
2023-06-21 05:25:14,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=889554.0, ans=0.125
2023-06-21 05:25:16,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=889554.0, ans=0.0
2023-06-21 05:25:34,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5
2023-06-21 05:25:39,634 INFO [train.py:996] (3/4) Epoch 5, batch 26300, loss[loss=0.2662, simple_loss=0.3266, pruned_loss=0.103, over 21290.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3381, pruned_loss=0.09728, over 4282401.11 frames. ], batch size: 176, lr: 5.99e-03, grad_scale: 32.0
2023-06-21 05:25:58,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=889734.0, ans=0.2
2023-06-21 05:26:00,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=889734.0, ans=0.2
2023-06-21 05:26:08,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889734.0, ans=0.125
2023-06-21 05:26:08,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=889734.0, ans=0.2
2023-06-21 05:26:15,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889794.0, ans=0.1
2023-06-21 05:26:58,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.888e+02 3.226e+02 3.870e+02 6.035e+02, threshold=6.451e+02, percent-clipped=0.0
2023-06-21 05:27:08,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.64 vs. limit=22.5
2023-06-21 05:27:25,028 INFO [train.py:996] (3/4) Epoch 5, batch 26350, loss[loss=0.2757, simple_loss=0.347, pruned_loss=0.1023, over 21869.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3365, pruned_loss=0.09779, over 4286337.42 frames. ], batch size: 118, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:27:49,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0
2023-06-21 05:28:09,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=890094.0, ans=0.2
2023-06-21 05:28:29,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0
2023-06-21 05:28:30,236 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:28:46,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=890214.0, ans=0.125
2023-06-21 05:28:59,009 INFO [train.py:996] (3/4) Epoch 5, batch 26400, loss[loss=0.2467, simple_loss=0.2948, pruned_loss=0.0993, over 21538.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3303, pruned_loss=0.09764, over 4283479.73 frames. ], batch size: 441, lr: 5.99e-03, grad_scale: 32.0
2023-06-21 05:29:04,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=890274.0, ans=0.125
2023-06-21 05:30:10,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.951e+02 3.748e+02 4.421e+02 1.228e+03, threshold=7.496e+02, percent-clipped=6.0
2023-06-21 05:30:33,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=890514.0, ans=0.2
2023-06-21 05:30:35,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=890514.0, ans=0.1
2023-06-21 05:30:37,980 INFO [train.py:996] (3/4) Epoch 5, batch 26450, loss[loss=0.2859, simple_loss=0.3874, pruned_loss=0.09222, over 21651.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3304, pruned_loss=0.09745, over 4257400.81 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 32.0
2023-06-21 05:31:12,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=890634.0, ans=0.1
2023-06-21 05:31:25,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=890694.0, ans=0.125
2023-06-21 05:31:47,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=890754.0, ans=0.125
2023-06-21 05:31:55,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5
2023-06-21 05:31:56,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0
2023-06-21 05:32:19,172 INFO [train.py:996] (3/4) Epoch 5, batch 26500, loss[loss=0.2102, simple_loss=0.2784, pruned_loss=0.07099, over 21442.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3321, pruned_loss=0.09611, over 4260880.62 frames. ], batch size: 194, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:32:26,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=890874.0, ans=0.0
2023-06-21 05:32:31,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=890874.0, ans=0.0
2023-06-21 05:32:54,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=890934.0, ans=0.125
2023-06-21 05:33:11,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=890994.0, ans=0.125
2023-06-21 05:33:34,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=891054.0, ans=0.0
2023-06-21 05:33:42,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.071e+02 3.801e+02 4.543e+02 1.004e+03, threshold=7.603e+02, percent-clipped=5.0
2023-06-21 05:34:01,413 INFO [train.py:996] (3/4) Epoch 5, batch 26550, loss[loss=0.2536, simple_loss=0.3606, pruned_loss=0.07327, over 19707.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3296, pruned_loss=0.09288, over 4260359.96 frames. ], batch size: 703, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:34:08,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891174.0, ans=0.1
2023-06-21 05:35:08,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5
2023-06-21 05:35:19,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=891354.0, ans=0.125
2023-06-21 05:35:40,869 INFO [train.py:996] (3/4) Epoch 5, batch 26600, loss[loss=0.2219, simple_loss=0.2935, pruned_loss=0.07515, over 21821.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3296, pruned_loss=0.08978, over 4256103.06 frames. ], batch size: 118, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:35:51,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=891474.0, ans=0.125
2023-06-21 05:36:25,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=891594.0, ans=0.0
2023-06-21 05:36:34,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=891594.0, ans=0.125
2023-06-21 05:36:41,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891654.0, ans=0.1
2023-06-21 05:36:43,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0
2023-06-21 05:36:57,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891654.0, ans=0.1
2023-06-21 05:36:59,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.044e+02 3.559e+02 4.505e+02 6.702e+02, threshold=7.118e+02, percent-clipped=0.0
2023-06-21 05:37:23,622 INFO [train.py:996] (3/4) Epoch 5, batch 26650, loss[loss=0.2367, simple_loss=0.2904, pruned_loss=0.09151, over 21624.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.321, pruned_loss=0.08823, over 4257067.89 frames. ], batch size: 247, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:37:43,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0
2023-06-21 05:37:52,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=891834.0, ans=0.125
2023-06-21 05:38:00,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=891894.0, ans=0.125
2023-06-21 05:39:01,920 INFO [train.py:996] (3/4) Epoch 5, batch 26700, loss[loss=0.2122, simple_loss=0.285, pruned_loss=0.06973, over 21923.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.313, pruned_loss=0.08437, over 4267129.68 frames. ], batch size: 333, lr: 5.99e-03, grad_scale: 16.0
2023-06-21 05:39:29,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=892134.0, ans=0.2
2023-06-21 05:39:32,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=892134.0, ans=0.2
2023-06-21 05:39:45,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=892194.0, ans=0.1
2023-06-21 05:39:59,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=892194.0, ans=0.125
2023-06-21 05:40:18,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.509e+02 2.895e+02 3.334e+02 4.980e+02, threshold=5.790e+02, percent-clipped=0.0
2023-06-21 05:40:47,742 INFO [train.py:996] (3/4) Epoch 5, batch 26750, loss[loss=0.2582, simple_loss=0.3417, pruned_loss=0.08738, over 21658.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3141, pruned_loss=0.08415, over 4275506.45 frames. ], batch size: 389, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:40:52,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0
2023-06-21 05:41:19,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0
2023-06-21 05:42:23,650 INFO [train.py:996] (3/4) Epoch 5, batch 26800, loss[loss=0.3562, simple_loss=0.4013, pruned_loss=0.1556, over 21438.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3224, pruned_loss=0.08895, over 4272925.27 frames. ], batch size: 510, lr: 5.98e-03, grad_scale: 32.0
2023-06-21 05:42:35,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0
2023-06-21 05:43:12,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=892794.0, ans=0.0
2023-06-21 05:43:12,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=892794.0, ans=0.125
2023-06-21 05:43:13,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.48 vs. limit=15.0
2023-06-21 05:43:13,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=892794.0, ans=0.125
2023-06-21 05:43:13,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=892794.0, ans=0.125
2023-06-21 05:43:25,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=22.5
2023-06-21 05:43:33,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.59 vs. limit=10.0
2023-06-21 05:43:45,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 3.040e+02 3.566e+02 4.548e+02 6.934e+02, threshold=7.132e+02, percent-clipped=4.0
2023-06-21 05:44:03,856 INFO [train.py:996] (3/4) Epoch 5, batch 26850, loss[loss=0.2377, simple_loss=0.2962, pruned_loss=0.08954, over 21438.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3234, pruned_loss=0.09161, over 4271188.95 frames. ], batch size: 131, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:44:07,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=892974.0, ans=0.125
2023-06-21 05:44:59,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5
2023-06-21 05:45:25,486 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:45:27,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0
2023-06-21 05:45:43,437 INFO [train.py:996] (3/4) Epoch 5, batch 26900, loss[loss=0.2027, simple_loss=0.2585, pruned_loss=0.07341, over 21442.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3146, pruned_loss=0.09046, over 4267374.49 frames. ], batch size: 212, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:45:59,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=893274.0, ans=0.0
2023-06-21 05:46:12,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=893334.0, ans=0.125
2023-06-21 05:46:29,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=893394.0, ans=0.0
2023-06-21 05:47:04,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.809e+02 3.300e+02 3.699e+02 7.956e+02, threshold=6.601e+02, percent-clipped=1.0
2023-06-21 05:47:16,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0
2023-06-21 05:47:22,080 INFO [train.py:996] (3/4) Epoch 5, batch 26950, loss[loss=0.2507, simple_loss=0.3376, pruned_loss=0.08189, over 21676.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.314, pruned_loss=0.09045, over 4264540.53 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:47:30,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=893574.0, ans=0.0
2023-06-21 05:48:27,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=893754.0, ans=0.1
2023-06-21 05:48:33,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=893754.0, ans=0.1
2023-06-21 05:48:38,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=893754.0, ans=0.125
2023-06-21 05:48:42,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5
2023-06-21 05:48:56,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=893814.0, ans=0.125
2023-06-21 05:48:57,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=893814.0, ans=0.0
2023-06-21 05:49:00,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0
2023-06-21 05:49:02,209 INFO [train.py:996] (3/4) Epoch 5, batch 27000, loss[loss=0.2272, simple_loss=0.3151, pruned_loss=0.06966, over 21637.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.315, pruned_loss=0.08901, over 4259385.58 frames. ], batch size: 263, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:49:02,210 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 05:49:20,487 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2444, simple_loss=0.3449, pruned_loss=0.07195, over 1796401.00 frames.
2023-06-21 05:49:20,488 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-21 05:49:21,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=893874.0, ans=0.125
2023-06-21 05:49:40,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=893934.0, ans=10.0
2023-06-21 05:49:55,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0
2023-06-21 05:50:38,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.497e+02 2.990e+02 3.496e+02 4.876e+02, threshold=5.980e+02, percent-clipped=0.0
2023-06-21 05:50:51,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0
2023-06-21 05:50:58,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=894114.0, ans=0.125
2023-06-21 05:51:01,167 INFO [train.py:996] (3/4) Epoch 5, batch 27050, loss[loss=0.2387, simple_loss=0.3162, pruned_loss=0.08062, over 21710.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3168, pruned_loss=0.08579, over 4263887.71 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:51:08,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=894174.0, ans=0.0
2023-06-21 05:51:17,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=894234.0, ans=0.04949747468305833
2023-06-21 05:51:43,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=894294.0, ans=0.0
2023-06-21 05:51:51,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=894294.0, ans=0.0
2023-06-21 05:52:42,201 INFO [train.py:996] (3/4) Epoch 5, batch 27100, loss[loss=0.2316, simple_loss=0.2922, pruned_loss=0.08548, over 21729.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3194, pruned_loss=0.0866, over 4272172.23 frames. ], batch size: 264, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:53:12,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=894534.0, ans=0.125
2023-06-21 05:54:00,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.188e+02 3.922e+02 5.852e+02 9.183e+02, threshold=7.845e+02, percent-clipped=23.0
2023-06-21 05:54:09,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=894714.0, ans=0.125
2023-06-21 05:54:18,400 INFO [train.py:996] (3/4) Epoch 5, batch 27150, loss[loss=0.2647, simple_loss=0.3494, pruned_loss=0.09, over 21695.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3304, pruned_loss=0.08991, over 4272933.00 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0
2023-06-21 05:55:33,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=894954.0, ans=0.0
2023-06-21 05:55:36,473 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 05:55:37,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=894954.0, ans=0.125
2023-06-21 05:55:58,211 INFO [train.py:996] (3/4) Epoch 5, batch 27200, loss[loss=0.2899, simple_loss=0.3658, pruned_loss=0.107, over 21589.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3383, pruned_loss=0.09331, over 4272876.18 frames. ], batch size: 389, lr: 5.98e-03, grad_scale: 32.0
2023-06-21 05:57:23,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.260e+02 3.703e+02 4.553e+02 9.386e+02, threshold=7.407e+02, percent-clipped=2.0
2023-06-21 05:57:52,265 INFO [train.py:996] (3/4) Epoch 5, batch 27250, loss[loss=0.2983, simple_loss=0.3598, pruned_loss=0.1185, over 21331.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3431, pruned_loss=0.09826, over 4275006.77 frames. ], batch size: 176, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 05:58:03,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=895374.0, ans=0.125
2023-06-21 05:59:33,980 INFO [train.py:996] (3/4) Epoch 5, batch 27300, loss[loss=0.3067, simple_loss=0.3768, pruned_loss=0.1183, over 21570.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3441, pruned_loss=0.09914, over 4266153.19 frames. ], batch size: 131, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 05:59:59,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.74 vs. limit=10.0
2023-06-21 06:00:20,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=895794.0, ans=0.125
2023-06-21 06:00:57,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.999e+02 3.424e+02 4.068e+02 6.879e+02, threshold=6.848e+02, percent-clipped=0.0
2023-06-21 06:01:15,079 INFO [train.py:996] (3/4) Epoch 5, batch 27350, loss[loss=0.3036, simple_loss=0.3668, pruned_loss=0.1202, over 21856.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3481, pruned_loss=0.1004, over 4260759.84 frames. ], batch size: 124, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 06:01:24,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=895974.0, ans=0.0
2023-06-21 06:02:54,638 INFO [train.py:996] (3/4) Epoch 5, batch 27400, loss[loss=0.2501, simple_loss=0.3078, pruned_loss=0.09622, over 21658.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3424, pruned_loss=0.09936, over 4268843.75 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 06:02:55,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=896274.0, ans=0.125
2023-06-21 06:03:10,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.92 vs. limit=10.0
2023-06-21 06:03:13,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5
2023-06-21 06:03:38,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0
2023-06-21 06:03:50,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=896394.0, ans=0.125
2023-06-21 06:04:08,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.765e+02 3.152e+02 3.980e+02 5.730e+02, threshold=6.304e+02, percent-clipped=0.0
2023-06-21 06:04:14,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=896514.0, ans=0.125
2023-06-21 06:04:34,753 INFO [train.py:996] (3/4) Epoch 5, batch 27450, loss[loss=0.2507, simple_loss=0.3403, pruned_loss=0.08054, over 21591.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3357, pruned_loss=0.09724, over 4272021.04 frames. ], batch size: 389, lr: 5.97e-03, grad_scale: 16.0
2023-06-21 06:05:06,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=896694.0, ans=0.0
2023-06-21 06:05:20,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=896694.0, ans=0.2
2023-06-21 06:06:14,862 INFO [train.py:996] (3/4) Epoch 5, batch 27500, loss[loss=0.2682, simple_loss=0.3307, pruned_loss=0.1028, over 21238.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3345, pruned_loss=0.09789, over 4279294.50 frames. ], batch size: 143, lr: 5.97e-03, grad_scale: 16.0
2023-06-21 06:06:41,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=896934.0, ans=0.2
2023-06-21 06:07:15,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=897054.0, ans=0.125
2023-06-21 06:07:29,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.898e+02 3.228e+02 3.815e+02 7.854e+02, threshold=6.456e+02, percent-clipped=2.0
2023-06-21 06:07:54,330 INFO [train.py:996] (3/4) Epoch 5, batch 27550, loss[loss=0.1971, simple_loss=0.2683, pruned_loss=0.06295, over 21637.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3293, pruned_loss=0.09438, over 4280124.41 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 16.0
2023-06-21 06:08:24,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=897234.0, ans=0.0
2023-06-21 06:08:55,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0
2023-06-21 06:09:29,799 INFO [train.py:996] (3/4) Epoch 5, batch 27600, loss[loss=0.2496, simple_loss=0.3054, pruned_loss=0.09688, over 22016.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.323, pruned_loss=0.09332, over 4277305.17 frames. ], batch size: 103, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 06:09:34,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=897474.0, ans=0.125
2023-06-21 06:09:35,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0
2023-06-21 06:09:47,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=897534.0, ans=0.07
2023-06-21 06:10:20,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0
2023-06-21 06:10:43,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.760e+02 3.130e+02 3.904e+02 5.692e+02, threshold=6.260e+02, percent-clipped=0.0
2023-06-21 06:10:53,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=897714.0, ans=0.0
2023-06-21 06:10:54,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5
2023-06-21 06:11:08,630 INFO [train.py:996] (3/4) Epoch 5, batch 27650, loss[loss=0.2666, simple_loss=0.3297, pruned_loss=0.1018, over 16656.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3174, pruned_loss=0.09267, over 4261850.05 frames. ], batch size: 62, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 06:11:15,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5
2023-06-21 06:11:33,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5
2023-06-21 06:11:40,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=897894.0, ans=0.125
2023-06-21 06:11:56,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=897894.0, ans=0.0
2023-06-21 06:11:58,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=897894.0, ans=0.125
2023-06-21 06:12:47,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=898074.0, ans=0.0
2023-06-21 06:12:49,023 INFO [train.py:996] (3/4) Epoch 5, batch 27700, loss[loss=0.3005, simple_loss=0.3804, pruned_loss=0.1103, over 21748.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3176, pruned_loss=0.09127, over 4258927.87 frames. ], batch size: 332, lr: 5.97e-03, grad_scale: 32.0
2023-06-21 06:13:15,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=898134.0, ans=0.0
2023-06-21 06:13:25,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5
2023-06-21 06:13:26,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898194.0, ans=0.1
2023-06-21 06:13:59,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0
2023-06-21 06:14:07,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.066e+02 3.761e+02 4.326e+02 8.310e+02, threshold=7.523e+02, percent-clipped=4.0
2023-06-21 06:14:28,397 INFO [train.py:996] (3/4) Epoch 5, batch 27750, loss[loss=0.2374, simple_loss=0.314, pruned_loss=0.08036, over 21802.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.32, pruned_loss=0.09023, over 4264340.28 frames. ], batch size: 414, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:14:51,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=898434.0, ans=0.0
2023-06-21 06:14:54,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=898434.0, ans=0.125
2023-06-21 06:15:45,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=898614.0, ans=0.125
2023-06-21 06:16:06,796 INFO [train.py:996] (3/4) Epoch 5, batch 27800, loss[loss=0.2812, simple_loss=0.3416, pruned_loss=0.1104, over 21886.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3206, pruned_loss=0.09106, over 4277213.31 frames. ], batch size: 118, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:16:08,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=898674.0, ans=0.125
2023-06-21 06:16:49,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0
2023-06-21 06:17:12,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898854.0, ans=0.1
2023-06-21 06:17:26,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.753e+02 3.252e+02 3.951e+02 6.290e+02, threshold=6.504e+02, percent-clipped=0.0
2023-06-21 06:17:30,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=898914.0, ans=0.125
2023-06-21 06:17:48,405 INFO [train.py:996] (3/4) Epoch 5, batch 27850, loss[loss=0.2144, simple_loss=0.2771, pruned_loss=0.07586, over 21200.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3201, pruned_loss=0.09241, over 4288428.38 frames. ], batch size: 608, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:19:29,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=899214.0, ans=0.0
2023-06-21 06:19:31,877 INFO [train.py:996] (3/4) Epoch 5, batch 27900, loss[loss=0.2236, simple_loss=0.3119, pruned_loss=0.06763, over 21657.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3295, pruned_loss=0.09362, over 4292196.89 frames. ], batch size: 263, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:19:59,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=899334.0, ans=0.0
2023-06-21 06:20:27,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=899394.0, ans=0.125
2023-06-21 06:20:28,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=899394.0, ans=0.125
2023-06-21 06:20:44,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=899454.0, ans=0.0
2023-06-21 06:20:54,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=899454.0, ans=0.0
2023-06-21 06:20:57,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.911e+02 3.342e+02 3.967e+02 6.742e+02, threshold=6.683e+02, percent-clipped=1.0
2023-06-21 06:21:07,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=899514.0, ans=0.2
2023-06-21 06:21:19,104 INFO [train.py:996] (3/4) Epoch 5, batch 27950, loss[loss=0.3059, simple_loss=0.3846, pruned_loss=0.1136, over 21717.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3291, pruned_loss=0.08967, over 4286908.21 frames. ], batch size: 441, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:21:26,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=899574.0, ans=0.125
2023-06-21 06:21:34,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.94 vs. limit=10.0
2023-06-21 06:21:49,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=899634.0, ans=0.5
2023-06-21 06:22:09,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=899694.0, ans=0.0
2023-06-21 06:22:29,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=899754.0, ans=0.07
2023-06-21 06:22:36,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=899814.0, ans=0.1
2023-06-21 06:22:38,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899814.0, ans=0.1
2023-06-21 06:22:59,473 INFO [train.py:996] (3/4) Epoch 5, batch 28000, loss[loss=0.2584, simple_loss=0.3142, pruned_loss=0.1013, over 21422.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3262, pruned_loss=0.08684, over 4289116.93 frames. ], batch size: 144, lr: 5.96e-03, grad_scale: 32.0
2023-06-21 06:23:25,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0
2023-06-21 06:24:16,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=900054.0, ans=0.125
2023-06-21 06:24:19,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.872e+02 3.186e+02 3.800e+02 5.572e+02, threshold=6.373e+02, percent-clipped=0.0
2023-06-21 06:24:32,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=900114.0, ans=0.2
2023-06-21 06:24:40,547 INFO [train.py:996] (3/4) Epoch 5, batch 28050, loss[loss=0.2519, simple_loss=0.331, pruned_loss=0.08642, over 21730.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3248, pruned_loss=0.08872, over 4293911.34 frames. ], batch size: 441, lr: 5.96e-03, grad_scale: 16.0
2023-06-21 06:24:59,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=900174.0, ans=0.0
2023-06-21 06:26:20,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0
2023-06-21 06:26:20,959 INFO [train.py:996] (3/4) Epoch 5, batch 28100, loss[loss=0.2752, simple_loss=0.3288, pruned_loss=0.1108, over 21938.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3221, pruned_loss=0.08899, over 4293336.70 frames. ], batch size: 103, lr: 5.96e-03, grad_scale: 16.0
2023-06-21 06:26:37,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=900474.0, ans=0.95
2023-06-21 06:26:58,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=900534.0, ans=0.2
2023-06-21 06:27:25,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=900654.0, ans=0.2
2023-06-21 06:27:32,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0
2023-06-21 06:27:41,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=900654.0, ans=0.125
2023-06-21 06:27:47,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.042e+02 3.636e+02 4.421e+02 1.163e+03, threshold=7.272e+02, percent-clipped=7.0
2023-06-21 06:27:58,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0
2023-06-21 06:28:07,011 INFO [train.py:996] (3/4) Epoch 5, batch 28150, loss[loss=0.2318, simple_loss=0.2863, pruned_loss=0.08862, over 14824.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3155, pruned_loss=0.08912, over 4276742.26 frames. ], batch size: 62, lr: 5.96e-03, grad_scale: 16.0
2023-06-21 06:28:25,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=900774.0, ans=0.125
2023-06-21 06:28:30,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0
2023-06-21 06:28:36,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=900834.0, ans=0.2
2023-06-21 06:28:46,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5
2023-06-21 06:29:32,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.41 vs. limit=10.0
2023-06-21 06:29:48,141 INFO [train.py:996] (3/4) Epoch 5, batch 28200, loss[loss=0.2562, simple_loss=0.3181, pruned_loss=0.09713, over 21692.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.314, pruned_loss=0.09056, over 4277200.25 frames. ], batch size: 351, lr: 5.96e-03, grad_scale: 16.0
2023-06-21 06:30:03,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0
2023-06-21 06:30:50,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=901254.0, ans=0.125
2023-06-21 06:31:14,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.109e+02 3.691e+02 4.482e+02 7.045e+02, threshold=7.382e+02, percent-clipped=0.0
2023-06-21 06:31:33,974 INFO [train.py:996] (3/4) Epoch 5, batch 28250, loss[loss=0.2689, simple_loss=0.3406, pruned_loss=0.09864, over 16164.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.318, pruned_loss=0.09341, over 4264213.31 frames. ], batch size: 60, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:31:56,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=901434.0, ans=0.0
2023-06-21 06:32:15,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=901494.0, ans=0.0
2023-06-21 06:32:16,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=901494.0, ans=0.125
2023-06-21 06:32:25,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5
2023-06-21 06:33:07,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=901614.0, ans=0.125
2023-06-21 06:33:15,362 INFO [train.py:996] (3/4) Epoch 5, batch 28300, loss[loss=0.2094, simple_loss=0.2933, pruned_loss=0.06275, over 21759.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3147, pruned_loss=0.08982, over 4257109.58 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:33:16,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0
2023-06-21 06:33:24,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=901674.0, ans=0.0
2023-06-21 06:33:44,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=901734.0, ans=0.125
2023-06-21 06:34:16,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=901854.0, ans=0.0
2023-06-21 06:34:18,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=901854.0, ans=0.125
2023-06-21 06:34:26,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0
2023-06-21 06:34:28,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=901854.0, ans=0.125
2023-06-21 06:34:41,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.744e+02 3.366e+02 4.135e+02 8.525e+02, threshold=6.731e+02, percent-clipped=3.0
2023-06-21 06:34:56,269 INFO [train.py:996] (3/4) Epoch 5, batch 28350, loss[loss=0.2359, simple_loss=0.2923, pruned_loss=0.08969, over 21283.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3101, pruned_loss=0.08413, over 4254939.63 frames. ], batch size: 160, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:35:06,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=901974.0, ans=0.2
2023-06-21 06:36:03,037 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 06:36:16,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=902154.0, ans=0.0
2023-06-21 06:36:18,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=902214.0, ans=0.125
2023-06-21 06:36:40,510 INFO [train.py:996] (3/4) Epoch 5, batch 28400, loss[loss=0.2763, simple_loss=0.34, pruned_loss=0.1063, over 21711.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3068, pruned_loss=0.08399, over 4260812.22 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 32.0
2023-06-21 06:37:28,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=902394.0, ans=0.125
2023-06-21 06:37:36,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=902394.0, ans=0.125
2023-06-21 06:38:03,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.069e+02 3.636e+02 4.494e+02 7.236e+02, threshold=7.272e+02, percent-clipped=3.0
2023-06-21 06:38:20,750 INFO [train.py:996] (3/4) Epoch 5, batch 28450, loss[loss=0.2856, simple_loss=0.3441, pruned_loss=0.1135, over 21771.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3143, pruned_loss=0.08883, over 4268619.37 frames. ], batch size: 441, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:38:27,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. limit=10.0
2023-06-21 06:38:44,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=902634.0, ans=0.125
2023-06-21 06:38:59,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=902694.0, ans=10.0
2023-06-21 06:39:56,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0
2023-06-21 06:39:59,731 INFO [train.py:996] (3/4) Epoch 5, batch 28500, loss[loss=0.2718, simple_loss=0.3352, pruned_loss=0.1042, over 21554.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3164, pruned_loss=0.09119, over 4277634.61 frames. ], batch size: 230, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:40:05,278 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 06:40:36,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0
2023-06-21 06:41:24,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=903114.0, ans=0.2
2023-06-21 06:41:28,907 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.871e+02 3.406e+02 3.870e+02 6.038e+02, threshold=6.812e+02, percent-clipped=0.0
2023-06-21 06:41:29,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=903114.0, ans=0.125
2023-06-21 06:41:41,933 INFO [train.py:996] (3/4) Epoch 5, batch 28550, loss[loss=0.2292, simple_loss=0.2993, pruned_loss=0.07955, over 21864.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3234, pruned_loss=0.09328, over 4282846.55 frames. ], batch size: 98, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:41:47,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=903174.0, ans=0.04949747468305833
2023-06-21 06:42:39,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=903294.0, ans=0.0
2023-06-21 06:43:24,800 INFO [train.py:996] (3/4) Epoch 5, batch 28600, loss[loss=0.2788, simple_loss=0.3513, pruned_loss=0.1031, over 21667.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3302, pruned_loss=0.09537, over 4278069.33 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:44:10,512 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 06:44:12,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=903594.0, ans=0.2
2023-06-21 06:44:28,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=903594.0, ans=0.2
2023-06-21 06:44:38,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=903654.0, ans=0.0
2023-06-21 06:44:54,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 2.937e+02 3.367e+02 4.039e+02 6.744e+02, threshold=6.734e+02, percent-clipped=0.0
2023-06-21 06:44:56,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=903714.0, ans=0.125
2023-06-21 06:45:12,136 INFO [train.py:996] (3/4) Epoch 5, batch 28650, loss[loss=0.2139, simple_loss=0.2732, pruned_loss=0.07725, over 21760.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3248, pruned_loss=0.09513, over 4269698.62 frames. ], batch size: 317, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:45:20,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=903774.0, ans=0.125
2023-06-21 06:45:21,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=903774.0, ans=0.0
2023-06-21 06:45:53,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=903834.0, ans=0.035
2023-06-21 06:46:06,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=903894.0, ans=0.05
2023-06-21 06:46:08,441 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 06:46:19,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=903954.0, ans=0.125
2023-06-21 06:46:21,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=903954.0, ans=0.2
2023-06-21 06:46:22,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=903954.0, ans=0.04949747468305833
2023-06-21 06:46:56,174 INFO [train.py:996] (3/4) Epoch 5, batch 28700, loss[loss=0.2678, simple_loss=0.329, pruned_loss=0.1032, over 21700.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3228, pruned_loss=0.09609, over 4264778.51 frames. ], batch size: 298, lr: 5.95e-03, grad_scale: 16.0
2023-06-21 06:46:58,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=904074.0, ans=0.0
2023-06-21 06:47:40,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=904194.0, ans=0.125
2023-06-21 06:47:54,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=904254.0, ans=0.125
2023-06-21 06:48:13,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.955e+02 3.205e+02 3.884e+02 6.833e+02, threshold=6.409e+02, percent-clipped=1.0
2023-06-21 06:48:14,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=904314.0, ans=0.125
2023-06-21 06:48:37,249 INFO [train.py:996] (3/4) Epoch 5, batch 28750, loss[loss=0.236, simple_loss=0.3159, pruned_loss=0.07809, over 21751.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3231, pruned_loss=0.09708, over 4272704.84 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 16.0
2023-06-21 06:48:56,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=904434.0, ans=0.0
2023-06-21 06:49:07,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0
2023-06-21 06:50:12,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0
2023-06-21 06:50:17,670 INFO [train.py:996] (3/4) Epoch 5, batch 28800, loss[loss=0.3232, simple_loss=0.3875, pruned_loss=0.1295, over 21205.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3276, pruned_loss=0.09772, over 4280147.27 frames. ], batch size: 143, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:50:28,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.16 vs. limit=10.0
2023-06-21 06:50:42,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5
2023-06-21 06:51:22,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. limit=5.0
2023-06-21 06:51:36,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=904914.0, ans=0.125
2023-06-21 06:51:45,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.918e+02 3.315e+02 4.122e+02 9.599e+02, threshold=6.630e+02, percent-clipped=10.0
2023-06-21 06:52:08,212 INFO [train.py:996] (3/4) Epoch 5, batch 28850, loss[loss=0.2581, simple_loss=0.3165, pruned_loss=0.09983, over 21838.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3289, pruned_loss=0.09922, over 4281813.05 frames. ], batch size: 124, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:52:11,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=904974.0, ans=0.125
2023-06-21 06:52:16,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904974.0, ans=0.1
2023-06-21 06:52:26,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=905034.0, ans=0.04949747468305833
2023-06-21 06:52:31,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=905034.0, ans=0.125
2023-06-21 06:52:41,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=905094.0, ans=0.2
2023-06-21 06:52:55,987 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 06:53:27,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=905214.0, ans=0.2
2023-06-21 06:53:27,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5
2023-06-21 06:53:48,819 INFO [train.py:996] (3/4) Epoch 5, batch 28900, loss[loss=0.2293, simple_loss=0.3, pruned_loss=0.07927, over 21787.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3325, pruned_loss=0.1009, over 4284783.76 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:54:34,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=905394.0, ans=15.0
2023-06-21 06:54:43,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=905454.0, ans=0.125
2023-06-21 06:54:45,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=905454.0, ans=0.0
2023-06-21 06:55:17,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.129e+02 3.502e+02 4.010e+02 6.253e+02, threshold=7.003e+02, percent-clipped=0.0
2023-06-21 06:55:28,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=905514.0, ans=0.0
2023-06-21 06:55:30,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=905574.0, ans=0.0
2023-06-21 06:55:31,401 INFO [train.py:996] (3/4) Epoch 5, batch 28950, loss[loss=0.2854, simple_loss=0.3702, pruned_loss=0.1003, over 21567.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3317, pruned_loss=0.09931, over 4276990.81 frames. ], batch size: 471, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:55:51,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=905634.0, ans=0.125
2023-06-21 06:55:51,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0
2023-06-21 06:56:53,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=905754.0, ans=0.125
2023-06-21 06:56:54,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=905814.0, ans=0.125
2023-06-21 06:56:58,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=905814.0, ans=0.1
2023-06-21 06:57:09,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=905814.0, ans=0.0
2023-06-21 06:57:12,961 INFO [train.py:996] (3/4) Epoch 5, batch 29000, loss[loss=0.2676, simple_loss=0.3393, pruned_loss=0.09791, over 21792.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3352, pruned_loss=0.09852, over 4274423.87 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:57:29,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=905934.0, ans=0.0
2023-06-21 06:58:39,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.974e+02 3.473e+02 4.065e+02 5.758e+02, threshold=6.947e+02, percent-clipped=0.0
2023-06-21 06:58:52,018 INFO [train.py:996] (3/4) Epoch 5, batch 29050, loss[loss=0.2568, simple_loss=0.3135, pruned_loss=0.1001, over 21589.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.334, pruned_loss=0.09985, over 4281078.98 frames. ], batch size: 548, lr: 5.94e-03, grad_scale: 32.0
2023-06-21 06:59:58,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=906354.0, ans=0.125
2023-06-21 07:00:16,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=906414.0, ans=0.0
2023-06-21 07:00:31,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0
2023-06-21 07:00:32,307 INFO [train.py:996] (3/4) Epoch 5, batch 29100, loss[loss=0.2381, simple_loss=0.3526, pruned_loss=0.06179, over 19983.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.325, pruned_loss=0.09719, over 4287790.52 frames.
], batch size: 702, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:00:37,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=906474.0, ans=10.0 2023-06-21 07:00:40,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=906474.0, ans=0.1 2023-06-21 07:01:23,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=906594.0, ans=0.0 2023-06-21 07:01:34,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=906594.0, ans=0.125 2023-06-21 07:02:00,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.771e+02 3.124e+02 3.774e+02 6.095e+02, threshold=6.248e+02, percent-clipped=0.0 2023-06-21 07:02:08,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=906714.0, ans=0.125 2023-06-21 07:02:13,076 INFO [train.py:996] (3/4) Epoch 5, batch 29150, loss[loss=0.284, simple_loss=0.3464, pruned_loss=0.1108, over 20065.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3226, pruned_loss=0.09474, over 4290027.95 frames. ], batch size: 707, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:02:30,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=906774.0, ans=0.125 2023-06-21 07:03:06,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=906894.0, ans=0.125 2023-06-21 07:03:22,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=906954.0, ans=0.1 2023-06-21 07:03:40,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=907014.0, ans=0.125 2023-06-21 07:03:40,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=907014.0, ans=0.125 2023-06-21 07:03:53,178 INFO [train.py:996] (3/4) Epoch 5, batch 29200, loss[loss=0.1924, simple_loss=0.2582, pruned_loss=0.06326, over 21549.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3184, pruned_loss=0.09367, over 4285444.07 frames. ], batch size: 263, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:05:04,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=907254.0, ans=0.0 2023-06-21 07:05:19,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907314.0, ans=0.125 2023-06-21 07:05:22,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.832e+02 3.192e+02 3.760e+02 6.246e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-21 07:05:40,658 INFO [train.py:996] (3/4) Epoch 5, batch 29250, loss[loss=0.2468, simple_loss=0.3214, pruned_loss=0.0861, over 21602.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3169, pruned_loss=0.09114, over 4284616.94 frames. 
], batch size: 230, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:06:00,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=907434.0, ans=0.0 2023-06-21 07:06:12,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=907434.0, ans=0.035 2023-06-21 07:06:40,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=907494.0, ans=15.0 2023-06-21 07:06:45,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-06-21 07:07:01,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=907614.0, ans=0.125 2023-06-21 07:07:12,336 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:07:21,316 INFO [train.py:996] (3/4) Epoch 5, batch 29300, loss[loss=0.2342, simple_loss=0.299, pruned_loss=0.0847, over 21694.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3178, pruned_loss=0.08951, over 4280925.08 frames. ], batch size: 282, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:07:53,011 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:07:57,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=907734.0, ans=0.125 2023-06-21 07:08:10,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907794.0, ans=0.1 2023-06-21 07:08:29,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-21 07:08:46,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.808e+02 3.408e+02 4.024e+02 6.878e+02, threshold=6.816e+02, percent-clipped=1.0 2023-06-21 07:09:05,110 INFO [train.py:996] (3/4) Epoch 5, batch 29350, loss[loss=0.2456, simple_loss=0.3332, pruned_loss=0.07905, over 21825.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3143, pruned_loss=0.08837, over 4279605.62 frames. ], batch size: 317, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:09:57,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-21 07:10:52,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=908274.0, ans=0.125 2023-06-21 07:10:53,128 INFO [train.py:996] (3/4) Epoch 5, batch 29400, loss[loss=0.1845, simple_loss=0.2476, pruned_loss=0.06074, over 21529.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3128, pruned_loss=0.086, over 4269890.39 frames. 
], batch size: 195, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:11:34,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=908394.0, ans=0.0 2023-06-21 07:12:05,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=908454.0, ans=0.0 2023-06-21 07:12:26,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.910e+02 3.307e+02 3.988e+02 6.309e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-21 07:12:26,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=908514.0, ans=0.07 2023-06-21 07:12:37,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=908514.0, ans=0.125 2023-06-21 07:12:37,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=908514.0, ans=0.1 2023-06-21 07:12:41,817 INFO [train.py:996] (3/4) Epoch 5, batch 29450, loss[loss=0.3591, simple_loss=0.4046, pruned_loss=0.1568, over 21350.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3128, pruned_loss=0.08597, over 4273327.58 frames. ], batch size: 507, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:12:59,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=908574.0, ans=0.125 2023-06-21 07:13:08,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-21 07:13:23,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=908694.0, ans=6.0 2023-06-21 07:13:53,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-21 07:14:21,516 INFO [train.py:996] (3/4) Epoch 5, batch 29500, loss[loss=0.2796, simple_loss=0.3323, pruned_loss=0.1134, over 21483.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3179, pruned_loss=0.08965, over 4271704.56 frames. ], batch size: 131, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:15:21,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=909054.0, ans=0.125 2023-06-21 07:15:21,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. limit=10.0 2023-06-21 07:15:45,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 2.991e+02 3.477e+02 4.425e+02 6.921e+02, threshold=6.954e+02, percent-clipped=2.0 2023-06-21 07:15:46,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-21 07:15:51,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909114.0, ans=0.1 2023-06-21 07:15:56,186 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:15:57,589 INFO [train.py:996] (3/4) Epoch 5, batch 29550, loss[loss=0.2455, simple_loss=0.3026, pruned_loss=0.09423, over 21619.00 frames. 
], tot_loss[loss=0.2499, simple_loss=0.317, pruned_loss=0.09139, over 4280506.55 frames. ], batch size: 212, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:16:20,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-21 07:17:22,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=909414.0, ans=0.2 2023-06-21 07:17:43,933 INFO [train.py:996] (3/4) Epoch 5, batch 29600, loss[loss=0.2854, simple_loss=0.3889, pruned_loss=0.09094, over 20805.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3232, pruned_loss=0.09369, over 4276878.20 frames. ], batch size: 608, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:17:55,806 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:18:17,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=909534.0, ans=0.07 2023-06-21 07:18:20,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=909594.0, ans=0.125 2023-06-21 07:18:34,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=909594.0, ans=0.0 2023-06-21 07:18:36,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=909594.0, ans=0.125 2023-06-21 07:18:47,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=909654.0, ans=0.95 2023-06-21 07:19:08,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.750e+02 3.191e+02 3.922e+02 6.335e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-21 07:19:09,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-21 07:19:10,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=909714.0, ans=0.09899494936611666 2023-06-21 07:19:28,007 INFO [train.py:996] (3/4) Epoch 5, batch 29650, loss[loss=0.2043, simple_loss=0.2716, pruned_loss=0.06852, over 21567.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3202, pruned_loss=0.08981, over 4283074.03 frames. ], batch size: 195, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:19:28,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909774.0, ans=0.1 2023-06-21 07:19:50,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-21 07:19:55,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=909834.0, ans=0.125 2023-06-21 07:20:17,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=909894.0, ans=0.025 2023-06-21 07:20:23,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. 
limit=22.5 2023-06-21 07:20:35,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=909954.0, ans=0.0 2023-06-21 07:21:09,358 INFO [train.py:996] (3/4) Epoch 5, batch 29700, loss[loss=0.2681, simple_loss=0.3836, pruned_loss=0.07631, over 20913.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3234, pruned_loss=0.09099, over 4290476.50 frames. ], batch size: 607, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:21:28,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-21 07:21:37,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=910134.0, ans=0.125 2023-06-21 07:21:39,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=910134.0, ans=0.0 2023-06-21 07:22:24,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=910314.0, ans=0.0 2023-06-21 07:22:29,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.944e+02 3.350e+02 4.483e+02 9.156e+02, threshold=6.700e+02, percent-clipped=7.0 2023-06-21 07:22:45,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-21 07:22:50,069 INFO [train.py:996] (3/4) Epoch 5, batch 29750, loss[loss=0.2353, simple_loss=0.3218, pruned_loss=0.07445, over 21635.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3292, pruned_loss=0.09091, over 4283243.48 frames. ], batch size: 230, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:23:13,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=910434.0, ans=0.125 2023-06-21 07:23:13,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=910434.0, ans=0.125 2023-06-21 07:23:33,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=910494.0, ans=0.125 2023-06-21 07:24:12,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=910614.0, ans=0.0 2023-06-21 07:24:24,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-21 07:24:31,246 INFO [train.py:996] (3/4) Epoch 5, batch 29800, loss[loss=0.2489, simple_loss=0.318, pruned_loss=0.08989, over 21855.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3323, pruned_loss=0.09315, over 4292310.33 frames. ], batch size: 414, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:25:05,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-21 07:25:23,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-21 07:25:31,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. 
limit=15.0 2023-06-21 07:25:32,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=910854.0, ans=0.0 2023-06-21 07:25:46,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=910914.0, ans=22.5 2023-06-21 07:25:56,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.642e+02 3.044e+02 3.598e+02 6.041e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 07:26:11,947 INFO [train.py:996] (3/4) Epoch 5, batch 29850, loss[loss=0.3011, simple_loss=0.3523, pruned_loss=0.125, over 21934.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.328, pruned_loss=0.09033, over 4286310.12 frames. ], batch size: 107, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:26:22,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-21 07:26:25,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=910974.0, ans=0.2 2023-06-21 07:26:41,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-21 07:27:20,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=911154.0, ans=0.125 2023-06-21 07:27:53,654 INFO [train.py:996] (3/4) Epoch 5, batch 29900, loss[loss=0.2346, simple_loss=0.3001, pruned_loss=0.08456, over 21498.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3265, pruned_loss=0.09149, over 4289574.06 frames. ], batch size: 211, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:28:05,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=911274.0, ans=0.02 2023-06-21 07:28:18,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=911334.0, ans=0.95 2023-06-21 07:29:09,834 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:29:22,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.918e+02 3.287e+02 3.972e+02 7.501e+02, threshold=6.574e+02, percent-clipped=2.0 2023-06-21 07:29:34,452 INFO [train.py:996] (3/4) Epoch 5, batch 29950, loss[loss=0.311, simple_loss=0.3717, pruned_loss=0.1252, over 21572.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.331, pruned_loss=0.09558, over 4288116.63 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:29:40,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-21 07:30:05,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2023-06-21 07:30:54,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=911754.0, ans=0.125 2023-06-21 07:30:56,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=911754.0, ans=0.2 2023-06-21 07:30:58,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=911814.0, ans=0.2 2023-06-21 07:31:10,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=911814.0, ans=0.125 2023-06-21 07:31:16,737 INFO [train.py:996] (3/4) Epoch 5, batch 30000, loss[loss=0.2823, simple_loss=0.3651, pruned_loss=0.09974, over 21727.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3336, pruned_loss=0.09593, over 4288132.50 frames. ], batch size: 441, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:31:16,737 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 07:31:28,463 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.9270, 4.9238, 4.5784, 4.5928], device='cuda:3') 2023-06-21 07:31:38,132 INFO [train.py:1028] (3/4) Epoch 5, validation: loss=0.2485, simple_loss=0.3493, pruned_loss=0.0739, over 1796401.00 frames. 2023-06-21 07:31:38,133 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 07:32:55,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-21 07:33:13,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=912114.0, ans=0.125 2023-06-21 07:33:14,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-21 07:33:14,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.033e+02 3.680e+02 4.795e+02 8.556e+02, threshold=7.360e+02, percent-clipped=8.0 2023-06-21 07:33:36,612 INFO [train.py:996] (3/4) Epoch 5, batch 30050, loss[loss=0.2657, simple_loss=0.3907, pruned_loss=0.07035, over 20743.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3348, pruned_loss=0.09212, over 4278980.83 frames. ], batch size: 607, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:33:55,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-21 07:34:11,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-21 07:34:12,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=912294.0, ans=0.125 2023-06-21 07:34:19,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=912294.0, ans=0.07 2023-06-21 07:34:54,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. 
limit=15.0 2023-06-21 07:35:04,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=912414.0, ans=0.125 2023-06-21 07:35:15,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=912474.0, ans=0.2 2023-06-21 07:35:16,533 INFO [train.py:996] (3/4) Epoch 5, batch 30100, loss[loss=0.2353, simple_loss=0.3104, pruned_loss=0.08011, over 21486.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3328, pruned_loss=0.0924, over 4271826.21 frames. ], batch size: 389, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:35:47,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=912534.0, ans=0.0 2023-06-21 07:35:57,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=912594.0, ans=0.04949747468305833 2023-06-21 07:36:30,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-21 07:36:40,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.174e+02 3.797e+02 4.451e+02 9.370e+02, threshold=7.593e+02, percent-clipped=1.0 2023-06-21 07:37:02,690 INFO [train.py:996] (3/4) Epoch 5, batch 30150, loss[loss=0.3397, simple_loss=0.3796, pruned_loss=0.1499, over 21418.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3285, pruned_loss=0.09308, over 4274278.63 frames. ], batch size: 510, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:37:29,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=912834.0, ans=0.125 2023-06-21 07:37:34,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=912834.0, ans=0.0 2023-06-21 07:37:44,103 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:38:45,872 INFO [train.py:996] (3/4) Epoch 5, batch 30200, loss[loss=0.2395, simple_loss=0.3262, pruned_loss=0.07642, over 21781.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3292, pruned_loss=0.09121, over 4279020.42 frames. ], batch size: 282, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:39:49,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=913254.0, ans=0.0 2023-06-21 07:40:17,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.970e+02 3.553e+02 4.438e+02 6.781e+02, threshold=7.107e+02, percent-clipped=0.0 2023-06-21 07:40:28,569 INFO [train.py:996] (3/4) Epoch 5, batch 30250, loss[loss=0.2557, simple_loss=0.3497, pruned_loss=0.0809, over 21390.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3362, pruned_loss=0.09314, over 4278336.45 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:41:32,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-21 07:41:48,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=913554.0, ans=0.0 2023-06-21 07:42:08,260 INFO [train.py:996] (3/4) Epoch 5, batch 30300, loss[loss=0.2298, simple_loss=0.2866, pruned_loss=0.0865, over 21750.00 frames. 
], tot_loss[loss=0.2622, simple_loss=0.3356, pruned_loss=0.09446, over 4278035.45 frames. ], batch size: 112, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:42:22,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=913674.0, ans=0.0 2023-06-21 07:42:22,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=913674.0, ans=0.125 2023-06-21 07:42:25,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=913674.0, ans=0.125 2023-06-21 07:42:41,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=913734.0, ans=0.0 2023-06-21 07:42:41,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=913734.0, ans=0.125 2023-06-21 07:42:48,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=913734.0, ans=0.125 2023-06-21 07:43:04,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=913794.0, ans=0.125 2023-06-21 07:43:10,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-21 07:43:16,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-21 07:43:21,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=913854.0, ans=0.0 2023-06-21 07:43:24,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=913854.0, ans=0.125 2023-06-21 07:43:28,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-21 07:43:34,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.375e+02 4.066e+02 5.117e+02 7.478e+02, threshold=8.132e+02, percent-clipped=2.0 2023-06-21 07:43:51,172 INFO [train.py:996] (3/4) Epoch 5, batch 30350, loss[loss=0.2807, simple_loss=0.3478, pruned_loss=0.1068, over 21687.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3382, pruned_loss=0.09681, over 4283375.17 frames. ], batch size: 298, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:44:08,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=913974.0, ans=0.0 2023-06-21 07:44:17,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914034.0, ans=0.1 2023-06-21 07:45:20,103 INFO [train.py:996] (3/4) Epoch 5, batch 30400, loss[loss=0.2362, simple_loss=0.2971, pruned_loss=0.08763, over 20215.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3331, pruned_loss=0.09448, over 4274826.53 frames. 
], batch size: 703, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:45:29,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=914274.0, ans=0.125 2023-06-21 07:45:42,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=914334.0, ans=0.125 2023-06-21 07:45:47,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914334.0, ans=0.125 2023-06-21 07:46:21,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=914454.0, ans=0.125 2023-06-21 07:46:35,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.783e+02 4.866e+02 6.156e+02 1.756e+03, threshold=9.731e+02, percent-clipped=9.0 2023-06-21 07:46:46,022 INFO [train.py:996] (3/4) Epoch 5, batch 30450, loss[loss=0.3755, simple_loss=0.4707, pruned_loss=0.1401, over 19697.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3349, pruned_loss=0.09422, over 4211790.49 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:46:52,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=914574.0, ans=0.0 2023-06-21 07:47:06,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=914634.0, ans=0.125 2023-06-21 07:47:20,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=914694.0, ans=0.125 2023-06-21 07:47:24,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=914694.0, ans=0.0 2023-06-21 07:47:37,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=914754.0, ans=0.0 2023-06-21 07:47:39,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=914754.0, ans=0.2 2023-06-21 07:49:37,060 INFO [train.py:996] (3/4) Epoch 6, batch 0, loss[loss=0.2395, simple_loss=0.2976, pruned_loss=0.09068, over 21732.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.2976, pruned_loss=0.09068, over 21732.00 frames. ], batch size: 317, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:49:37,060 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 07:49:52,706 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2457, simple_loss=0.3531, pruned_loss=0.06922, over 1796401.00 frames. 2023-06-21 07:49:52,706 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 07:50:31,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=914898.0, ans=0.0 2023-06-21 07:51:27,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.626e+02 5.784e+02 9.951e+02 2.861e+03, threshold=1.157e+03, percent-clipped=26.0 2023-06-21 07:51:28,592 INFO [train.py:996] (3/4) Epoch 6, batch 50, loss[loss=0.3208, simple_loss=0.3898, pruned_loss=0.1259, over 21482.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3464, pruned_loss=0.1004, over 970624.88 frames. 
], batch size: 471, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:52:01,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=915198.0, ans=0.07 2023-06-21 07:52:31,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-21 07:52:32,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-21 07:52:56,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915378.0, ans=0.1 2023-06-21 07:53:05,783 INFO [train.py:996] (3/4) Epoch 6, batch 100, loss[loss=0.2731, simple_loss=0.3487, pruned_loss=0.09874, over 21510.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3512, pruned_loss=0.09756, over 1713400.33 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:53:09,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=915438.0, ans=0.125 2023-06-21 07:53:16,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-21 07:53:54,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=915558.0, ans=0.07 2023-06-21 07:54:02,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=915558.0, ans=0.2 2023-06-21 07:54:41,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.736e+02 3.116e+02 3.564e+02 7.052e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-21 07:54:42,886 INFO [train.py:996] (3/4) Epoch 6, batch 150, loss[loss=0.2882, simple_loss=0.3579, pruned_loss=0.1093, over 21791.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.352, pruned_loss=0.09729, over 2276076.35 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:56:19,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=915978.0, ans=0.125 2023-06-21 07:56:22,161 INFO [train.py:996] (3/4) Epoch 6, batch 200, loss[loss=0.2684, simple_loss=0.3731, pruned_loss=0.08183, over 21209.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3463, pruned_loss=0.09468, over 2716151.43 frames. ], batch size: 548, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:56:44,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-21 07:57:31,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-21 07:58:01,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.028e+02 3.538e+02 4.112e+02 1.174e+03, threshold=7.076e+02, percent-clipped=8.0 2023-06-21 07:58:01,371 INFO [train.py:996] (3/4) Epoch 6, batch 250, loss[loss=0.237, simple_loss=0.33, pruned_loss=0.07198, over 21666.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3426, pruned_loss=0.09396, over 3059453.49 frames. 
], batch size: 414, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:58:15,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=916398.0, ans=0.2 2023-06-21 07:58:22,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916398.0, ans=0.1 2023-06-21 07:58:57,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=916458.0, ans=0.07 2023-06-21 07:59:00,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=916458.0, ans=0.125 2023-06-21 07:59:35,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=916578.0, ans=0.125 2023-06-21 07:59:39,954 INFO [train.py:996] (3/4) Epoch 6, batch 300, loss[loss=0.3092, simple_loss=0.367, pruned_loss=0.1258, over 21475.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3375, pruned_loss=0.09393, over 3320464.06 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:00:03,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-21 08:00:11,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-21 08:00:36,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=916758.0, ans=0.125 2023-06-21 08:00:56,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=916818.0, ans=0.125 2023-06-21 08:01:00,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-21 08:01:20,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.010e+02 3.563e+02 4.495e+02 6.815e+02, threshold=7.126e+02, percent-clipped=0.0 2023-06-21 08:01:20,541 INFO [train.py:996] (3/4) Epoch 6, batch 350, loss[loss=0.2831, simple_loss=0.3446, pruned_loss=0.1108, over 20004.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3292, pruned_loss=0.09246, over 3532525.76 frames. ], batch size: 703, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:01:41,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-21 08:02:01,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-21 08:02:01,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-21 08:02:08,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917058.0, ans=0.125 2023-06-21 08:02:12,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. 
limit=15.0 2023-06-21 08:02:15,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=917058.0, ans=15.0 2023-06-21 08:02:58,464 INFO [train.py:996] (3/4) Epoch 6, batch 400, loss[loss=0.2299, simple_loss=0.3339, pruned_loss=0.06294, over 21821.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3218, pruned_loss=0.09038, over 3683666.09 frames. ], batch size: 316, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:03:11,595 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:03:39,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=917298.0, ans=0.0 2023-06-21 08:04:10,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-21 08:04:15,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-21 08:04:27,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=917478.0, ans=0.125 2023-06-21 08:04:32,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=917478.0, ans=0.125 2023-06-21 08:04:33,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=917478.0, ans=0.0 2023-06-21 08:04:36,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.827e+02 3.421e+02 4.074e+02 6.754e+02, threshold=6.843e+02, percent-clipped=0.0 2023-06-21 08:04:36,511 INFO [train.py:996] (3/4) Epoch 6, batch 450, loss[loss=0.2078, simple_loss=0.3068, pruned_loss=0.05438, over 21782.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3174, pruned_loss=0.08839, over 3818790.43 frames. ], batch size: 371, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:04:44,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=917538.0, ans=0.0 2023-06-21 08:04:44,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=917538.0, ans=0.125 2023-06-21 08:04:53,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=917538.0, ans=0.125 2023-06-21 08:05:26,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=917658.0, ans=0.125 2023-06-21 08:05:26,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=917658.0, ans=0.1 2023-06-21 08:05:51,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917718.0, ans=0.125 2023-06-21 08:06:18,061 INFO [train.py:996] (3/4) Epoch 6, batch 500, loss[loss=0.2391, simple_loss=0.307, pruned_loss=0.08561, over 21780.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3172, pruned_loss=0.0868, over 3920754.20 frames. 
], batch size: 247, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:06:45,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=917898.0, ans=0.125 2023-06-21 08:07:00,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-21 08:07:19,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=918018.0, ans=0.125 2023-06-21 08:07:27,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=918018.0, ans=0.0 2023-06-21 08:07:45,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=918078.0, ans=0.5 2023-06-21 08:07:51,191 INFO [train.py:996] (3/4) Epoch 6, batch 550, loss[loss=0.2191, simple_loss=0.2914, pruned_loss=0.07338, over 21468.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3224, pruned_loss=0.08697, over 3987439.19 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:07:57,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.995e+02 3.563e+02 4.699e+02 8.861e+02, threshold=7.125e+02, percent-clipped=10.0 2023-06-21 08:08:18,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=918198.0, ans=0.2 2023-06-21 08:08:19,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=918198.0, ans=0.125 2023-06-21 08:09:29,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-21 08:09:31,310 INFO [train.py:996] (3/4) Epoch 6, batch 600, loss[loss=0.1933, simple_loss=0.3054, pruned_loss=0.04061, over 20809.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3254, pruned_loss=0.08716, over 4046674.24 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:10:21,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=918558.0, ans=0.0 2023-06-21 08:10:35,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-21 08:11:09,680 INFO [train.py:996] (3/4) Epoch 6, batch 650, loss[loss=0.2237, simple_loss=0.2966, pruned_loss=0.0754, over 21857.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3277, pruned_loss=0.08807, over 4102907.80 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:11:11,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.881e+02 3.396e+02 3.907e+02 7.469e+02, threshold=6.792e+02, percent-clipped=1.0 2023-06-21 08:12:26,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=918978.0, ans=0.125 2023-06-21 08:12:42,308 INFO [train.py:996] (3/4) Epoch 6, batch 700, loss[loss=0.2563, simple_loss=0.3298, pruned_loss=0.09136, over 21771.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3279, pruned_loss=0.0896, over 4145983.20 frames. 
], batch size: 112, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:13:14,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919098.0, ans=0.125 2023-06-21 08:13:27,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-21 08:13:31,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=919158.0, ans=0.125 2023-06-21 08:14:20,546 INFO [train.py:996] (3/4) Epoch 6, batch 750, loss[loss=0.1947, simple_loss=0.2643, pruned_loss=0.06256, over 21360.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3288, pruned_loss=0.09049, over 4181584.95 frames. ], batch size: 194, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:14:26,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.276e+02 4.088e+02 4.962e+02 1.159e+03, threshold=8.176e+02, percent-clipped=5.0 2023-06-21 08:15:10,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=919458.0, ans=0.125 2023-06-21 08:15:19,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919458.0, ans=0.1 2023-06-21 08:15:58,071 INFO [train.py:996] (3/4) Epoch 6, batch 800, loss[loss=0.2347, simple_loss=0.3016, pruned_loss=0.08394, over 21947.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3262, pruned_loss=0.09013, over 4201527.98 frames. ], batch size: 333, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:16:12,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=919638.0, ans=0.125 2023-06-21 08:16:27,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919698.0, ans=0.1 2023-06-21 08:16:32,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919698.0, ans=0.125 2023-06-21 08:17:06,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=919818.0, ans=0.0 2023-06-21 08:17:27,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=919878.0, ans=0.0 2023-06-21 08:17:31,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=919878.0, ans=0.125 2023-06-21 08:17:34,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=919878.0, ans=0.125 2023-06-21 08:17:38,689 INFO [train.py:996] (3/4) Epoch 6, batch 850, loss[loss=0.2043, simple_loss=0.2876, pruned_loss=0.06044, over 21240.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3231, pruned_loss=0.09021, over 4226226.13 frames. ], batch size: 144, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:17:40,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 2.947e+02 3.491e+02 3.933e+02 7.622e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-21 08:18:17,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. 
limit=15.0 2023-06-21 08:18:45,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-21 08:19:21,803 INFO [train.py:996] (3/4) Epoch 6, batch 900, loss[loss=0.2144, simple_loss=0.2798, pruned_loss=0.07453, over 21760.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3201, pruned_loss=0.08864, over 4230153.65 frames. ], batch size: 124, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:19:35,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=920238.0, ans=0.0 2023-06-21 08:19:55,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.91 vs. limit=22.5 2023-06-21 08:21:05,282 INFO [train.py:996] (3/4) Epoch 6, batch 950, loss[loss=0.2261, simple_loss=0.3159, pruned_loss=0.06818, over 21762.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3181, pruned_loss=0.08863, over 4240159.47 frames. ], batch size: 247, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:21:06,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.884e+02 3.289e+02 4.152e+02 6.570e+02, threshold=6.579e+02, percent-clipped=0.0 2023-06-21 08:22:03,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=920718.0, ans=0.125 2023-06-21 08:22:13,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=920778.0, ans=0.125 2023-06-21 08:22:39,420 INFO [train.py:996] (3/4) Epoch 6, batch 1000, loss[loss=0.2693, simple_loss=0.3579, pruned_loss=0.09031, over 21717.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3182, pruned_loss=0.08922, over 4257911.67 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:23:38,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-21 08:23:49,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-21 08:23:50,480 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:24:09,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=921078.0, ans=0.125 2023-06-21 08:24:11,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-21 08:24:13,741 INFO [train.py:996] (3/4) Epoch 6, batch 1050, loss[loss=0.3361, simple_loss=0.3791, pruned_loss=0.1465, over 21456.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3199, pruned_loss=0.09087, over 4260804.40 frames. 
], batch size: 507, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:24:15,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.022e+02 3.396e+02 3.710e+02 5.985e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-21 08:24:22,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=921138.0, ans=0.025 2023-06-21 08:24:39,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=921198.0, ans=0.0 2023-06-21 08:25:21,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921378.0, ans=0.1 2023-06-21 08:25:48,836 INFO [train.py:996] (3/4) Epoch 6, batch 1100, loss[loss=0.3209, simple_loss=0.3818, pruned_loss=0.13, over 21577.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.32, pruned_loss=0.08957, over 4267571.42 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:26:25,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-21 08:27:25,422 INFO [train.py:996] (3/4) Epoch 6, batch 1150, loss[loss=0.2692, simple_loss=0.3391, pruned_loss=0.09962, over 21532.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.32, pruned_loss=0.0883, over 4274775.76 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:27:28,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.136e+02 3.809e+02 5.209e+02 8.344e+02, threshold=7.619e+02, percent-clipped=5.0 2023-06-21 08:27:48,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-21 08:28:22,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=921918.0, ans=0.07 2023-06-21 08:29:05,568 INFO [train.py:996] (3/4) Epoch 6, batch 1200, loss[loss=0.2391, simple_loss=0.2785, pruned_loss=0.09987, over 19970.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3194, pruned_loss=0.08778, over 4277739.37 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:30:11,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=922218.0, ans=0.2 2023-06-21 08:30:44,977 INFO [train.py:996] (3/4) Epoch 6, batch 1250, loss[loss=0.2712, simple_loss=0.332, pruned_loss=0.1052, over 21273.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3217, pruned_loss=0.09022, over 4276826.97 frames. 
], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:30:47,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.807e+02 3.082e+02 3.703e+02 6.160e+02, threshold=6.164e+02, percent-clipped=0.0 2023-06-21 08:30:55,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=922338.0, ans=0.2 2023-06-21 08:30:56,847 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:31:03,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=922338.0, ans=0.95 2023-06-21 08:31:17,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.20 vs. limit=10.0 2023-06-21 08:31:59,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-21 08:32:23,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-21 08:32:23,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.54 vs. limit=5.0 2023-06-21 08:32:25,622 INFO [train.py:996] (3/4) Epoch 6, batch 1300, loss[loss=0.287, simple_loss=0.3705, pruned_loss=0.1018, over 21771.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.324, pruned_loss=0.09127, over 4277694.80 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:32:39,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922638.0, ans=0.1 2023-06-21 08:32:39,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=922638.0, ans=0.0 2023-06-21 08:32:46,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=922698.0, ans=0.125 2023-06-21 08:32:50,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=922698.0, ans=0.1 2023-06-21 08:33:02,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922758.0, ans=0.125 2023-06-21 08:33:18,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=15.0 2023-06-21 08:33:46,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=922818.0, ans=22.5 2023-06-21 08:34:12,137 INFO [train.py:996] (3/4) Epoch 6, batch 1350, loss[loss=0.2725, simple_loss=0.3596, pruned_loss=0.09268, over 21473.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3237, pruned_loss=0.0908, over 4286700.39 frames. 
], batch size: 471, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:34:15,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.950e+02 3.402e+02 4.327e+02 7.422e+02, threshold=6.804e+02, percent-clipped=3.0 2023-06-21 08:34:23,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=922938.0, ans=0.125 2023-06-21 08:34:36,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=922998.0, ans=0.125 2023-06-21 08:34:38,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-21 08:34:54,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=923058.0, ans=0.025 2023-06-21 08:35:38,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=923178.0, ans=0.125 2023-06-21 08:35:47,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-21 08:35:48,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=923178.0, ans=0.0 2023-06-21 08:35:50,712 INFO [train.py:996] (3/4) Epoch 6, batch 1400, loss[loss=0.303, simple_loss=0.376, pruned_loss=0.1149, over 21653.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3218, pruned_loss=0.08969, over 4284493.76 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:36:04,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=923238.0, ans=0.125 2023-06-21 08:36:16,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=923298.0, ans=0.125 2023-06-21 08:36:47,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-21 08:37:31,182 INFO [train.py:996] (3/4) Epoch 6, batch 1450, loss[loss=0.2219, simple_loss=0.3075, pruned_loss=0.06817, over 21702.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3226, pruned_loss=0.09079, over 4278880.67 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:37:37,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.855e+02 3.384e+02 3.937e+02 6.877e+02, threshold=6.768e+02, percent-clipped=1.0 2023-06-21 08:37:39,456 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:37:55,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=923598.0, ans=0.125 2023-06-21 08:38:05,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=923598.0, ans=0.125 2023-06-21 08:38:07,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=923658.0, ans=0.125 2023-06-21 08:38:40,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. 
limit=6.0 2023-06-21 08:38:44,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=923718.0, ans=0.2 2023-06-21 08:39:05,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=923778.0, ans=0.035 2023-06-21 08:39:11,666 INFO [train.py:996] (3/4) Epoch 6, batch 1500, loss[loss=0.2523, simple_loss=0.3116, pruned_loss=0.09647, over 21811.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3248, pruned_loss=0.09245, over 4285818.94 frames. ], batch size: 282, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:39:46,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=923898.0, ans=0.125 2023-06-21 08:40:53,815 INFO [train.py:996] (3/4) Epoch 6, batch 1550, loss[loss=0.1671, simple_loss=0.2194, pruned_loss=0.05744, over 17243.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3229, pruned_loss=0.09093, over 4288087.40 frames. ], batch size: 62, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:41:00,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.808e+02 3.171e+02 3.740e+02 6.860e+02, threshold=6.342e+02, percent-clipped=1.0 2023-06-21 08:41:33,989 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:42:03,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=924318.0, ans=0.05 2023-06-21 08:42:26,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=924378.0, ans=0.125 2023-06-21 08:42:36,160 INFO [train.py:996] (3/4) Epoch 6, batch 1600, loss[loss=0.3459, simple_loss=0.3997, pruned_loss=0.146, over 21433.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3225, pruned_loss=0.09036, over 4283101.87 frames. ], batch size: 507, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:43:20,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-21 08:44:19,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=924678.0, ans=0.0 2023-06-21 08:44:19,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=924678.0, ans=0.125 2023-06-21 08:44:21,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.74 vs. limit=22.5 2023-06-21 08:44:25,203 INFO [train.py:996] (3/4) Epoch 6, batch 1650, loss[loss=0.2101, simple_loss=0.268, pruned_loss=0.07609, over 21448.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3207, pruned_loss=0.08976, over 4279341.49 frames. 
], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:44:31,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.183e+02 3.962e+02 4.475e+02 7.912e+02, threshold=7.925e+02, percent-clipped=6.0 2023-06-21 08:44:43,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924738.0, ans=0.125 2023-06-21 08:45:34,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=924918.0, ans=0.125 2023-06-21 08:46:07,214 INFO [train.py:996] (3/4) Epoch 6, batch 1700, loss[loss=0.2646, simple_loss=0.3253, pruned_loss=0.102, over 21682.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3246, pruned_loss=0.09111, over 4283075.78 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:47:06,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=925158.0, ans=0.0 2023-06-21 08:47:25,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=925278.0, ans=0.125 2023-06-21 08:47:40,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=925278.0, ans=0.2 2023-06-21 08:47:51,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=925278.0, ans=0.125 2023-06-21 08:47:54,707 INFO [train.py:996] (3/4) Epoch 6, batch 1750, loss[loss=0.222, simple_loss=0.2966, pruned_loss=0.07376, over 21474.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3218, pruned_loss=0.08799, over 4265967.45 frames. ], batch size: 211, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:48:05,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.133e+02 3.705e+02 4.363e+02 7.096e+02, threshold=7.410e+02, percent-clipped=0.0 2023-06-21 08:48:07,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=925338.0, ans=0.125 2023-06-21 08:48:12,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925338.0, ans=0.1 2023-06-21 08:48:21,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=925398.0, ans=0.125 2023-06-21 08:48:23,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=925398.0, ans=0.125 2023-06-21 08:48:30,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. 
limit=22.5 2023-06-21 08:48:33,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=925398.0, ans=0.125 2023-06-21 08:48:50,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=925458.0, ans=0.125 2023-06-21 08:48:50,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925458.0, ans=0.1 2023-06-21 08:49:12,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-21 08:49:37,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=925578.0, ans=0.05 2023-06-21 08:49:43,004 INFO [train.py:996] (3/4) Epoch 6, batch 1800, loss[loss=0.2128, simple_loss=0.3088, pruned_loss=0.05836, over 21746.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3202, pruned_loss=0.0857, over 4274164.05 frames. ], batch size: 352, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:50:08,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=925698.0, ans=0.125 2023-06-21 08:50:11,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=925698.0, ans=0.125 2023-06-21 08:50:14,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 08:51:23,698 INFO [train.py:996] (3/4) Epoch 6, batch 1850, loss[loss=0.2329, simple_loss=0.3223, pruned_loss=0.07174, over 21784.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3235, pruned_loss=0.08509, over 4277672.44 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:51:30,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.862e+02 3.405e+02 4.274e+02 8.543e+02, threshold=6.809e+02, percent-clipped=2.0 2023-06-21 08:52:25,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=926118.0, ans=0.0 2023-06-21 08:52:55,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=926178.0, ans=0.2 2023-06-21 08:53:01,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=926178.0, ans=0.125 2023-06-21 08:53:05,024 INFO [train.py:996] (3/4) Epoch 6, batch 1900, loss[loss=0.1839, simple_loss=0.2681, pruned_loss=0.04984, over 21629.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3224, pruned_loss=0.08467, over 4276222.91 frames. 
], batch size: 230, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:53:10,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=926238.0, ans=0.125 2023-06-21 08:53:38,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=926298.0, ans=0.125 2023-06-21 08:53:40,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=926298.0, ans=0.025 2023-06-21 08:53:42,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=926298.0, ans=0.2 2023-06-21 08:54:07,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=926418.0, ans=0.125 2023-06-21 08:54:34,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-21 08:54:48,415 INFO [train.py:996] (3/4) Epoch 6, batch 1950, loss[loss=0.316, simple_loss=0.3772, pruned_loss=0.1273, over 21365.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3213, pruned_loss=0.0855, over 4283557.15 frames. ], batch size: 549, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:54:55,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.902e+02 3.428e+02 4.161e+02 7.529e+02, threshold=6.855e+02, percent-clipped=4.0 2023-06-21 08:56:27,448 INFO [train.py:996] (3/4) Epoch 6, batch 2000, loss[loss=0.1975, simple_loss=0.2666, pruned_loss=0.06415, over 21306.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3173, pruned_loss=0.08388, over 4280654.33 frames. ], batch size: 131, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:58:08,459 INFO [train.py:996] (3/4) Epoch 6, batch 2050, loss[loss=0.2704, simple_loss=0.3442, pruned_loss=0.09834, over 21769.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3178, pruned_loss=0.08344, over 4271723.12 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:58:19,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.151e+02 3.657e+02 4.300e+02 8.922e+02, threshold=7.314e+02, percent-clipped=4.0 2023-06-21 08:58:42,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.42 vs. limit=10.0 2023-06-21 08:59:06,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=927318.0, ans=0.125 2023-06-21 08:59:23,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927318.0, ans=0.1 2023-06-21 08:59:27,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=927318.0, ans=0.125 2023-06-21 08:59:49,714 INFO [train.py:996] (3/4) Epoch 6, batch 2100, loss[loss=0.2364, simple_loss=0.3097, pruned_loss=0.08154, over 20724.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3186, pruned_loss=0.08536, over 4278100.06 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:00:06,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-21 09:00:07,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=927438.0, ans=0.125 2023-06-21 09:00:25,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-21 09:00:31,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=927558.0, ans=0.125 2023-06-21 09:00:33,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=927558.0, ans=0.125 2023-06-21 09:00:37,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=15.0 2023-06-21 09:01:07,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=927618.0, ans=0.125 2023-06-21 09:01:31,390 INFO [train.py:996] (3/4) Epoch 6, batch 2150, loss[loss=0.2331, simple_loss=0.2933, pruned_loss=0.08648, over 21493.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3183, pruned_loss=0.08684, over 4278717.53 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:01:43,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.848e+02 3.301e+02 4.038e+02 6.672e+02, threshold=6.603e+02, percent-clipped=0.0 2023-06-21 09:03:09,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=927978.0, ans=0.125 2023-06-21 09:03:13,665 INFO [train.py:996] (3/4) Epoch 6, batch 2200, loss[loss=0.2391, simple_loss=0.3072, pruned_loss=0.08552, over 21178.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3204, pruned_loss=0.08771, over 4269999.98 frames. ], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:03:41,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=928098.0, ans=0.0 2023-06-21 09:03:47,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-21 09:04:05,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928158.0, ans=0.125 2023-06-21 09:04:07,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=928158.0, ans=0.2 2023-06-21 09:04:36,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=928278.0, ans=0.125 2023-06-21 09:04:48,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=8.0 2023-06-21 09:04:50,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=928278.0, ans=0.125 2023-06-21 09:04:53,480 INFO [train.py:996] (3/4) Epoch 6, batch 2250, loss[loss=0.1973, simple_loss=0.2624, pruned_loss=0.0661, over 21813.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3167, pruned_loss=0.08498, over 4272499.55 frames. 
], batch size: 98, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:05:06,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.750e+02 3.169e+02 3.694e+02 5.600e+02, threshold=6.338e+02, percent-clipped=0.0 2023-06-21 09:05:14,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 09:05:15,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=928398.0, ans=0.0 2023-06-21 09:05:40,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928458.0, ans=0.1 2023-06-21 09:06:26,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=928578.0, ans=0.125 2023-06-21 09:06:35,963 INFO [train.py:996] (3/4) Epoch 6, batch 2300, loss[loss=0.2095, simple_loss=0.2783, pruned_loss=0.07034, over 21838.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3136, pruned_loss=0.08485, over 4268420.86 frames. ], batch size: 107, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:06:44,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=928638.0, ans=0.015 2023-06-21 09:06:48,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.72 vs. limit=6.0 2023-06-21 09:07:21,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.89 vs. limit=5.0 2023-06-21 09:07:49,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=928818.0, ans=0.2 2023-06-21 09:07:49,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=928818.0, ans=0.2 2023-06-21 09:08:16,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=928938.0, ans=0.125 2023-06-21 09:08:17,123 INFO [train.py:996] (3/4) Epoch 6, batch 2350, loss[loss=0.2102, simple_loss=0.2697, pruned_loss=0.07537, over 21347.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3113, pruned_loss=0.08551, over 4265230.29 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:08:25,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.345e+02 4.237e+02 6.014e+02 1.096e+03, threshold=8.474e+02, percent-clipped=18.0 2023-06-21 09:08:43,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=928998.0, ans=0.0 2023-06-21 09:09:16,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=929118.0, ans=0.0 2023-06-21 09:09:34,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-21 09:09:55,388 INFO [train.py:996] (3/4) Epoch 6, batch 2400, loss[loss=0.2928, simple_loss=0.3595, pruned_loss=0.113, over 21592.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3135, pruned_loss=0.08802, over 4270947.40 frames. 
], batch size: 415, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:10:58,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=929418.0, ans=0.0 2023-06-21 09:11:06,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=929418.0, ans=0.125 2023-06-21 09:11:30,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=929478.0, ans=0.2 2023-06-21 09:11:33,331 INFO [train.py:996] (3/4) Epoch 6, batch 2450, loss[loss=0.2468, simple_loss=0.372, pruned_loss=0.06081, over 20719.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3178, pruned_loss=0.09048, over 4273523.29 frames. ], batch size: 608, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:11:41,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.127e+02 3.688e+02 4.498e+02 8.076e+02, threshold=7.375e+02, percent-clipped=0.0 2023-06-21 09:11:49,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929598.0, ans=0.1 2023-06-21 09:12:00,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=929598.0, ans=0.125 2023-06-21 09:12:30,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929658.0, ans=0.1 2023-06-21 09:12:54,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=929778.0, ans=0.125 2023-06-21 09:13:13,610 INFO [train.py:996] (3/4) Epoch 6, batch 2500, loss[loss=0.2276, simple_loss=0.3272, pruned_loss=0.06397, over 21164.00 frames. ], tot_loss[loss=0.251, simple_loss=0.319, pruned_loss=0.0915, over 4282340.25 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:13:38,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=929898.0, ans=0.125 2023-06-21 09:13:41,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=929898.0, ans=0.2 2023-06-21 09:14:13,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=930018.0, ans=0.0 2023-06-21 09:14:32,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=930078.0, ans=0.125 2023-06-21 09:14:49,951 INFO [train.py:996] (3/4) Epoch 6, batch 2550, loss[loss=0.2579, simple_loss=0.3472, pruned_loss=0.08435, over 21720.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3187, pruned_loss=0.0909, over 4278383.65 frames. 
], batch size: 247, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:58,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.790e+02 3.218e+02 3.631e+02 5.360e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 09:15:57,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=930318.0, ans=0.125 2023-06-21 09:16:00,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=930318.0, ans=0.05 2023-06-21 09:16:06,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-21 09:16:29,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=930438.0, ans=0.2 2023-06-21 09:16:31,246 INFO [train.py:996] (3/4) Epoch 6, batch 2600, loss[loss=0.2629, simple_loss=0.3336, pruned_loss=0.09608, over 21930.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3202, pruned_loss=0.09065, over 4274614.04 frames. ], batch size: 372, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:16:35,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=930438.0, ans=0.2 2023-06-21 09:16:37,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-21 09:16:38,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=930438.0, ans=0.125 2023-06-21 09:16:51,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=930498.0, ans=15.0 2023-06-21 09:17:13,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=930558.0, ans=0.125 2023-06-21 09:17:24,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=930558.0, ans=0.04949747468305833 2023-06-21 09:17:31,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=22.5 2023-06-21 09:17:48,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=930678.0, ans=0.0 2023-06-21 09:18:05,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=930678.0, ans=0.1 2023-06-21 09:18:09,157 INFO [train.py:996] (3/4) Epoch 6, batch 2650, loss[loss=0.227, simple_loss=0.2994, pruned_loss=0.07726, over 21914.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3201, pruned_loss=0.09178, over 4283541.97 frames. 
], batch size: 351, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:18:16,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.039e+02 3.537e+02 4.396e+02 7.352e+02, threshold=7.074e+02, percent-clipped=6.0 2023-06-21 09:19:07,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=930858.0, ans=0.125 2023-06-21 09:19:10,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=930858.0, ans=0.0 2023-06-21 09:19:52,593 INFO [train.py:996] (3/4) Epoch 6, batch 2700, loss[loss=0.2562, simple_loss=0.3211, pruned_loss=0.09568, over 21493.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3183, pruned_loss=0.09168, over 4282788.21 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:20:13,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=931098.0, ans=0.0 2023-06-21 09:21:34,655 INFO [train.py:996] (3/4) Epoch 6, batch 2750, loss[loss=0.2781, simple_loss=0.3462, pruned_loss=0.105, over 21797.00 frames. ], tot_loss[loss=0.248, simple_loss=0.316, pruned_loss=0.08998, over 4285353.44 frames. ], batch size: 124, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:42,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.939e+02 3.495e+02 4.251e+02 6.748e+02, threshold=6.989e+02, percent-clipped=0.0 2023-06-21 09:22:19,961 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:22:25,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=931458.0, ans=0.125 2023-06-21 09:22:45,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-21 09:23:20,456 INFO [train.py:996] (3/4) Epoch 6, batch 2800, loss[loss=0.2587, simple_loss=0.3254, pruned_loss=0.09598, over 21218.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3201, pruned_loss=0.09021, over 4289572.24 frames. ], batch size: 176, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:23:33,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=931638.0, ans=15.0 2023-06-21 09:24:05,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931698.0, ans=0.1 2023-06-21 09:25:03,025 INFO [train.py:996] (3/4) Epoch 6, batch 2850, loss[loss=0.2746, simple_loss=0.3582, pruned_loss=0.09553, over 20916.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3246, pruned_loss=0.09198, over 4285304.37 frames. 
], batch size: 607, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:25:06,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=931938.0, ans=0.125 2023-06-21 09:25:23,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.230e+02 3.892e+02 4.894e+02 8.283e+02, threshold=7.785e+02, percent-clipped=6.0 2023-06-21 09:25:25,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=931938.0, ans=0.0 2023-06-21 09:25:35,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931998.0, ans=0.1 2023-06-21 09:25:51,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=932058.0, ans=0.125 2023-06-21 09:25:59,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-21 09:26:21,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=15.0 2023-06-21 09:26:25,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932118.0, ans=0.0 2023-06-21 09:26:45,922 INFO [train.py:996] (3/4) Epoch 6, batch 2900, loss[loss=0.2797, simple_loss=0.331, pruned_loss=0.1142, over 21757.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.321, pruned_loss=0.09126, over 4284895.25 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:27:00,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=932238.0, ans=0.125 2023-06-21 09:27:22,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=932298.0, ans=0.2 2023-06-21 09:27:24,202 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:28:07,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=932418.0, ans=0.125 2023-06-21 09:28:11,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-21 09:28:17,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=932478.0, ans=0.125 2023-06-21 09:28:20,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=932478.0, ans=0.04949747468305833 2023-06-21 09:28:24,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-21 09:28:28,466 INFO [train.py:996] (3/4) Epoch 6, batch 2950, loss[loss=0.2148, simple_loss=0.3101, pruned_loss=0.05973, over 21663.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3223, pruned_loss=0.09146, over 4291006.20 frames. 
], batch size: 230, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:28:42,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 2.961e+02 3.319e+02 4.000e+02 7.696e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-21 09:29:03,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=932598.0, ans=0.0 2023-06-21 09:29:19,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=932658.0, ans=0.125 2023-06-21 09:29:44,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=932718.0, ans=0.0 2023-06-21 09:29:44,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=932718.0, ans=0.04949747468305833 2023-06-21 09:29:55,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=932778.0, ans=0.2 2023-06-21 09:30:14,715 INFO [train.py:996] (3/4) Epoch 6, batch 3000, loss[loss=0.2714, simple_loss=0.3438, pruned_loss=0.09952, over 21816.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3273, pruned_loss=0.09272, over 4292617.83 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:30:14,716 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 09:30:30,849 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.1642, 1.4062, 1.9456, 1.7107, 1.1343, 1.9648, 2.0235, 1.1791], device='cuda:3') 2023-06-21 09:30:34,693 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.255, simple_loss=0.3481, pruned_loss=0.08099, over 1796401.00 frames. 2023-06-21 09:30:34,694 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 09:30:49,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-21 09:30:57,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-21 09:31:16,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-21 09:31:19,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=932958.0, ans=0.125 2023-06-21 09:31:58,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=933078.0, ans=0.125 2023-06-21 09:32:16,767 INFO [train.py:996] (3/4) Epoch 6, batch 3050, loss[loss=0.2028, simple_loss=0.2895, pruned_loss=0.05812, over 21765.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3275, pruned_loss=0.0904, over 4291671.20 frames. ], batch size: 332, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:32:26,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.903e+02 3.413e+02 4.363e+02 7.333e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-21 09:32:43,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. 
limit=15.0 2023-06-21 09:33:59,309 INFO [train.py:996] (3/4) Epoch 6, batch 3100, loss[loss=0.2636, simple_loss=0.3488, pruned_loss=0.08916, over 21633.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3255, pruned_loss=0.08923, over 4282004.20 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:34:04,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=933438.0, ans=0.0 2023-06-21 09:34:28,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=933498.0, ans=0.125 2023-06-21 09:34:37,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=933558.0, ans=0.2 2023-06-21 09:35:40,889 INFO [train.py:996] (3/4) Epoch 6, batch 3150, loss[loss=0.2985, simple_loss=0.3587, pruned_loss=0.1192, over 21397.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.328, pruned_loss=0.09042, over 4280938.31 frames. ], batch size: 159, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:35:55,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.015e+02 3.533e+02 4.107e+02 6.510e+02, threshold=7.067e+02, percent-clipped=0.0 2023-06-21 09:35:59,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=933738.0, ans=0.125 2023-06-21 09:36:03,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 09:36:30,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=933858.0, ans=0.2 2023-06-21 09:37:02,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=933918.0, ans=0.0 2023-06-21 09:37:16,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=933978.0, ans=0.125 2023-06-21 09:37:22,843 INFO [train.py:996] (3/4) Epoch 6, batch 3200, loss[loss=0.2135, simple_loss=0.3099, pruned_loss=0.05859, over 21739.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3315, pruned_loss=0.09131, over 4284669.54 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:37:34,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934038.0, ans=0.125 2023-06-21 09:37:49,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=934098.0, ans=0.0 2023-06-21 09:38:31,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=934218.0, ans=0.125 2023-06-21 09:38:47,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=934278.0, ans=0.05 2023-06-21 09:39:08,400 INFO [train.py:996] (3/4) Epoch 6, batch 3250, loss[loss=0.2393, simple_loss=0.3009, pruned_loss=0.08887, over 21844.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3316, pruned_loss=0.0927, over 4281608.02 frames. 
], batch size: 98, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:39:18,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.861e+02 3.274e+02 3.932e+02 7.956e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-21 09:39:22,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=934338.0, ans=0.0 2023-06-21 09:39:50,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-21 09:40:00,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=934458.0, ans=0.125 2023-06-21 09:40:14,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=934518.0, ans=0.125 2023-06-21 09:40:28,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=934578.0, ans=0.125 2023-06-21 09:40:29,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-21 09:40:44,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.60 vs. limit=15.0 2023-06-21 09:40:49,787 INFO [train.py:996] (3/4) Epoch 6, batch 3300, loss[loss=0.2581, simple_loss=0.3517, pruned_loss=0.08224, over 21589.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3249, pruned_loss=0.09274, over 4274441.86 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:41:42,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=934758.0, ans=0.025 2023-06-21 09:42:21,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934878.0, ans=0.1 2023-06-21 09:42:30,695 INFO [train.py:996] (3/4) Epoch 6, batch 3350, loss[loss=0.257, simple_loss=0.328, pruned_loss=0.09303, over 21381.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3272, pruned_loss=0.09211, over 4281433.43 frames. ], batch size: 131, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:42:32,823 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:42:45,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.921e+02 3.413e+02 3.921e+02 6.338e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-21 09:42:55,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=934998.0, ans=0.125 2023-06-21 09:43:06,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. 
limit=22.5 2023-06-21 09:43:26,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=935058.0, ans=0.125 2023-06-21 09:44:01,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=935178.0, ans=0.035 2023-06-21 09:44:06,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=935178.0, ans=0.05 2023-06-21 09:44:17,665 INFO [train.py:996] (3/4) Epoch 6, batch 3400, loss[loss=0.2335, simple_loss=0.3015, pruned_loss=0.08272, over 21364.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3281, pruned_loss=0.09385, over 4288688.20 frames. ], batch size: 144, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:44:18,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=935238.0, ans=0.125 2023-06-21 09:44:58,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935358.0, ans=0.1 2023-06-21 09:45:32,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935418.0, ans=0.125 2023-06-21 09:46:04,884 INFO [train.py:996] (3/4) Epoch 6, batch 3450, loss[loss=0.2416, simple_loss=0.2849, pruned_loss=0.09914, over 20136.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.322, pruned_loss=0.09258, over 4277273.49 frames. ], batch size: 707, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:46:05,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=935538.0, ans=0.0 2023-06-21 09:46:16,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.951e+02 3.358e+02 4.026e+02 6.824e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-21 09:46:28,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=935598.0, ans=0.0 2023-06-21 09:46:34,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=935598.0, ans=0.125 2023-06-21 09:47:01,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=935658.0, ans=10.0 2023-06-21 09:47:40,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=935778.0, ans=0.0 2023-06-21 09:47:47,059 INFO [train.py:996] (3/4) Epoch 6, batch 3500, loss[loss=0.235, simple_loss=0.2934, pruned_loss=0.08836, over 21275.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3302, pruned_loss=0.09626, over 4281996.77 frames. 
], batch size: 608, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:48:02,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935838.0, ans=0.1 2023-06-21 09:48:06,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=935898.0, ans=0.125 2023-06-21 09:48:17,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=935898.0, ans=0.04949747468305833 2023-06-21 09:48:19,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935898.0, ans=0.125 2023-06-21 09:48:32,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935958.0, ans=0.125 2023-06-21 09:48:41,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935958.0, ans=0.1 2023-06-21 09:48:41,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=935958.0, ans=0.0 2023-06-21 09:48:42,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=12.0 2023-06-21 09:48:45,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:48:54,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:49:05,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936078.0, ans=0.1 2023-06-21 09:49:28,506 INFO [train.py:996] (3/4) Epoch 6, batch 3550, loss[loss=0.2211, simple_loss=0.2894, pruned_loss=0.07642, over 21392.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3336, pruned_loss=0.09784, over 4287444.93 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:49:44,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.119e+02 3.460e+02 4.086e+02 7.821e+02, threshold=6.921e+02, percent-clipped=5.0 2023-06-21 09:50:41,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=936318.0, ans=0.125 2023-06-21 09:51:13,817 INFO [train.py:996] (3/4) Epoch 6, batch 3600, loss[loss=0.2414, simple_loss=0.3092, pruned_loss=0.08678, over 21668.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3281, pruned_loss=0.09669, over 4288636.45 frames. ], batch size: 298, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:51:45,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=936498.0, ans=0.0 2023-06-21 09:52:03,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=936558.0, ans=0.5 2023-06-21 09:52:24,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=936618.0, ans=0.125 2023-06-21 09:52:56,098 INFO [train.py:996] (3/4) Epoch 6, batch 3650, loss[loss=0.2236, simple_loss=0.3044, pruned_loss=0.07141, over 21608.00 frames. 
], tot_loss[loss=0.262, simple_loss=0.3294, pruned_loss=0.09727, over 4283984.64 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:53:08,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 3.039e+02 3.609e+02 4.641e+02 6.973e+02, threshold=7.218e+02, percent-clipped=1.0 2023-06-21 09:53:18,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=936798.0, ans=0.2 2023-06-21 09:53:58,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=936918.0, ans=0.0 2023-06-21 09:54:32,627 INFO [train.py:996] (3/4) Epoch 6, batch 3700, loss[loss=0.2403, simple_loss=0.3153, pruned_loss=0.08266, over 21839.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3288, pruned_loss=0.09615, over 4288343.51 frames. ], batch size: 332, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:54:48,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=937038.0, ans=0.125 2023-06-21 09:54:55,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=937098.0, ans=0.125 2023-06-21 09:55:42,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937218.0, ans=0.125 2023-06-21 09:55:47,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=937218.0, ans=0.125 2023-06-21 09:56:18,862 INFO [train.py:996] (3/4) Epoch 6, batch 3750, loss[loss=0.2327, simple_loss=0.2874, pruned_loss=0.089, over 21277.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.327, pruned_loss=0.09571, over 4296162.13 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:56:20,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937338.0, ans=0.1 2023-06-21 09:56:24,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=937338.0, ans=0.04949747468305833 2023-06-21 09:56:31,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.989e+02 3.545e+02 4.107e+02 7.890e+02, threshold=7.090e+02, percent-clipped=2.0 2023-06-21 09:56:35,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.31 vs. limit=5.0 2023-06-21 09:56:40,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=937398.0, ans=0.125 2023-06-21 09:57:17,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-21 09:57:27,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937518.0, ans=0.125 2023-06-21 09:57:42,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=937518.0, ans=0.2 2023-06-21 09:58:01,408 INFO [train.py:996] (3/4) Epoch 6, batch 3800, loss[loss=0.2565, simple_loss=0.3245, pruned_loss=0.09424, over 21815.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.324, pruned_loss=0.09342, over 4294506.42 frames. 
], batch size: 247, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:59:35,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-21 09:59:42,405 INFO [train.py:996] (3/4) Epoch 6, batch 3850, loss[loss=0.233, simple_loss=0.2959, pruned_loss=0.08512, over 21875.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3237, pruned_loss=0.09397, over 4279536.75 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:59:55,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 3.412e+02 4.254e+02 5.791e+02 1.316e+03, threshold=8.507e+02, percent-clipped=12.0 2023-06-21 10:00:21,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=937998.0, ans=0.0 2023-06-21 10:00:51,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.99 vs. limit=12.0 2023-06-21 10:00:53,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-21 10:01:18,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=938178.0, ans=0.2 2023-06-21 10:01:23,209 INFO [train.py:996] (3/4) Epoch 6, batch 3900, loss[loss=0.2525, simple_loss=0.3086, pruned_loss=0.09821, over 21594.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3195, pruned_loss=0.09387, over 4274441.79 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:02:08,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=938358.0, ans=0.125 2023-06-21 10:02:25,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-21 10:02:29,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=938418.0, ans=0.2 2023-06-21 10:02:48,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=938478.0, ans=0.07 2023-06-21 10:03:04,370 INFO [train.py:996] (3/4) Epoch 6, batch 3950, loss[loss=0.171, simple_loss=0.2532, pruned_loss=0.04444, over 21479.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3214, pruned_loss=0.0927, over 4274665.62 frames. 
], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:03:17,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.886e+02 3.404e+02 4.103e+02 5.613e+02, threshold=6.809e+02, percent-clipped=0.0 2023-06-21 10:04:10,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=938658.0, ans=0.125 2023-06-21 10:04:10,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938658.0, ans=0.1 2023-06-21 10:04:16,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938718.0, ans=0.125 2023-06-21 10:04:18,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=938718.0, ans=0.125 2023-06-21 10:04:45,825 INFO [train.py:996] (3/4) Epoch 6, batch 4000, loss[loss=0.2199, simple_loss=0.2783, pruned_loss=0.08072, over 21286.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3134, pruned_loss=0.08841, over 4275212.64 frames. ], batch size: 144, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:05:36,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=938958.0, ans=0.09899494936611666 2023-06-21 10:05:43,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=938958.0, ans=0.125 2023-06-21 10:06:12,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939078.0, ans=0.1 2023-06-21 10:06:21,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=939078.0, ans=0.0 2023-06-21 10:06:26,113 INFO [train.py:996] (3/4) Epoch 6, batch 4050, loss[loss=0.3252, simple_loss=0.3765, pruned_loss=0.1369, over 21617.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3114, pruned_loss=0.08619, over 4278863.30 frames. ], batch size: 507, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:06:27,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-06-21 10:06:43,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.907e+02 3.499e+02 4.123e+02 8.601e+02, threshold=6.998e+02, percent-clipped=5.0 2023-06-21 10:07:35,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-21 10:08:13,467 INFO [train.py:996] (3/4) Epoch 6, batch 4100, loss[loss=0.2249, simple_loss=0.2973, pruned_loss=0.07625, over 21543.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3136, pruned_loss=0.08678, over 4279413.21 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:08:23,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=939438.0, ans=0.125 2023-06-21 10:09:14,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=939558.0, ans=0.125 2023-06-21 10:09:25,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.54 vs. 
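
The optim.py:471 entries summarize the gradient norms seen since the previous report. Reading the five numbers as min/25%/median/75%/max, the logged threshold is consistently Clipping_scale times the median (e.g. 2.0 × 3.404e+02 = 6.809e+02 in the entry above), and percent-clipped is the share of batches whose norm exceeded it. A sketch of that bookkeeping, under those assumptions:

```python
import torch

def grad_norm_report(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Summarize a buffer of recent per-batch gradient norms the way the
    optim.py:471 lines do. In these entries threshold == clipping_scale *
    median (e.g. 2.0 * 3.404e+02 = 6.809e+02), so that rule is assumed here."""
    qs = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * qs[2]
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return qs, threshold, percent_clipped

# The five quartile values from the report above, fed back in as a tiny buffer:
norms = torch.tensor([197.6, 288.6, 340.4, 410.3, 561.3])
qs, thr, pct = grad_norm_report(norms)
print(qs.tolist(), thr.item(), pct.item())  # max < threshold -> percent-clipped 0.0
```
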
limit=22.5 2023-06-21 10:09:27,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=939618.0, ans=0.0 2023-06-21 10:09:29,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-21 10:09:45,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=939678.0, ans=0.125 2023-06-21 10:09:54,960 INFO [train.py:996] (3/4) Epoch 6, batch 4150, loss[loss=0.242, simple_loss=0.3248, pruned_loss=0.07955, over 21661.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3142, pruned_loss=0.08503, over 4279546.72 frames. ], batch size: 263, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:10:17,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 3.037e+02 3.666e+02 4.331e+02 9.059e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-21 10:10:18,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=939738.0, ans=15.0 2023-06-21 10:10:25,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-21 10:10:52,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=939858.0, ans=0.125 2023-06-21 10:11:44,077 INFO [train.py:996] (3/4) Epoch 6, batch 4200, loss[loss=0.2233, simple_loss=0.2931, pruned_loss=0.07676, over 15633.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3142, pruned_loss=0.08402, over 4267790.77 frames. ], batch size: 61, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:12:03,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=940038.0, ans=0.0 2023-06-21 10:12:14,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940098.0, ans=0.125 2023-06-21 10:13:05,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=940278.0, ans=0.125 2023-06-21 10:13:20,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=940278.0, ans=0.125 2023-06-21 10:13:27,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-21 10:13:33,265 INFO [train.py:996] (3/4) Epoch 6, batch 4250, loss[loss=0.2019, simple_loss=0.2686, pruned_loss=0.0676, over 21224.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3214, pruned_loss=0.0863, over 4270508.49 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:13:52,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.226e+02 3.853e+02 4.783e+02 9.792e+02, threshold=7.707e+02, percent-clipped=2.0 2023-06-21 10:13:56,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=940398.0, ans=0.125 2023-06-21 10:15:15,851 INFO [train.py:996] (3/4) Epoch 6, batch 4300, loss[loss=0.2974, simple_loss=0.3889, pruned_loss=0.1029, over 21473.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3302, pruned_loss=0.08926, over 4269715.28 frames. 
], batch size: 471, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:15:31,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=940638.0, ans=0.0 2023-06-21 10:15:53,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=940758.0, ans=0.125 2023-06-21 10:15:53,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=940758.0, ans=0.0 2023-06-21 10:16:02,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940758.0, ans=0.1 2023-06-21 10:16:05,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=940758.0, ans=0.125 2023-06-21 10:16:11,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940758.0, ans=0.1 2023-06-21 10:16:24,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=940818.0, ans=0.125 2023-06-21 10:16:52,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=940878.0, ans=0.125 2023-06-21 10:17:01,938 INFO [train.py:996] (3/4) Epoch 6, batch 4350, loss[loss=0.2172, simple_loss=0.2774, pruned_loss=0.07852, over 21543.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3273, pruned_loss=0.08856, over 4259051.14 frames. ], batch size: 247, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:17:16,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 2.990e+02 3.501e+02 4.556e+02 7.699e+02, threshold=7.002e+02, percent-clipped=0.0 2023-06-21 10:17:45,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=941058.0, ans=0.125 2023-06-21 10:18:33,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=941178.0, ans=0.125 2023-06-21 10:18:43,521 INFO [train.py:996] (3/4) Epoch 6, batch 4400, loss[loss=0.2324, simple_loss=0.3138, pruned_loss=0.0755, over 21378.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3222, pruned_loss=0.08709, over 4259662.71 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:18:52,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=941238.0, ans=0.125 2023-06-21 10:18:53,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=941238.0, ans=0.0 2023-06-21 10:19:33,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=941358.0, ans=0.125 2023-06-21 10:19:58,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.61 vs. limit=10.0 2023-06-21 10:20:26,294 INFO [train.py:996] (3/4) Epoch 6, batch 4450, loss[loss=0.2324, simple_loss=0.3212, pruned_loss=0.07177, over 21570.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3292, pruned_loss=0.08854, over 4261494.49 frames. 
], batch size: 230, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:20:28,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-21 10:20:47,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 2.781e+02 3.285e+02 3.917e+02 7.316e+02, threshold=6.570e+02, percent-clipped=2.0 2023-06-21 10:21:13,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=941658.0, ans=0.125 2023-06-21 10:21:27,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=941718.0, ans=0.125 2023-06-21 10:21:34,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941718.0, ans=0.1 2023-06-21 10:21:37,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=941718.0, ans=0.125 2023-06-21 10:22:06,936 INFO [train.py:996] (3/4) Epoch 6, batch 4500, loss[loss=0.2334, simple_loss=0.3377, pruned_loss=0.06458, over 20116.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3306, pruned_loss=0.09045, over 4275785.35 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:22:29,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=941898.0, ans=0.125 2023-06-21 10:23:17,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=942018.0, ans=10.0 2023-06-21 10:23:30,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=942078.0, ans=0.2 2023-06-21 10:23:53,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942138.0, ans=0.125 2023-06-21 10:23:54,920 INFO [train.py:996] (3/4) Epoch 6, batch 4550, loss[loss=0.2939, simple_loss=0.3638, pruned_loss=0.112, over 21329.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3339, pruned_loss=0.0915, over 4279502.61 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:24:16,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.720e+02 3.042e+02 3.501e+02 7.303e+02, threshold=6.084e+02, percent-clipped=1.0 2023-06-21 10:24:32,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=942198.0, ans=0.2 2023-06-21 10:24:57,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=942318.0, ans=0.0 2023-06-21 10:25:09,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=942318.0, ans=0.0 2023-06-21 10:25:37,654 INFO [train.py:996] (3/4) Epoch 6, batch 4600, loss[loss=0.2316, simple_loss=0.3213, pruned_loss=0.07094, over 21650.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3363, pruned_loss=0.09291, over 4283643.88 frames. 
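
The grad_scale field in the batch summaries steps between powers of two (16.0 up to 32.0 and back down), the signature of dynamic loss scaling under fp16: the scale doubles after a long run of overflow-free steps and halves whenever gradients overflow. The vanilla PyTorch pattern is sketched below; icefall wraps this in its own training loop, so treat the sketch as the generic recipe rather than the project's exact code.

```python
import torch

# Dynamic loss scaling as in torch.cuda.amp; the grad_scale values logged in
# this section are exactly what scaler.get_scale() would report over time.
scaler = torch.cuda.amp.GradScaler(
    init_scale=16.0,      # a starting grad_scale
    growth_factor=2.0,    # double after `growth_interval` clean steps
    backoff_factor=0.5,   # halve whenever a step sees inf/nan gradients
    growth_interval=2000,
)

def train_step(model, optimizer, features, targets, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if the gradients overflowed
    scaler.update()          # adjusts the scale; this is what the log tracks
    return loss.detach()
```
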
], batch size: 389, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:25:47,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=942438.0, ans=0.0 2023-06-21 10:26:21,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-21 10:26:38,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=942558.0, ans=0.125 2023-06-21 10:26:52,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=942618.0, ans=0.0 2023-06-21 10:27:10,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=942678.0, ans=0.125 2023-06-21 10:27:13,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=942678.0, ans=0.125 2023-06-21 10:27:18,153 INFO [train.py:996] (3/4) Epoch 6, batch 4650, loss[loss=0.1943, simple_loss=0.2616, pruned_loss=0.06344, over 21316.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3284, pruned_loss=0.09039, over 4292096.13 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:27:33,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=942738.0, ans=0.125 2023-06-21 10:27:44,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.693e+02 3.104e+02 3.574e+02 6.080e+02, threshold=6.208e+02, percent-clipped=0.0 2023-06-21 10:28:35,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=942918.0, ans=0.2 2023-06-21 10:28:59,766 INFO [train.py:996] (3/4) Epoch 6, batch 4700, loss[loss=0.2304, simple_loss=0.2815, pruned_loss=0.08962, over 20085.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.318, pruned_loss=0.08776, over 4280941.80 frames. ], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:29:46,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-21 10:30:00,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-21 10:30:29,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=943278.0, ans=0.95 2023-06-21 10:30:39,979 INFO [train.py:996] (3/4) Epoch 6, batch 4750, loss[loss=0.2491, simple_loss=0.3116, pruned_loss=0.09329, over 22049.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3124, pruned_loss=0.08785, over 4282824.72 frames. ], batch size: 119, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:31:00,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.759e+02 3.391e+02 4.138e+02 8.179e+02, threshold=6.782e+02, percent-clipped=2.0 2023-06-21 10:31:09,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=943398.0, ans=0.125 2023-06-21 10:31:48,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=15.0 2023-06-21 10:31:50,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.22 vs. limit=10.0 2023-06-21 10:31:53,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=943518.0, ans=15.0 2023-06-21 10:32:04,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=943578.0, ans=0.04949747468305833 2023-06-21 10:32:20,449 INFO [train.py:996] (3/4) Epoch 6, batch 4800, loss[loss=0.2295, simple_loss=0.3079, pruned_loss=0.07557, over 21744.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3139, pruned_loss=0.08892, over 4293027.71 frames. ], batch size: 247, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:32:32,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-06-21 10:33:16,171 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:33:27,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=943818.0, ans=0.125 2023-06-21 10:34:00,638 INFO [train.py:996] (3/4) Epoch 6, batch 4850, loss[loss=0.2486, simple_loss=0.3104, pruned_loss=0.09342, over 21674.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3153, pruned_loss=0.08878, over 4295689.08 frames. ], batch size: 230, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:34:26,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=943998.0, ans=0.2 2023-06-21 10:34:28,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.010e+02 3.635e+02 4.678e+02 6.819e+02, threshold=7.270e+02, percent-clipped=1.0 2023-06-21 10:35:42,676 INFO [train.py:996] (3/4) Epoch 6, batch 4900, loss[loss=0.2652, simple_loss=0.3413, pruned_loss=0.09457, over 21292.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.318, pruned_loss=0.09015, over 4299962.65 frames. ], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:36:06,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.23 vs. limit=15.0 2023-06-21 10:36:14,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=944298.0, ans=0.2 2023-06-21 10:36:26,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-21 10:36:49,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-21 10:36:54,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-21 10:37:24,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=944478.0, ans=0.2 2023-06-21 10:37:36,970 INFO [train.py:996] (3/4) Epoch 6, batch 4950, loss[loss=0.2119, simple_loss=0.314, pruned_loss=0.05495, over 21636.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3226, pruned_loss=0.08788, over 4295822.47 frames. 
], batch size: 414, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:37:53,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-21 10:37:53,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.748e+02 3.392e+02 4.071e+02 6.752e+02, threshold=6.784e+02, percent-clipped=0.0 2023-06-21 10:38:20,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=944658.0, ans=0.09899494936611666 2023-06-21 10:38:52,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=944778.0, ans=0.2 2023-06-21 10:39:10,129 INFO [train.py:996] (3/4) Epoch 6, batch 5000, loss[loss=0.2734, simple_loss=0.341, pruned_loss=0.103, over 21753.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3209, pruned_loss=0.0843, over 4290675.73 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:39:15,091 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:39:48,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=944958.0, ans=0.125 2023-06-21 10:40:26,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=945078.0, ans=0.125 2023-06-21 10:40:43,877 INFO [train.py:996] (3/4) Epoch 6, batch 5050, loss[loss=0.2397, simple_loss=0.3073, pruned_loss=0.08601, over 21433.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.321, pruned_loss=0.08596, over 4289094.60 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:40:49,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-21 10:40:50,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=945138.0, ans=0.5 2023-06-21 10:41:00,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.826e+02 3.138e+02 3.685e+02 6.329e+02, threshold=6.276e+02, percent-clipped=0.0 2023-06-21 10:41:08,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=945198.0, ans=0.125 2023-06-21 10:41:09,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0 2023-06-21 10:42:16,824 INFO [train.py:996] (3/4) Epoch 6, batch 5100, loss[loss=0.2522, simple_loss=0.3206, pruned_loss=0.09191, over 21776.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3186, pruned_loss=0.08602, over 4294248.79 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:43:08,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=945558.0, ans=0.0 2023-06-21 10:43:13,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=945618.0, ans=0.125 2023-06-21 10:43:56,807 INFO [train.py:996] (3/4) Epoch 6, batch 5150, loss[loss=0.2496, simple_loss=0.315, pruned_loss=0.09212, over 21729.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3158, pruned_loss=0.08631, over 4292974.74 frames. 
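
The scaling.py:962 Whitening entries compare a per-module whitening metric against a limit; the metric is small when a module's output channels are decorrelated with roughly equal variance and grows as the covariance becomes anisotropic. One plausible such measure, used here purely as an illustrative stand-in for whatever scaling.py actually computes, is the ratio of the largest covariance eigenvalue to the mean eigenvalue:

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """One plausible 'how non-white are these features' measure: the largest
    eigenvalue of the channel covariance over the mean eigenvalue. Perfectly
    whitened activations give 1.0; a logged reading like 10.55 vs. limit=15.0
    would then mean 'anisotropic, but within budget'. The exact icefall
    formula may differ; this is an assumption for illustration."""
    x = x.reshape(-1, x.shape[-1])            # (frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)         # eigenvalues, ascending
    return eigs[-1] / eigs.mean()

feats = torch.randn(1000, 256) * torch.linspace(0.5, 2.0, 256)
print(whitening_metric(feats))  # > 1.0: channels have unequal variances
```
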
], batch size: 247, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:44:08,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=945738.0, ans=0.2 2023-06-21 10:44:19,126 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 2.990e+02 3.465e+02 4.317e+02 6.616e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-21 10:44:21,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=945798.0, ans=10.0 2023-06-21 10:44:33,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=945858.0, ans=0.125 2023-06-21 10:44:41,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=945858.0, ans=0.5 2023-06-21 10:44:45,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-21 10:45:01,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=945918.0, ans=0.125 2023-06-21 10:45:15,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=945978.0, ans=0.125 2023-06-21 10:45:37,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=945978.0, ans=0.09899494936611666 2023-06-21 10:45:42,238 INFO [train.py:996] (3/4) Epoch 6, batch 5200, loss[loss=0.2653, simple_loss=0.3679, pruned_loss=0.08128, over 21208.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3188, pruned_loss=0.08722, over 4289657.82 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:47:21,848 INFO [train.py:996] (3/4) Epoch 6, batch 5250, loss[loss=0.2231, simple_loss=0.3071, pruned_loss=0.06952, over 21408.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3236, pruned_loss=0.08633, over 4292012.92 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:47:39,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.923e+02 3.658e+02 4.333e+02 7.638e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-21 10:47:53,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=946398.0, ans=0.025 2023-06-21 10:48:45,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=946578.0, ans=0.02 2023-06-21 10:49:00,833 INFO [train.py:996] (3/4) Epoch 6, batch 5300, loss[loss=0.2302, simple_loss=0.2964, pruned_loss=0.08198, over 21894.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3224, pruned_loss=0.08618, over 4296419.34 frames. ], batch size: 107, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:49:48,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-21 10:50:04,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-21 10:50:39,174 INFO [train.py:996] (3/4) Epoch 6, batch 5350, loss[loss=0.238, simple_loss=0.3096, pruned_loss=0.08321, over 21723.00 frames. 
], tot_loss[loss=0.2491, simple_loss=0.3217, pruned_loss=0.08827, over 4300461.09 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:50:50,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=946938.0, ans=0.125 2023-06-21 10:50:58,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.908e+02 3.182e+02 3.564e+02 5.714e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 10:51:13,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-21 10:51:25,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=947058.0, ans=0.0 2023-06-21 10:51:57,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-21 10:52:11,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=947178.0, ans=0.2 2023-06-21 10:52:18,091 INFO [train.py:996] (3/4) Epoch 6, batch 5400, loss[loss=0.2231, simple_loss=0.3021, pruned_loss=0.0721, over 21684.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3202, pruned_loss=0.08895, over 4289295.27 frames. ], batch size: 389, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:52:20,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=947238.0, ans=0.125 2023-06-21 10:52:26,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=947238.0, ans=0.0 2023-06-21 10:52:58,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=947358.0, ans=0.2 2023-06-21 10:53:36,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=947478.0, ans=0.125 2023-06-21 10:53:53,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.43 vs. limit=22.5 2023-06-21 10:53:53,494 INFO [train.py:996] (3/4) Epoch 6, batch 5450, loss[loss=0.227, simple_loss=0.3109, pruned_loss=0.0716, over 21662.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3202, pruned_loss=0.08745, over 4295297.72 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:54:03,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=947538.0, ans=0.2 2023-06-21 10:54:17,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 2.977e+02 3.575e+02 4.401e+02 6.671e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-21 10:54:31,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=947598.0, ans=0.1 2023-06-21 10:55:28,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=947778.0, ans=0.125 2023-06-21 10:55:28,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=947778.0, ans=0.125 2023-06-21 10:55:34,823 INFO [train.py:996] (3/4) Epoch 6, batch 5500, loss[loss=0.2243, simple_loss=0.3143, pruned_loss=0.06719, over 21435.00 frames. 
], tot_loss[loss=0.2471, simple_loss=0.3248, pruned_loss=0.08473, over 4293937.35 frames. ], batch size: 211, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:55:49,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=947838.0, ans=0.05 2023-06-21 10:55:51,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-21 10:55:52,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=947838.0, ans=0.0 2023-06-21 10:56:08,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=947898.0, ans=0.2 2023-06-21 10:56:34,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-21 10:57:01,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=948078.0, ans=0.07 2023-06-21 10:57:20,571 INFO [train.py:996] (3/4) Epoch 6, batch 5550, loss[loss=0.3279, simple_loss=0.402, pruned_loss=0.1269, over 21457.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3253, pruned_loss=0.08238, over 4291463.39 frames. ], batch size: 507, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:57:31,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=948138.0, ans=0.125 2023-06-21 10:57:40,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=948198.0, ans=0.125 2023-06-21 10:57:45,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.700e+02 3.215e+02 3.984e+02 5.956e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-21 10:58:10,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=948258.0, ans=0.125 2023-06-21 10:59:07,563 INFO [train.py:996] (3/4) Epoch 6, batch 5600, loss[loss=0.2253, simple_loss=0.3249, pruned_loss=0.06285, over 21200.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3272, pruned_loss=0.08111, over 4287655.48 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 10:59:14,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=948438.0, ans=0.2 2023-06-21 10:59:17,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=948438.0, ans=0.0 2023-06-21 10:59:42,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-21 11:00:35,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=948678.0, ans=0.125 2023-06-21 11:00:37,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948678.0, ans=0.125 2023-06-21 11:00:46,510 INFO [train.py:996] (3/4) Epoch 6, batch 5650, loss[loss=0.2987, simple_loss=0.3941, pruned_loss=0.1017, over 21285.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3316, pruned_loss=0.0834, over 4286489.22 frames. 
], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:00:50,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-21 11:01:10,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.904e+02 3.593e+02 4.833e+02 7.419e+02, threshold=7.185e+02, percent-clipped=8.0 2023-06-21 11:01:12,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=948798.0, ans=0.125 2023-06-21 11:01:43,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=948918.0, ans=0.5 2023-06-21 11:02:28,399 INFO [train.py:996] (3/4) Epoch 6, batch 5700, loss[loss=0.2185, simple_loss=0.3023, pruned_loss=0.06736, over 21633.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3305, pruned_loss=0.08477, over 4281030.42 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:02:44,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-21 11:02:45,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949038.0, ans=0.1 2023-06-21 11:02:46,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=949038.0, ans=0.125 2023-06-21 11:03:13,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-21 11:03:31,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=949218.0, ans=0.0 2023-06-21 11:04:14,742 INFO [train.py:996] (3/4) Epoch 6, batch 5750, loss[loss=0.1956, simple_loss=0.2913, pruned_loss=0.04989, over 21741.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3231, pruned_loss=0.08165, over 4281899.18 frames. ], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:04:34,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.640e+02 3.214e+02 3.808e+02 7.764e+02, threshold=6.428e+02, percent-clipped=2.0 2023-06-21 11:04:40,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=949398.0, ans=0.2 2023-06-21 11:04:57,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=22.5 2023-06-21 11:05:33,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-21 11:05:56,169 INFO [train.py:996] (3/4) Epoch 6, batch 5800, loss[loss=0.2546, simple_loss=0.3396, pruned_loss=0.08474, over 21662.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.321, pruned_loss=0.07981, over 4276965.60 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:06:02,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=949638.0, ans=0.2 2023-06-21 11:06:33,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. 
limit=6.0 2023-06-21 11:06:48,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=949758.0, ans=0.125 2023-06-21 11:07:31,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=949878.0, ans=0.125 2023-06-21 11:07:32,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=949878.0, ans=0.125 2023-06-21 11:07:37,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=949938.0, ans=0.0 2023-06-21 11:07:39,063 INFO [train.py:996] (3/4) Epoch 6, batch 5850, loss[loss=0.2136, simple_loss=0.3363, pruned_loss=0.04547, over 21168.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3191, pruned_loss=0.07554, over 4276678.63 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:08:03,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.386e+02 2.763e+02 3.432e+02 5.220e+02, threshold=5.525e+02, percent-clipped=0.0 2023-06-21 11:08:23,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-21 11:08:38,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-21 11:08:38,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-21 11:09:01,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=950178.0, ans=0.0 2023-06-21 11:09:04,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=950178.0, ans=0.125 2023-06-21 11:09:08,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950178.0, ans=0.1 2023-06-21 11:09:15,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950178.0, ans=0.1 2023-06-21 11:09:18,233 INFO [train.py:996] (3/4) Epoch 6, batch 5900, loss[loss=0.2441, simple_loss=0.3178, pruned_loss=0.08526, over 21594.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3116, pruned_loss=0.06949, over 4277051.85 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:10:05,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=950358.0, ans=0.125 2023-06-21 11:10:29,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=950418.0, ans=0.0 2023-06-21 11:10:55,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=950538.0, ans=0.05 2023-06-21 11:10:57,226 INFO [train.py:996] (3/4) Epoch 6, batch 5950, loss[loss=0.2373, simple_loss=0.2947, pruned_loss=0.08996, over 22005.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3119, pruned_loss=0.07436, over 4285320.77 frames. 
], batch size: 103, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 11:11:22,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 2.586e+02 3.198e+02 3.905e+02 6.345e+02, threshold=6.395e+02, percent-clipped=3.0 2023-06-21 11:12:39,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=950838.0, ans=0.035 2023-06-21 11:12:40,605 INFO [train.py:996] (3/4) Epoch 6, batch 6000, loss[loss=0.2246, simple_loss=0.2843, pruned_loss=0.08247, over 21655.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3086, pruned_loss=0.07848, over 4275735.17 frames. ], batch size: 264, lr: 5.24e-03, grad_scale: 32.0 2023-06-21 11:12:40,605 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 11:12:49,925 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.2498, 2.4606, 2.0822, 3.0437], device='cuda:3') 2023-06-21 11:12:57,292 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2656, simple_loss=0.3626, pruned_loss=0.08426, over 1796401.00 frames. 2023-06-21 11:12:57,293 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 11:12:57,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950838.0, ans=0.1 2023-06-21 11:12:59,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950838.0, ans=0.1 2023-06-21 11:13:23,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-21 11:13:56,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-21 11:13:59,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=951018.0, ans=0.125 2023-06-21 11:14:04,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-21 11:14:34,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-21 11:14:43,500 INFO [train.py:996] (3/4) Epoch 6, batch 6050, loss[loss=0.1854, simple_loss=0.2514, pruned_loss=0.05966, over 21421.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3037, pruned_loss=0.07943, over 4264274.22 frames. ], batch size: 195, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:14:59,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-21 11:15:16,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.804e+02 3.230e+02 3.761e+02 6.873e+02, threshold=6.459e+02, percent-clipped=1.0 2023-06-21 11:16:08,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951378.0, ans=0.1 2023-06-21 11:16:16,196 INFO [train.py:996] (3/4) Epoch 6, batch 6100, loss[loss=0.2759, simple_loss=0.3411, pruned_loss=0.1054, over 21912.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3017, pruned_loss=0.07801, over 4267644.74 frames. 
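
The zipformer.py:1728 line emitted during the batch-6000 validation pass reports the entropy of each head's self-attention distribution, one value per head. A sketch of that diagnostic, assuming the weights are already softmax-normalized over keys and that the head axis comes first (the real tensor layout may differ):

```python
import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Entropy of attention distributions, one value per head, in the spirit
    of the zipformer.py diagnostic above. `attn` is assumed to have shape
    (num_heads, batch, query_len, key_len); entropy is taken over the key
    axis and averaged over everything else. The axis layout is an assumption."""
    eps = 1.0e-20
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, batch, queries)
    return ent.mean(dim=(1, 2))                     # (heads,)

attn = torch.softmax(torch.randn(4, 2, 50, 50), dim=-1)
print(attn_weights_entropy(attn))  # 4 heads -> 4 entropies, like the log line
```

Low entropy for a head means sharply peaked attention; values near log(key_len) mean the head is attending almost uniformly.
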
], batch size: 124, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:16:18,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.51 vs. limit=6.0 2023-06-21 11:17:24,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=951618.0, ans=0.125 2023-06-21 11:17:27,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951618.0, ans=0.125 2023-06-21 11:17:35,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=951618.0, ans=0.0 2023-06-21 11:17:55,285 INFO [train.py:996] (3/4) Epoch 6, batch 6150, loss[loss=0.2153, simple_loss=0.2934, pruned_loss=0.06859, over 21535.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3048, pruned_loss=0.08038, over 4273071.33 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:18:33,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.602e+02 3.011e+02 3.655e+02 5.167e+02, threshold=6.022e+02, percent-clipped=0.0 2023-06-21 11:18:33,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=951798.0, ans=0.125 2023-06-21 11:19:39,827 INFO [train.py:996] (3/4) Epoch 6, batch 6200, loss[loss=0.2585, simple_loss=0.3424, pruned_loss=0.08734, over 21710.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3102, pruned_loss=0.08134, over 4278833.22 frames. ], batch size: 414, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:19:47,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-21 11:21:14,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=952278.0, ans=0.0 2023-06-21 11:21:20,427 INFO [train.py:996] (3/4) Epoch 6, batch 6250, loss[loss=0.2315, simple_loss=0.2882, pruned_loss=0.0874, over 20263.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3157, pruned_loss=0.08159, over 4276318.71 frames. ], batch size: 702, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:21:39,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=952338.0, ans=0.0 2023-06-21 11:21:53,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.114e+02 4.042e+02 5.400e+02 9.374e+02, threshold=8.084e+02, percent-clipped=17.0 2023-06-21 11:21:54,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.70 vs. 
limit=22.5 2023-06-21 11:22:06,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=952458.0, ans=0.0 2023-06-21 11:22:25,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=952518.0, ans=0.125 2023-06-21 11:22:40,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=952578.0, ans=0.2 2023-06-21 11:22:57,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=952638.0, ans=0.125 2023-06-21 11:22:58,321 INFO [train.py:996] (3/4) Epoch 6, batch 6300, loss[loss=0.2522, simple_loss=0.3098, pruned_loss=0.09732, over 21772.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3191, pruned_loss=0.08116, over 4279463.98 frames. ], batch size: 247, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:23:00,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=952638.0, ans=10.0 2023-06-21 11:23:21,521 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:23:45,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:23:45,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:23:47,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:24:02,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=952818.0, ans=0.125 2023-06-21 11:24:48,243 INFO [train.py:996] (3/4) Epoch 6, batch 6350, loss[loss=0.2992, simple_loss=0.3636, pruned_loss=0.1173, over 21452.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3208, pruned_loss=0.08441, over 4283803.02 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:24:57,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-21 11:25:14,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-21 11:25:15,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=952998.0, ans=0.0 2023-06-21 11:25:17,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.069e+02 3.635e+02 4.276e+02 7.885e+02, threshold=7.269e+02, percent-clipped=0.0 2023-06-21 11:25:35,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=953058.0, ans=0.0 2023-06-21 11:25:38,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=953058.0, ans=0.07 2023-06-21 11:26:28,455 INFO [train.py:996] (3/4) Epoch 6, batch 6400, loss[loss=0.2944, simple_loss=0.3581, pruned_loss=0.1154, over 21315.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3269, pruned_loss=0.08944, over 4284992.66 frames. 
], batch size: 143, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:26:53,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-21 11:27:04,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=953358.0, ans=15.0 2023-06-21 11:27:07,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-21 11:27:59,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-21 11:28:12,562 INFO [train.py:996] (3/4) Epoch 6, batch 6450, loss[loss=0.2285, simple_loss=0.3052, pruned_loss=0.07593, over 21869.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3298, pruned_loss=0.08883, over 4287010.16 frames. ], batch size: 372, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:28:15,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=953538.0, ans=0.125 2023-06-21 11:28:27,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-21 11:28:36,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.859e+02 3.374e+02 4.192e+02 6.332e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-21 11:29:51,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=953778.0, ans=0.0 2023-06-21 11:29:51,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=953778.0, ans=0.0 2023-06-21 11:29:54,933 INFO [train.py:996] (3/4) Epoch 6, batch 6500, loss[loss=0.2321, simple_loss=0.283, pruned_loss=0.09059, over 21393.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3229, pruned_loss=0.0866, over 4284963.59 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:30:51,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=954018.0, ans=0.125 2023-06-21 11:31:13,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=954078.0, ans=0.125 2023-06-21 11:31:20,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.06 vs. limit=10.0 2023-06-21 11:31:35,547 INFO [train.py:996] (3/4) Epoch 6, batch 6550, loss[loss=0.2716, simple_loss=0.3365, pruned_loss=0.1033, over 21627.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3221, pruned_loss=0.0856, over 4284554.52 frames. 
], batch size: 230, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:31:48,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=954138.0, ans=0.125 2023-06-21 11:31:59,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.779e+02 3.082e+02 3.818e+02 7.032e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-21 11:32:50,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=954318.0, ans=0.125 2023-06-21 11:33:14,272 INFO [train.py:996] (3/4) Epoch 6, batch 6600, loss[loss=0.207, simple_loss=0.262, pruned_loss=0.07594, over 21237.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3165, pruned_loss=0.08484, over 4267750.98 frames. ], batch size: 548, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:33:15,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-21 11:33:19,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=954438.0, ans=0.07 2023-06-21 11:33:27,305 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:34:12,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=954618.0, ans=0.125 2023-06-21 11:34:12,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=954618.0, ans=0.125 2023-06-21 11:34:32,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=954678.0, ans=0.125 2023-06-21 11:34:52,641 INFO [train.py:996] (3/4) Epoch 6, batch 6650, loss[loss=0.199, simple_loss=0.2604, pruned_loss=0.06876, over 21782.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3084, pruned_loss=0.08229, over 4267140.56 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:35:04,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954738.0, ans=0.1 2023-06-21 11:35:07,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-21 11:35:21,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.585e+02 3.021e+02 3.677e+02 6.066e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-21 11:35:23,540 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:35:46,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=954858.0, ans=0.95 2023-06-21 11:36:07,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=954918.0, ans=0.125 2023-06-21 11:36:18,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=954978.0, ans=0.0 2023-06-21 11:36:30,472 INFO [train.py:996] (3/4) Epoch 6, batch 6700, loss[loss=0.1975, simple_loss=0.2704, pruned_loss=0.06228, over 21509.00 frames. ], tot_loss[loss=0.233, simple_loss=0.302, pruned_loss=0.08196, over 4273168.44 frames. 
], batch size: 212, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:36:59,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=955098.0, ans=0.125 2023-06-21 11:38:09,244 INFO [train.py:996] (3/4) Epoch 6, batch 6750, loss[loss=0.2885, simple_loss=0.3806, pruned_loss=0.09819, over 19817.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.2996, pruned_loss=0.08258, over 4263083.08 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:38:09,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=955338.0, ans=0.125 2023-06-21 11:38:32,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.820e+02 3.249e+02 3.969e+02 6.943e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-21 11:38:53,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=955458.0, ans=0.125 2023-06-21 11:39:47,076 INFO [train.py:996] (3/4) Epoch 6, batch 6800, loss[loss=0.2399, simple_loss=0.2958, pruned_loss=0.09197, over 21854.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3023, pruned_loss=0.08395, over 4271377.04 frames. ], batch size: 107, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:39:55,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=955638.0, ans=0.0 2023-06-21 11:39:58,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=955638.0, ans=0.125 2023-06-21 11:40:25,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=955758.0, ans=0.125 2023-06-21 11:40:58,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-21 11:41:18,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=955878.0, ans=0.0 2023-06-21 11:41:24,271 INFO [train.py:996] (3/4) Epoch 6, batch 6850, loss[loss=0.2715, simple_loss=0.3259, pruned_loss=0.1085, over 21802.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3021, pruned_loss=0.08477, over 4267454.36 frames. ], batch size: 414, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:41:47,317 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:41:48,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.884e+02 3.443e+02 4.128e+02 6.086e+02, threshold=6.887e+02, percent-clipped=0.0 2023-06-21 11:42:56,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=956178.0, ans=0.0 2023-06-21 11:43:04,861 INFO [train.py:996] (3/4) Epoch 6, batch 6900, loss[loss=0.255, simple_loss=0.3176, pruned_loss=0.09618, over 21526.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3047, pruned_loss=0.08521, over 4278306.89 frames. 
], batch size: 131, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:43:55,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=956358.0, ans=0.125 2023-06-21 11:44:09,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=956418.0, ans=0.125 2023-06-21 11:44:17,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-21 11:44:21,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=956418.0, ans=0.125 2023-06-21 11:44:42,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=956478.0, ans=0.125 2023-06-21 11:44:45,071 INFO [train.py:996] (3/4) Epoch 6, batch 6950, loss[loss=0.316, simple_loss=0.3762, pruned_loss=0.1278, over 21444.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.307, pruned_loss=0.08177, over 4276397.05 frames. ], batch size: 471, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:44:56,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=12.0 2023-06-21 11:45:13,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.660e+02 3.171e+02 3.598e+02 5.873e+02, threshold=6.343e+02, percent-clipped=0.0 2023-06-21 11:46:24,056 INFO [train.py:996] (3/4) Epoch 6, batch 7000, loss[loss=0.2354, simple_loss=0.2954, pruned_loss=0.08773, over 21627.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3098, pruned_loss=0.08516, over 4277929.51 frames. ], batch size: 298, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:47:09,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=956958.0, ans=0.125 2023-06-21 11:47:49,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-21 11:48:04,821 INFO [train.py:996] (3/4) Epoch 6, batch 7050, loss[loss=0.2095, simple_loss=0.2699, pruned_loss=0.07453, over 16375.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3085, pruned_loss=0.0846, over 4266322.61 frames. ], batch size: 61, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:48:16,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=957138.0, ans=0.125 2023-06-21 11:48:25,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-21 11:48:32,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=957198.0, ans=0.125 2023-06-21 11:48:38,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.010e+02 3.429e+02 4.410e+02 6.547e+02, threshold=6.858e+02, percent-clipped=1.0 2023-06-21 11:49:19,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=957318.0, ans=0.125 2023-06-21 11:49:45,405 INFO [train.py:996] (3/4) Epoch 6, batch 7100, loss[loss=0.2789, simple_loss=0.3466, pruned_loss=0.1056, over 20694.00 frames. 
], tot_loss[loss=0.2446, simple_loss=0.3146, pruned_loss=0.08733, over 4270574.93 frames. ], batch size: 607, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:49:51,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-21 11:49:59,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-21 11:50:41,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=957558.0, ans=0.125 2023-06-21 11:50:43,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=957558.0, ans=0.0 2023-06-21 11:50:59,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=957618.0, ans=0.125 2023-06-21 11:51:08,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957678.0, ans=0.1 2023-06-21 11:51:25,093 INFO [train.py:996] (3/4) Epoch 6, batch 7150, loss[loss=0.2912, simple_loss=0.3503, pruned_loss=0.116, over 21394.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3119, pruned_loss=0.08484, over 4274972.05 frames. ], batch size: 549, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:52:08,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.713e+02 3.100e+02 3.583e+02 6.411e+02, threshold=6.200e+02, percent-clipped=0.0 2023-06-21 11:52:13,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=957798.0, ans=0.125 2023-06-21 11:52:20,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957858.0, ans=0.0 2023-06-21 11:52:27,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-21 11:52:54,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=957978.0, ans=0.125 2023-06-21 11:53:14,728 INFO [train.py:996] (3/4) Epoch 6, batch 7200, loss[loss=0.2492, simple_loss=0.3242, pruned_loss=0.08711, over 21764.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3153, pruned_loss=0.08806, over 4277482.11 frames. ], batch size: 102, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:53:53,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=958098.0, ans=0.125 2023-06-21 11:53:56,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=958158.0, ans=0.125 2023-06-21 11:53:56,233 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:54:20,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958218.0, ans=0.1 2023-06-21 11:54:33,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-21 11:54:54,071 INFO [train.py:996] (3/4) Epoch 6, batch 7250, loss[loss=0.2454, simple_loss=0.2889, pruned_loss=0.101, over 21373.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3101, pruned_loss=0.08808, over 4277153.60 frames. ], batch size: 509, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:54:54,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=958338.0, ans=0.125 2023-06-21 11:55:18,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958398.0, ans=0.0 2023-06-21 11:55:20,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=958398.0, ans=0.0 2023-06-21 11:55:28,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.745e+02 3.061e+02 4.034e+02 7.842e+02, threshold=6.122e+02, percent-clipped=5.0 2023-06-21 11:55:39,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=958458.0, ans=0.125 2023-06-21 11:55:44,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=958458.0, ans=0.2 2023-06-21 11:55:59,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=8.0 2023-06-21 11:56:33,696 INFO [train.py:996] (3/4) Epoch 6, batch 7300, loss[loss=0.206, simple_loss=0.2638, pruned_loss=0.0741, over 21807.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3029, pruned_loss=0.08677, over 4278339.58 frames. ], batch size: 352, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:56:45,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.33 vs. limit=10.0 2023-06-21 11:57:28,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-21 11:57:34,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=958818.0, ans=0.05 2023-06-21 11:57:44,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958818.0, ans=0.0 2023-06-21 11:57:47,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=958878.0, ans=0.125 2023-06-21 11:57:55,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=958878.0, ans=0.2 2023-06-21 11:58:21,333 INFO [train.py:996] (3/4) Epoch 6, batch 7350, loss[loss=0.2871, simple_loss=0.3479, pruned_loss=0.1132, over 21408.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3017, pruned_loss=0.08689, over 4272218.39 frames. 
], batch size: 549, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 11:58:25,162 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:58:41,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=958998.0, ans=15.0 2023-06-21 11:58:52,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.849e+02 3.277e+02 4.064e+02 7.126e+02, threshold=6.555e+02, percent-clipped=2.0 2023-06-21 12:00:07,531 INFO [train.py:996] (3/4) Epoch 6, batch 7400, loss[loss=0.2886, simple_loss=0.3698, pruned_loss=0.1037, over 21619.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3092, pruned_loss=0.08796, over 4273018.45 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:00:32,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=959298.0, ans=0.125 2023-06-21 12:00:45,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=959358.0, ans=10.0 2023-06-21 12:00:51,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-21 12:01:26,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=959478.0, ans=10.0 2023-06-21 12:01:48,230 INFO [train.py:996] (3/4) Epoch 6, batch 7450, loss[loss=0.2196, simple_loss=0.279, pruned_loss=0.08009, over 21780.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3079, pruned_loss=0.08719, over 4264755.76 frames. ], batch size: 124, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:01:55,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=959538.0, ans=15.0 2023-06-21 12:02:13,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.784e+02 3.192e+02 3.792e+02 7.564e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-21 12:02:14,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-21 12:02:19,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959658.0, ans=0.1 2023-06-21 12:03:31,309 INFO [train.py:996] (3/4) Epoch 6, batch 7500, loss[loss=0.2861, simple_loss=0.381, pruned_loss=0.09556, over 21555.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3121, pruned_loss=0.08875, over 4269613.82 frames. 
], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:03:33,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=959838.0, ans=0.125 2023-06-21 12:03:46,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=959898.0, ans=0.0 2023-06-21 12:03:51,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=959898.0, ans=0.0 2023-06-21 12:04:12,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=959958.0, ans=0.05 2023-06-21 12:04:47,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-21 12:04:57,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-21 12:05:16,807 INFO [train.py:996] (3/4) Epoch 6, batch 7550, loss[loss=0.2157, simple_loss=0.3346, pruned_loss=0.0484, over 20784.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3202, pruned_loss=0.08769, over 4268809.64 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:05:17,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960138.0, ans=0.1 2023-06-21 12:05:18,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=960138.0, ans=0.125 2023-06-21 12:05:23,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=960138.0, ans=0.2 2023-06-21 12:05:41,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.161e+02 3.642e+02 4.665e+02 7.611e+02, threshold=7.284e+02, percent-clipped=6.0 2023-06-21 12:05:48,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=960258.0, ans=0.125 2023-06-21 12:05:50,180 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:06:03,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=960258.0, ans=0.05 2023-06-21 12:06:03,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-21 12:06:38,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=960378.0, ans=0.2 2023-06-21 12:06:50,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=960378.0, ans=0.125 2023-06-21 12:06:54,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=960438.0, ans=0.125 2023-06-21 12:06:56,334 INFO [train.py:996] (3/4) Epoch 6, batch 7600, loss[loss=0.2264, simple_loss=0.2957, pruned_loss=0.07858, over 21895.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3191, pruned_loss=0.0871, over 4276816.30 frames. 
], batch size: 351, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:07:43,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-21 12:07:51,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=960558.0, ans=0.015 2023-06-21 12:08:16,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=960618.0, ans=15.0 2023-06-21 12:08:34,750 INFO [train.py:996] (3/4) Epoch 6, batch 7650, loss[loss=0.2428, simple_loss=0.3012, pruned_loss=0.0922, over 21597.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3188, pruned_loss=0.08902, over 4277749.32 frames. ], batch size: 212, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:08:53,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=960798.0, ans=0.0 2023-06-21 12:09:01,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.034e+02 3.412e+02 4.046e+02 6.566e+02, threshold=6.823e+02, percent-clipped=0.0 2023-06-21 12:09:03,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=960798.0, ans=0.125 2023-06-21 12:09:18,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=960858.0, ans=0.035 2023-06-21 12:09:30,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-21 12:09:46,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960918.0, ans=0.1 2023-06-21 12:10:00,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.83 vs. limit=6.0 2023-06-21 12:10:00,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=960978.0, ans=0.125 2023-06-21 12:10:08,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-21 12:10:17,237 INFO [train.py:996] (3/4) Epoch 6, batch 7700, loss[loss=0.196, simple_loss=0.3011, pruned_loss=0.04548, over 20703.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3218, pruned_loss=0.09114, over 4280172.89 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:10:27,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961038.0, ans=0.1 2023-06-21 12:11:46,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=961278.0, ans=0.2 2023-06-21 12:11:52,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961278.0, ans=0.1 2023-06-21 12:11:59,166 INFO [train.py:996] (3/4) Epoch 6, batch 7750, loss[loss=0.253, simple_loss=0.3278, pruned_loss=0.08908, over 21367.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3253, pruned_loss=0.09004, over 4272525.08 frames. 
], batch size: 131, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:12:19,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-21 12:12:30,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=961398.0, ans=0.0 2023-06-21 12:12:32,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961398.0, ans=0.1 2023-06-21 12:12:33,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=961398.0, ans=0.2 2023-06-21 12:12:34,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 3.114e+02 3.576e+02 4.204e+02 7.368e+02, threshold=7.152e+02, percent-clipped=1.0 2023-06-21 12:13:05,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961458.0, ans=0.1 2023-06-21 12:13:09,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-21 12:13:24,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=961578.0, ans=0.125 2023-06-21 12:13:29,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=961578.0, ans=0.2 2023-06-21 12:13:30,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=961578.0, ans=0.0 2023-06-21 12:13:39,892 INFO [train.py:996] (3/4) Epoch 6, batch 7800, loss[loss=0.2102, simple_loss=0.2568, pruned_loss=0.0818, over 21360.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.09076, over 4274239.17 frames. ], batch size: 131, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:13:55,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961638.0, ans=0.1 2023-06-21 12:15:18,180 INFO [train.py:996] (3/4) Epoch 6, batch 7850, loss[loss=0.2602, simple_loss=0.317, pruned_loss=0.1017, over 21795.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3192, pruned_loss=0.0896, over 4280197.29 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:15:21,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961938.0, ans=0.125 2023-06-21 12:15:33,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. 
limit=15.0 2023-06-21 12:15:40,437 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:16:02,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 2.926e+02 3.453e+02 4.214e+02 9.317e+02, threshold=6.905e+02, percent-clipped=1.0 2023-06-21 12:16:36,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=962118.0, ans=0.125 2023-06-21 12:16:46,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=962178.0, ans=0.2 2023-06-21 12:17:01,337 INFO [train.py:996] (3/4) Epoch 6, batch 7900, loss[loss=0.2969, simple_loss=0.3917, pruned_loss=0.101, over 21639.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3142, pruned_loss=0.08918, over 4276965.41 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:17:26,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=962298.0, ans=0.125 2023-06-21 12:18:10,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=962418.0, ans=0.125 2023-06-21 12:18:18,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=962418.0, ans=0.125 2023-06-21 12:18:51,623 INFO [train.py:996] (3/4) Epoch 6, batch 7950, loss[loss=0.2591, simple_loss=0.3387, pruned_loss=0.0898, over 21874.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3225, pruned_loss=0.0896, over 4280736.82 frames. ], batch size: 371, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:19:29,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.498e+02 4.372e+02 5.089e+02 1.068e+03, threshold=8.743e+02, percent-clipped=8.0 2023-06-21 12:20:44,509 INFO [train.py:996] (3/4) Epoch 6, batch 8000, loss[loss=0.2927, simple_loss=0.3621, pruned_loss=0.1117, over 21182.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3264, pruned_loss=0.09194, over 4275008.53 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:20:51,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=962838.0, ans=0.0 2023-06-21 12:21:12,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=962898.0, ans=0.125 2023-06-21 12:22:28,327 INFO [train.py:996] (3/4) Epoch 6, batch 8050, loss[loss=0.2158, simple_loss=0.2834, pruned_loss=0.07417, over 21244.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3287, pruned_loss=0.09119, over 4271993.83 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:22:55,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.992e+02 3.416e+02 4.104e+02 8.130e+02, threshold=6.832e+02, percent-clipped=0.0 2023-06-21 12:23:03,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=963258.0, ans=0.0 2023-06-21 12:24:07,613 INFO [train.py:996] (3/4) Epoch 6, batch 8100, loss[loss=0.218, simple_loss=0.2837, pruned_loss=0.07618, over 21659.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3271, pruned_loss=0.09168, over 4278435.18 frames. 
], batch size: 263, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:25:11,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=963558.0, ans=0.0 2023-06-21 12:25:19,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=963618.0, ans=0.125 2023-06-21 12:25:25,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.43 vs. limit=10.0 2023-06-21 12:25:32,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=963678.0, ans=0.125 2023-06-21 12:25:43,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963678.0, ans=0.1 2023-06-21 12:25:49,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=963738.0, ans=0.0 2023-06-21 12:25:50,203 INFO [train.py:996] (3/4) Epoch 6, batch 8150, loss[loss=0.3472, simple_loss=0.4275, pruned_loss=0.1335, over 21484.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3343, pruned_loss=0.09268, over 4278988.80 frames. ], batch size: 507, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:25:53,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=963738.0, ans=0.125 2023-06-21 12:26:02,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=963738.0, ans=0.125 2023-06-21 12:26:23,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=963798.0, ans=0.125 2023-06-21 12:26:36,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.091e+02 3.490e+02 4.370e+02 7.436e+02, threshold=6.980e+02, percent-clipped=1.0 2023-06-21 12:27:26,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-21 12:27:27,544 INFO [train.py:996] (3/4) Epoch 6, batch 8200, loss[loss=0.2086, simple_loss=0.2685, pruned_loss=0.07432, over 21347.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3271, pruned_loss=0.09039, over 4269131.83 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:27:30,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-21 12:27:44,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-21 12:27:46,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.85 vs. limit=10.0 2023-06-21 12:28:51,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=964278.0, ans=0.125 2023-06-21 12:29:07,634 INFO [train.py:996] (3/4) Epoch 6, batch 8250, loss[loss=0.2509, simple_loss=0.3573, pruned_loss=0.07225, over 20770.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3248, pruned_loss=0.09002, over 4270356.77 frames. 
], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:29:16,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=964338.0, ans=15.0 2023-06-21 12:29:52,676 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:29:56,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.891e+02 3.432e+02 4.145e+02 7.025e+02, threshold=6.865e+02, percent-clipped=1.0 2023-06-21 12:30:01,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=964458.0, ans=0.125 2023-06-21 12:30:03,524 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:30:04,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-21 12:30:38,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=964578.0, ans=10.0 2023-06-21 12:30:46,921 INFO [train.py:996] (3/4) Epoch 6, batch 8300, loss[loss=0.2005, simple_loss=0.2791, pruned_loss=0.06091, over 21379.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3256, pruned_loss=0.08789, over 4270446.36 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:31:09,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=964638.0, ans=0.125 2023-06-21 12:31:09,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=964638.0, ans=0.125 2023-06-21 12:31:55,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=964818.0, ans=0.125 2023-06-21 12:32:05,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-21 12:32:06,686 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:32:16,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=964878.0, ans=0.125 2023-06-21 12:32:33,462 INFO [train.py:996] (3/4) Epoch 6, batch 8350, loss[loss=0.2284, simple_loss=0.3143, pruned_loss=0.07122, over 21561.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3246, pruned_loss=0.08588, over 4274214.43 frames. 
], batch size: 195, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:32:37,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=964938.0, ans=0.125 2023-06-21 12:33:03,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=964998.0, ans=0.125 2023-06-21 12:33:17,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.748e+02 3.092e+02 3.699e+02 5.409e+02, threshold=6.184e+02, percent-clipped=0.0 2023-06-21 12:33:40,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=965118.0, ans=0.0 2023-06-21 12:33:47,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=965118.0, ans=0.125 2023-06-21 12:34:14,493 INFO [train.py:996] (3/4) Epoch 6, batch 8400, loss[loss=0.1856, simple_loss=0.269, pruned_loss=0.05108, over 21162.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3234, pruned_loss=0.08425, over 4269954.65 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:34:14,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=965238.0, ans=0.2 2023-06-21 12:34:26,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=965238.0, ans=0.0 2023-06-21 12:34:43,284 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:35:42,896 INFO [train.py:996] (3/4) Epoch 6, batch 8450, loss[loss=0.2393, simple_loss=0.3004, pruned_loss=0.08913, over 21650.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3198, pruned_loss=0.0824, over 4277481.93 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:35:45,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-21 12:36:28,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.529e+02 3.064e+02 3.775e+02 6.261e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-21 12:36:37,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=965658.0, ans=0.035 2023-06-21 12:37:06,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-21 12:37:17,959 INFO [train.py:996] (3/4) Epoch 6, batch 8500, loss[loss=0.2201, simple_loss=0.2783, pruned_loss=0.08094, over 21429.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3172, pruned_loss=0.08419, over 4283476.81 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:37:43,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-21 12:37:47,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=965898.0, ans=0.125 2023-06-21 12:38:00,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. 
limit=22.5 2023-06-21 12:38:13,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=965958.0, ans=0.125 2023-06-21 12:38:58,627 INFO [train.py:996] (3/4) Epoch 6, batch 8550, loss[loss=0.2746, simple_loss=0.3339, pruned_loss=0.1076, over 20053.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.32, pruned_loss=0.08721, over 4284022.81 frames. ], batch size: 702, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:39:47,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.008e+02 3.313e+02 4.045e+02 7.159e+02, threshold=6.625e+02, percent-clipped=3.0 2023-06-21 12:40:00,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-21 12:40:57,901 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:41:10,954 INFO [train.py:996] (3/4) Epoch 6, batch 8600, loss[loss=0.3005, simple_loss=0.3667, pruned_loss=0.1172, over 21431.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3254, pruned_loss=0.08955, over 4280375.99 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:41:13,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=966438.0, ans=10.0 2023-06-21 12:42:45,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-21 12:42:55,494 INFO [train.py:996] (3/4) Epoch 6, batch 8650, loss[loss=0.311, simple_loss=0.39, pruned_loss=0.116, over 21477.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3324, pruned_loss=0.09165, over 4278664.83 frames. ], batch size: 507, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:43:23,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.940e+02 3.541e+02 4.015e+02 7.663e+02, threshold=7.081e+02, percent-clipped=3.0 2023-06-21 12:43:27,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-21 12:43:31,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=966858.0, ans=0.125 2023-06-21 12:44:29,614 INFO [train.py:996] (3/4) Epoch 6, batch 8700, loss[loss=0.2313, simple_loss=0.2873, pruned_loss=0.0876, over 21231.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3226, pruned_loss=0.08761, over 4280525.52 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:44:35,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=967038.0, ans=0.125 2023-06-21 12:44:44,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=967098.0, ans=0.0 2023-06-21 12:46:04,008 INFO [train.py:996] (3/4) Epoch 6, batch 8750, loss[loss=0.2461, simple_loss=0.3016, pruned_loss=0.09533, over 21310.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.318, pruned_loss=0.0877, over 4278154.91 frames. 
], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:46:34,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.061e+02 3.811e+02 4.792e+02 9.884e+02, threshold=7.621e+02, percent-clipped=4.0 2023-06-21 12:47:42,810 INFO [train.py:996] (3/4) Epoch 6, batch 8800, loss[loss=0.3098, simple_loss=0.3827, pruned_loss=0.1185, over 21730.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3258, pruned_loss=0.09049, over 4278689.12 frames. ], batch size: 441, lr: 5.20e-03, grad_scale: 32.0 2023-06-21 12:48:12,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=967758.0, ans=0.09899494936611666 2023-06-21 12:48:32,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967758.0, ans=0.1 2023-06-21 12:48:45,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2023-06-21 12:48:53,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-21 12:49:13,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-21 12:49:18,426 INFO [train.py:996] (3/4) Epoch 6, batch 8850, loss[loss=0.256, simple_loss=0.326, pruned_loss=0.09298, over 21555.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3331, pruned_loss=0.09236, over 4271753.76 frames. ], batch size: 230, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:49:20,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967938.0, ans=0.1 2023-06-21 12:49:32,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=967998.0, ans=0.0 2023-06-21 12:49:45,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-21 12:49:48,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.854e+02 3.378e+02 4.143e+02 7.151e+02, threshold=6.757e+02, percent-clipped=0.0 2023-06-21 12:49:49,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=968058.0, ans=0.125 2023-06-21 12:50:15,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968058.0, ans=0.1 2023-06-21 12:50:26,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-21 12:50:54,516 INFO [train.py:996] (3/4) Epoch 6, batch 8900, loss[loss=0.2331, simple_loss=0.2916, pruned_loss=0.08735, over 21521.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3278, pruned_loss=0.0902, over 4268260.73 frames. 
], batch size: 441, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:51:01,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968238.0, ans=0.1 2023-06-21 12:51:05,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=968238.0, ans=0.125 2023-06-21 12:51:07,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=968238.0, ans=0.125 2023-06-21 12:51:15,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=968298.0, ans=0.125 2023-06-21 12:51:18,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=968298.0, ans=0.125 2023-06-21 12:52:22,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=968478.0, ans=0.0 2023-06-21 12:52:27,951 INFO [train.py:996] (3/4) Epoch 6, batch 8950, loss[loss=0.213, simple_loss=0.2995, pruned_loss=0.0632, over 21605.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3252, pruned_loss=0.08933, over 4269270.14 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:52:38,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968538.0, ans=0.1 2023-06-21 12:52:39,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.19 vs. limit=6.0 2023-06-21 12:53:12,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.104e+02 3.637e+02 4.159e+02 7.258e+02, threshold=7.275e+02, percent-clipped=2.0 2023-06-21 12:53:57,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=968778.0, ans=0.125 2023-06-21 12:53:58,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=968778.0, ans=0.05 2023-06-21 12:54:02,941 INFO [train.py:996] (3/4) Epoch 6, batch 9000, loss[loss=0.2173, simple_loss=0.2777, pruned_loss=0.07849, over 21168.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3196, pruned_loss=0.08868, over 4276769.67 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:54:02,941 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 12:54:25,120 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3599, pruned_loss=0.08239, over 1796401.00 frames. 2023-06-21 12:54:25,121 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 12:55:08,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-21 12:55:26,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=969018.0, ans=0.125 2023-06-21 12:55:40,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=969078.0, ans=0.0 2023-06-21 12:56:01,357 INFO [train.py:996] (3/4) Epoch 6, batch 9050, loss[loss=0.2109, simple_loss=0.2944, pruned_loss=0.06366, over 21685.00 frames. 
], tot_loss[loss=0.2427, simple_loss=0.3157, pruned_loss=0.08481, over 4275960.02 frames. ], batch size: 298, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:56:01,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=969138.0, ans=0.0 2023-06-21 12:56:12,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-21 12:56:26,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=969198.0, ans=0.125 2023-06-21 12:56:38,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=969198.0, ans=0.2 2023-06-21 12:56:41,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.936e+02 3.440e+02 3.853e+02 8.730e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-21 12:56:41,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=969258.0, ans=0.125 2023-06-21 12:56:43,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=969258.0, ans=0.125 2023-06-21 12:56:43,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=969258.0, ans=0.0 2023-06-21 12:57:24,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-21 12:57:43,019 INFO [train.py:996] (3/4) Epoch 6, batch 9100, loss[loss=0.2362, simple_loss=0.3351, pruned_loss=0.06867, over 21732.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3232, pruned_loss=0.08805, over 4279861.16 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:57:45,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=969438.0, ans=0.2 2023-06-21 12:57:52,907 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:59:02,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-21 12:59:16,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=969678.0, ans=0.125 2023-06-21 12:59:27,363 INFO [train.py:996] (3/4) Epoch 6, batch 9150, loss[loss=0.2461, simple_loss=0.3241, pruned_loss=0.08407, over 21433.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3281, pruned_loss=0.0869, over 4272612.46 frames. ], batch size: 160, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:59:37,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-21 12:59:39,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=969738.0, ans=0.125 2023-06-21 12:59:57,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.788e+02 3.229e+02 4.446e+02 7.555e+02, threshold=6.457e+02, percent-clipped=3.0 2023-06-21 13:00:06,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. 
limit=15.0 2023-06-21 13:00:53,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=969978.0, ans=0.2 2023-06-21 13:01:00,364 INFO [train.py:996] (3/4) Epoch 6, batch 9200, loss[loss=0.3403, simple_loss=0.3985, pruned_loss=0.141, over 21375.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3283, pruned_loss=0.08538, over 4276201.48 frames. ], batch size: 507, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:01:08,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=970038.0, ans=0.1 2023-06-21 13:01:25,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=970098.0, ans=10.0 2023-06-21 13:01:53,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=970218.0, ans=0.0 2023-06-21 13:02:28,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=970278.0, ans=0.1 2023-06-21 13:02:36,672 INFO [train.py:996] (3/4) Epoch 6, batch 9250, loss[loss=0.2358, simple_loss=0.3004, pruned_loss=0.0856, over 21433.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.334, pruned_loss=0.08825, over 4267878.65 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:02:43,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.41 vs. limit=22.5 2023-06-21 13:02:55,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=970398.0, ans=0.0 2023-06-21 13:03:05,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=970398.0, ans=0.0 2023-06-21 13:03:07,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.070e+02 3.502e+02 4.094e+02 6.605e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-21 13:03:13,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=970458.0, ans=0.0 2023-06-21 13:03:29,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-21 13:03:42,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=970518.0, ans=0.0 2023-06-21 13:04:11,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970578.0, ans=0.125 2023-06-21 13:04:13,724 INFO [train.py:996] (3/4) Epoch 6, batch 9300, loss[loss=0.2267, simple_loss=0.305, pruned_loss=0.07413, over 21275.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3285, pruned_loss=0.0881, over 4265789.67 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:04:47,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=970698.0, ans=0.0 2023-06-21 13:05:33,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.71 vs. limit=10.0 2023-06-21 13:05:50,448 INFO [train.py:996] (3/4) Epoch 6, batch 9350, loss[loss=0.2631, simple_loss=0.346, pruned_loss=0.09013, over 21879.00 frames. 
2023-06-21 13:05:50,448 INFO [train.py:996] (3/4) Epoch 6, batch 9350, loss[loss=0.2631, simple_loss=0.346, pruned_loss=0.09013, over 21879.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3315, pruned_loss=0.08892, over 4267225.49 frames. ], batch size: 371, lr: 5.19e-03, grad_scale: 32.0
2023-06-21 13:06:23,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=970998.0, ans=0.0
2023-06-21 13:06:31,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.002e+02 3.519e+02 4.065e+02 7.578e+02, threshold=7.038e+02, percent-clipped=1.0
2023-06-21 13:06:55,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=971118.0, ans=0.125
2023-06-21 13:06:57,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=971118.0, ans=0.125
2023-06-21 13:07:26,224 INFO [train.py:996] (3/4) Epoch 6, batch 9400, loss[loss=0.2131, simple_loss=0.2766, pruned_loss=0.07476, over 21247.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3337, pruned_loss=0.08993, over 4262079.20 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0
2023-06-21 13:08:17,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=971358.0, ans=0.2
2023-06-21 13:08:39,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0
2023-06-21 13:08:56,682 INFO [train.py:996] (3/4) Epoch 6, batch 9450, loss[loss=0.22, simple_loss=0.2771, pruned_loss=0.08151, over 21213.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3251, pruned_loss=0.08856, over 4258321.08 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0
2023-06-21 13:09:41,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.152e+02 3.709e+02 4.839e+02 7.749e+02, threshold=7.417e+02, percent-clipped=1.0
2023-06-21 13:10:16,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=971778.0, ans=0.2
2023-06-21 13:10:20,492 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 13:10:32,103 INFO [train.py:996] (3/4) Epoch 6, batch 9500, loss[loss=0.2039, simple_loss=0.2747, pruned_loss=0.0665, over 21329.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3175, pruned_loss=0.08703, over 4248049.88 frames. ], batch size: 176, lr: 5.19e-03, grad_scale: 8.0
2023-06-21 13:11:37,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=972018.0, ans=0.125
2023-06-21 13:12:07,946 INFO [train.py:996] (3/4) Epoch 6, batch 9550, loss[loss=0.2827, simple_loss=0.3573, pruned_loss=0.1041, over 21606.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3224, pruned_loss=0.08955, over 4254442.43 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 8.0
2023-06-21 13:12:08,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=972138.0, ans=0.0
2023-06-21 13:13:00,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.847e+02 3.325e+02 4.189e+02 8.114e+02, threshold=6.651e+02, percent-clipped=1.0
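[Editor's note] The [scaling.py:962] Whitening entries compare a per-module statistic against a limit (metric=6.05 vs. limit=15.0 above), measuring how far the activations' covariance is from a multiple of the identity. A standard statistic with exactly this behaviour is the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue: it equals 1.0 for perfectly whitened features and grows as the eigenvalue spectrum spreads. The sketch below computes that ratio; take it as an illustration of the idea rather than the exact formula in scaling.py.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for one module."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]     # channel covariance
    eigs = torch.linalg.eigvalsh(cov)  # real, since cov is symmetric
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

white = torch.randn(1000, 256)         # roughly white features
mixed = white @ torch.randn(256, 256)  # deliberately correlated features
print(whitening_metric(white))         # near 1.0
print(whitening_metric(mixed))         # large, e.g. well above limit=15.0
```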
2023-06-21 13:13:43,388 INFO [train.py:996] (3/4) Epoch 6, batch 9600, loss[loss=0.2213, simple_loss=0.2897, pruned_loss=0.0764, over 21882.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3255, pruned_loss=0.09183, over 4267046.72 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 16.0
2023-06-21 13:13:45,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=972438.0, ans=0.125
2023-06-21 13:14:22,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0
2023-06-21 13:15:06,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972678.0, ans=0.125
2023-06-21 13:15:25,385 INFO [train.py:996] (3/4) Epoch 6, batch 9650, loss[loss=0.2518, simple_loss=0.3261, pruned_loss=0.08878, over 21715.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3255, pruned_loss=0.09108, over 4272183.00 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0
2023-06-21 13:15:39,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=972738.0, ans=15.0
2023-06-21 13:16:12,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.974e+02 3.476e+02 4.202e+02 8.291e+02, threshold=6.952e+02, percent-clipped=2.0
2023-06-21 13:16:29,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=972918.0, ans=0.0
2023-06-21 13:17:06,087 INFO [train.py:996] (3/4) Epoch 6, batch 9700, loss[loss=0.2134, simple_loss=0.2932, pruned_loss=0.0668, over 21782.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3287, pruned_loss=0.09171, over 4279492.08 frames. ], batch size: 298, lr: 5.18e-03, grad_scale: 16.0
2023-06-21 13:17:34,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0
2023-06-21 13:17:40,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5
2023-06-21 13:17:44,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=973158.0, ans=0.09899494936611666
2023-06-21 13:17:51,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=973158.0, ans=0.0
2023-06-21 13:18:00,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=973218.0, ans=0.125
2023-06-21 13:18:00,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=973218.0, ans=0.125
2023-06-21 13:18:15,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=973278.0, ans=0.2
2023-06-21 13:18:31,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973278.0, ans=0.1
2023-06-21 13:18:35,242 INFO [train.py:996] (3/4) Epoch 6, batch 9750, loss[loss=0.297, simple_loss=0.3736, pruned_loss=0.1103, over 21869.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3226, pruned_loss=0.09079, over 4273491.48 frames.
], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:18:52,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=973338.0, ans=0.2 2023-06-21 13:19:10,360 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:19:16,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=973398.0, ans=0.0 2023-06-21 13:19:23,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.880e+02 3.330e+02 4.100e+02 8.108e+02, threshold=6.660e+02, percent-clipped=1.0 2023-06-21 13:19:52,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=973578.0, ans=0.0 2023-06-21 13:20:09,969 INFO [train.py:996] (3/4) Epoch 6, batch 9800, loss[loss=0.2628, simple_loss=0.3211, pruned_loss=0.1023, over 21655.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3222, pruned_loss=0.09048, over 4271842.44 frames. ], batch size: 230, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:20:22,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-21 13:20:26,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-21 13:20:31,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973698.0, ans=0.1 2023-06-21 13:20:44,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-21 13:20:55,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-21 13:21:22,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973818.0, ans=0.1 2023-06-21 13:21:40,039 INFO [train.py:996] (3/4) Epoch 6, batch 9850, loss[loss=0.2019, simple_loss=0.2652, pruned_loss=0.06927, over 21791.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3184, pruned_loss=0.08973, over 4276571.06 frames. ], batch size: 102, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:22:32,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.858e+02 3.118e+02 3.826e+02 5.863e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-21 13:22:49,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=974118.0, ans=0.0 2023-06-21 13:23:15,701 INFO [train.py:996] (3/4) Epoch 6, batch 9900, loss[loss=0.2601, simple_loss=0.3216, pruned_loss=0.09934, over 21330.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3151, pruned_loss=0.08954, over 4261877.68 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:23:15,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=974238.0, ans=10.0 2023-06-21 13:23:57,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=15.0 2023-06-21 13:24:20,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-21 13:24:24,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=974418.0, ans=0.125 2023-06-21 13:24:56,239 INFO [train.py:996] (3/4) Epoch 6, batch 9950, loss[loss=0.2532, simple_loss=0.3013, pruned_loss=0.1025, over 21578.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3168, pruned_loss=0.0919, over 4267759.56 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:25:02,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-21 13:25:37,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=974658.0, ans=0.2 2023-06-21 13:25:39,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 2.967e+02 3.408e+02 4.209e+02 6.972e+02, threshold=6.817e+02, percent-clipped=1.0 2023-06-21 13:26:29,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=974778.0, ans=0.125 2023-06-21 13:26:31,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.23 vs. limit=6.0 2023-06-21 13:26:32,388 INFO [train.py:996] (3/4) Epoch 6, batch 10000, loss[loss=0.2375, simple_loss=0.2991, pruned_loss=0.08797, over 21164.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3118, pruned_loss=0.09018, over 4266925.47 frames. ], batch size: 143, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:27:05,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=974898.0, ans=0.0 2023-06-21 13:27:05,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=974898.0, ans=0.2 2023-06-21 13:27:09,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.72 vs. limit=22.5 2023-06-21 13:28:07,388 INFO [train.py:996] (3/4) Epoch 6, batch 10050, loss[loss=0.1952, simple_loss=0.2687, pruned_loss=0.06087, over 21399.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3134, pruned_loss=0.09044, over 4266766.05 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:28:18,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=975138.0, ans=0.0 2023-06-21 13:28:38,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. 
limit=22.5 2023-06-21 13:28:43,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975198.0, ans=0.125 2023-06-21 13:28:48,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=975258.0, ans=0.0 2023-06-21 13:28:51,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.710e+02 3.231e+02 4.212e+02 7.416e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-21 13:29:51,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=975378.0, ans=0.0 2023-06-21 13:29:53,556 INFO [train.py:996] (3/4) Epoch 6, batch 10100, loss[loss=0.2267, simple_loss=0.3101, pruned_loss=0.0716, over 21759.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3121, pruned_loss=0.0885, over 4265418.22 frames. ], batch size: 351, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:29:54,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-21 13:30:12,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=975498.0, ans=0.2 2023-06-21 13:30:15,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=975498.0, ans=0.125 2023-06-21 13:30:16,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-21 13:30:17,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=975498.0, ans=0.0 2023-06-21 13:30:30,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=975558.0, ans=0.125 2023-06-21 13:30:53,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=975618.0, ans=0.0 2023-06-21 13:31:29,493 INFO [train.py:996] (3/4) Epoch 6, batch 10150, loss[loss=0.2406, simple_loss=0.3167, pruned_loss=0.08226, over 21513.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3177, pruned_loss=0.0903, over 4265737.38 frames. ], batch size: 389, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:31:35,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=975738.0, ans=0.05 2023-06-21 13:31:36,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-21 13:31:43,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=975798.0, ans=0.125 2023-06-21 13:31:53,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=15.0 2023-06-21 13:32:04,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.136e+02 3.616e+02 4.302e+02 7.230e+02, threshold=7.231e+02, percent-clipped=1.0 2023-06-21 13:32:29,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975918.0, ans=0.125 2023-06-21 13:32:37,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-21 13:32:38,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=975978.0, ans=0.125 2023-06-21 13:32:57,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=975978.0, ans=0.025 2023-06-21 13:33:04,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-21 13:33:04,722 INFO [train.py:996] (3/4) Epoch 6, batch 10200, loss[loss=0.2355, simple_loss=0.3183, pruned_loss=0.07635, over 21839.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3163, pruned_loss=0.08765, over 4256967.47 frames. ], batch size: 317, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:33:28,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-21 13:34:10,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-21 13:34:40,824 INFO [train.py:996] (3/4) Epoch 6, batch 10250, loss[loss=0.2161, simple_loss=0.2792, pruned_loss=0.07653, over 21808.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.311, pruned_loss=0.08143, over 4258751.65 frames. ], batch size: 102, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:35:21,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.417e+02 2.778e+02 3.535e+02 6.658e+02, threshold=5.557e+02, percent-clipped=0.0 2023-06-21 13:35:32,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=976458.0, ans=0.0 2023-06-21 13:35:52,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=976518.0, ans=0.2 2023-06-21 13:35:53,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-21 13:36:08,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-21 13:36:18,478 INFO [train.py:996] (3/4) Epoch 6, batch 10300, loss[loss=0.2654, simple_loss=0.3539, pruned_loss=0.08841, over 21892.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3159, pruned_loss=0.08345, over 4265328.30 frames. ], batch size: 372, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:36:21,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=976638.0, ans=0.0 2023-06-21 13:36:29,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.62 vs. 
limit=10.0 2023-06-21 13:36:31,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=976638.0, ans=0.0 2023-06-21 13:36:54,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=976758.0, ans=0.2 2023-06-21 13:37:24,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-21 13:37:51,333 INFO [train.py:996] (3/4) Epoch 6, batch 10350, loss[loss=0.2029, simple_loss=0.2826, pruned_loss=0.06158, over 21658.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3167, pruned_loss=0.08372, over 4263065.89 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:38:38,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=977058.0, ans=0.125 2023-06-21 13:38:41,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.013e+02 3.424e+02 4.062e+02 6.181e+02, threshold=6.848e+02, percent-clipped=5.0 2023-06-21 13:38:58,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-21 13:39:12,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=977178.0, ans=0.125 2023-06-21 13:39:13,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=977178.0, ans=0.125 2023-06-21 13:39:27,353 INFO [train.py:996] (3/4) Epoch 6, batch 10400, loss[loss=0.2048, simple_loss=0.2685, pruned_loss=0.07059, over 21630.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3118, pruned_loss=0.08207, over 4254235.55 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:39:29,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=977238.0, ans=0.0 2023-06-21 13:40:19,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=977358.0, ans=0.125 2023-06-21 13:41:09,999 INFO [train.py:996] (3/4) Epoch 6, batch 10450, loss[loss=0.2424, simple_loss=0.2985, pruned_loss=0.09313, over 16985.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3162, pruned_loss=0.08547, over 4257367.82 frames. ], batch size: 61, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:41:11,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=977538.0, ans=0.125 2023-06-21 13:42:01,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.285e+02 3.743e+02 4.607e+02 9.328e+02, threshold=7.486e+02, percent-clipped=7.0 2023-06-21 13:42:10,853 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:42:52,012 INFO [train.py:996] (3/4) Epoch 6, batch 10500, loss[loss=0.2091, simple_loss=0.2844, pruned_loss=0.06689, over 21173.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.316, pruned_loss=0.08435, over 4253405.12 frames. ], batch size: 548, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:42:53,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-21 13:43:54,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:44:18,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-21 13:44:27,335 INFO [train.py:996] (3/4) Epoch 6, batch 10550, loss[loss=0.2235, simple_loss=0.282, pruned_loss=0.08255, over 21335.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3108, pruned_loss=0.08466, over 4241833.36 frames. ], batch size: 473, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:44:53,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978198.0, ans=0.1 2023-06-21 13:45:09,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=978258.0, ans=0.0 2023-06-21 13:45:10,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.50 vs. limit=12.0 2023-06-21 13:45:12,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.729e+02 3.054e+02 3.524e+02 6.998e+02, threshold=6.108e+02, percent-clipped=0.0 2023-06-21 13:46:03,784 INFO [train.py:996] (3/4) Epoch 6, batch 10600, loss[loss=0.2432, simple_loss=0.3414, pruned_loss=0.07251, over 19903.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3054, pruned_loss=0.08288, over 4246136.16 frames. ], batch size: 703, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:46:41,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978498.0, ans=0.1 2023-06-21 13:47:44,897 INFO [train.py:996] (3/4) Epoch 6, batch 10650, loss[loss=0.1822, simple_loss=0.2478, pruned_loss=0.05836, over 21214.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3088, pruned_loss=0.08136, over 4245564.32 frames. ], batch size: 159, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:47:49,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=978738.0, ans=0.0 2023-06-21 13:48:26,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.894e+02 3.773e+02 4.928e+02 8.046e+02, threshold=7.546e+02, percent-clipped=12.0 2023-06-21 13:48:45,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-21 13:48:51,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=978918.0, ans=0.0 2023-06-21 13:48:58,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=978918.0, ans=0.2 2023-06-21 13:49:22,175 INFO [train.py:996] (3/4) Epoch 6, batch 10700, loss[loss=0.2892, simple_loss=0.3571, pruned_loss=0.1106, over 21911.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3079, pruned_loss=0.08136, over 4244761.19 frames. ], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:50:01,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=22.5
2023-06-21 13:50:20,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=979218.0, ans=0.125
2023-06-21 13:50:36,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=979218.0, ans=0.125
2023-06-21 13:51:00,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0
2023-06-21 13:51:05,748 INFO [train.py:996] (3/4) Epoch 6, batch 10750, loss[loss=0.2619, simple_loss=0.3501, pruned_loss=0.08682, over 21795.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3203, pruned_loss=0.08721, over 4255673.58 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 16.0
2023-06-21 13:51:16,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=979338.0, ans=0.125
2023-06-21 13:51:39,593 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 13:51:42,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.198e+02 3.607e+02 4.478e+02 7.932e+02, threshold=7.214e+02, percent-clipped=1.0
2023-06-21 13:52:43,770 INFO [train.py:996] (3/4) Epoch 6, batch 10800, loss[loss=0.2586, simple_loss=0.3331, pruned_loss=0.09203, over 21822.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3251, pruned_loss=0.08771, over 4260278.19 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 32.0
2023-06-21 13:52:50,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=979638.0, ans=0.125
2023-06-21 13:53:37,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0
2023-06-21 13:54:07,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=979878.0, ans=0.07
2023-06-21 13:54:12,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5
2023-06-21 13:54:14,990 INFO [train.py:996] (3/4) Epoch 6, batch 10850, loss[loss=0.2588, simple_loss=0.316, pruned_loss=0.1008, over 21486.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.327, pruned_loss=0.08845, over 4260961.01 frames. ], batch size: 509, lr: 5.17e-03, grad_scale: 32.0
2023-06-21 13:54:33,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979998.0, ans=0.1
2023-06-21 13:54:33,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=979998.0, ans=0.09899494936611666
2023-06-21 13:54:54,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=980058.0, ans=0.0
2023-06-21 13:55:06,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 2.788e+02 3.255e+02 3.917e+02 5.822e+02, threshold=6.509e+02, percent-clipped=0.0
2023-06-21 13:55:16,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=980118.0, ans=0.07
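[Editor's note] The grad_scale field in the train.py summaries (moving among 8.0, 16.0 and 32.0 in this stretch, always by factors of two) is the dynamic loss scale of fp16 mixed-precision training: the scale is halved when gradients overflow and grown again after a run of well-behaved steps. A minimal PyTorch AMP loop with the same bookkeeping, using a toy model and random data as placeholders:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(80, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, enabled=(device == "cuda"))

for step in range(100):
    x = torch.randn(8, 80, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # halves the scale on overflow, doubles it
                                   # after enough consecutive good steps
    # scaler.get_scale() is the number the log reports as grad_scale
```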
2023-06-21 13:55:51,256 INFO [train.py:996] (3/4) Epoch 6, batch 10900, loss[loss=0.2467, simple_loss=0.3131, pruned_loss=0.09014, over 21730.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3189, pruned_loss=0.08638, over 4246008.30 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0
2023-06-21 13:56:34,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=980358.0, ans=0.125
2023-06-21 13:57:25,448 INFO [train.py:996] (3/4) Epoch 6, batch 10950, loss[loss=0.2352, simple_loss=0.2973, pruned_loss=0.08657, over 21473.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3135, pruned_loss=0.08414, over 4247742.86 frames. ], batch size: 389, lr: 5.16e-03, grad_scale: 32.0
2023-06-21 13:57:54,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980598.0, ans=0.1
2023-06-21 13:58:00,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980598.0, ans=0.1
2023-06-21 13:58:15,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.774e+02 3.261e+02 3.678e+02 5.101e+02, threshold=6.522e+02, percent-clipped=0.0
2023-06-21 13:58:16,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=980658.0, ans=0.125
2023-06-21 13:58:26,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=980718.0, ans=0.02
2023-06-21 13:58:27,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0
2023-06-21 13:58:59,434 INFO [train.py:996] (3/4) Epoch 6, batch 11000, loss[loss=0.2959, simple_loss=0.3607, pruned_loss=0.1155, over 21732.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3126, pruned_loss=0.08551, over 4254932.68 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0
2023-06-21 13:59:00,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0
2023-06-21 13:59:35,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=980898.0, ans=0.0
2023-06-21 13:59:51,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980958.0, ans=0.1
2023-06-21 13:59:55,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=980958.0, ans=0.0
2023-06-21 14:00:32,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=981078.0, ans=0.0
2023-06-21 14:00:36,195 INFO [train.py:996] (3/4) Epoch 6, batch 11050, loss[loss=0.2239, simple_loss=0.287, pruned_loss=0.0804, over 21858.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3116, pruned_loss=0.08639, over 4263101.23 frames. ], batch size: 98, lr: 5.16e-03, grad_scale: 16.0
2023-06-21 14:01:28,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.910e+02 3.183e+02 3.721e+02 5.949e+02, threshold=6.365e+02, percent-clipped=0.0
2023-06-21 14:01:52,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=981378.0, ans=0.2
2023-06-21 14:02:10,604 INFO [train.py:996] (3/4) Epoch 6, batch 11100, loss[loss=0.2876, simple_loss=0.3413, pruned_loss=0.117, over 21291.00 frames.
], tot_loss[loss=0.2411, simple_loss=0.3094, pruned_loss=0.08635, over 4254130.58 frames. ], batch size: 471, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:02:44,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-21 14:03:48,514 INFO [train.py:996] (3/4) Epoch 6, batch 11150, loss[loss=0.2341, simple_loss=0.293, pruned_loss=0.08764, over 21882.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3067, pruned_loss=0.086, over 4251111.27 frames. ], batch size: 107, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:04:41,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.660e+02 3.131e+02 3.688e+02 5.663e+02, threshold=6.262e+02, percent-clipped=0.0 2023-06-21 14:04:57,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=981918.0, ans=0.1 2023-06-21 14:05:00,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=981918.0, ans=0.125 2023-06-21 14:05:05,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=981978.0, ans=0.035 2023-06-21 14:05:24,955 INFO [train.py:996] (3/4) Epoch 6, batch 11200, loss[loss=0.2469, simple_loss=0.3047, pruned_loss=0.09454, over 21759.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3053, pruned_loss=0.08532, over 4250829.83 frames. ], batch size: 317, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 14:06:19,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=982158.0, ans=0.2 2023-06-21 14:06:35,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=982278.0, ans=0.2 2023-06-21 14:06:47,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-21 14:06:57,263 INFO [train.py:996] (3/4) Epoch 6, batch 11250, loss[loss=0.2815, simple_loss=0.3503, pruned_loss=0.1064, over 21782.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3053, pruned_loss=0.08533, over 4252833.81 frames. ], batch size: 118, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:07:03,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=982338.0, ans=0.2 2023-06-21 14:07:39,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=982458.0, ans=0.125 2023-06-21 14:07:45,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=982458.0, ans=0.2 2023-06-21 14:07:46,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.620e+02 2.906e+02 3.338e+02 5.205e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 14:07:58,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. 
limit=12.0 2023-06-21 14:08:02,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=982518.0, ans=0.0 2023-06-21 14:08:24,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=982578.0, ans=0.0 2023-06-21 14:08:28,286 INFO [train.py:996] (3/4) Epoch 6, batch 11300, loss[loss=0.2181, simple_loss=0.2886, pruned_loss=0.07382, over 20808.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3068, pruned_loss=0.08576, over 4257652.90 frames. ], batch size: 609, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:08:30,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=982638.0, ans=0.125 2023-06-21 14:08:55,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=982698.0, ans=0.125 2023-06-21 14:08:58,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982698.0, ans=0.1 2023-06-21 14:09:34,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=982818.0, ans=0.0 2023-06-21 14:09:43,959 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-21 14:10:03,369 INFO [train.py:996] (3/4) Epoch 6, batch 11350, loss[loss=0.2041, simple_loss=0.315, pruned_loss=0.04656, over 20759.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3061, pruned_loss=0.08361, over 4263630.79 frames. ], batch size: 607, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:10:10,229 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:10:39,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-21 14:10:51,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=983058.0, ans=0.125 2023-06-21 14:10:53,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.788e+02 3.178e+02 3.739e+02 7.652e+02, threshold=6.355e+02, percent-clipped=2.0 2023-06-21 14:11:05,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=983118.0, ans=0.0 2023-06-21 14:11:08,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=983118.0, ans=0.0 2023-06-21 14:11:35,962 INFO [train.py:996] (3/4) Epoch 6, batch 11400, loss[loss=0.2302, simple_loss=0.3171, pruned_loss=0.07164, over 21718.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3118, pruned_loss=0.08576, over 4260189.70 frames. 
], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:11:53,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=983238.0, ans=0.0 2023-06-21 14:12:00,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=983238.0, ans=0.035 2023-06-21 14:12:31,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=983358.0, ans=0.0 2023-06-21 14:13:18,813 INFO [train.py:996] (3/4) Epoch 6, batch 11450, loss[loss=0.2362, simple_loss=0.3185, pruned_loss=0.07694, over 21699.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3133, pruned_loss=0.08507, over 4260641.62 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:14:00,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=983658.0, ans=0.0 2023-06-21 14:14:03,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.362e+02 2.838e+02 3.448e+02 4.254e+02 7.137e+02, threshold=6.896e+02, percent-clipped=4.0 2023-06-21 14:14:25,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=983718.0, ans=0.125 2023-06-21 14:14:37,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=983778.0, ans=0.125 2023-06-21 14:14:51,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=983778.0, ans=0.125 2023-06-21 14:14:55,038 INFO [train.py:996] (3/4) Epoch 6, batch 11500, loss[loss=0.2299, simple_loss=0.3178, pruned_loss=0.07098, over 21769.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3188, pruned_loss=0.08788, over 4264789.41 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:16:23,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=984078.0, ans=0.0 2023-06-21 14:16:37,022 INFO [train.py:996] (3/4) Epoch 6, batch 11550, loss[loss=0.2788, simple_loss=0.3766, pruned_loss=0.09048, over 21850.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3256, pruned_loss=0.08836, over 4268308.67 frames. ], batch size: 316, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:16:45,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=984138.0, ans=0.125 2023-06-21 14:17:16,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=984258.0, ans=0.2 2023-06-21 14:17:17,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=984258.0, ans=0.04949747468305833 2023-06-21 14:17:23,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.955e+02 3.350e+02 4.139e+02 7.597e+02, threshold=6.701e+02, percent-clipped=2.0 2023-06-21 14:18:01,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984378.0, ans=0.1 2023-06-21 14:18:09,014 INFO [train.py:996] (3/4) Epoch 6, batch 11600, loss[loss=0.2825, simple_loss=0.3672, pruned_loss=0.09896, over 21389.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3398, pruned_loss=0.08992, over 4262561.22 frames. 
], batch size: 194, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:18:23,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984498.0, ans=0.1 2023-06-21 14:18:43,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984558.0, ans=0.1 2023-06-21 14:19:45,203 INFO [train.py:996] (3/4) Epoch 6, batch 11650, loss[loss=0.207, simple_loss=0.2833, pruned_loss=0.06538, over 21805.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3447, pruned_loss=0.09047, over 4266496.31 frames. ], batch size: 124, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:20:29,595 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 3.000e+02 3.616e+02 4.303e+02 7.688e+02, threshold=7.232e+02, percent-clipped=3.0 2023-06-21 14:21:21,202 INFO [train.py:996] (3/4) Epoch 6, batch 11700, loss[loss=0.231, simple_loss=0.2931, pruned_loss=0.08442, over 21654.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3358, pruned_loss=0.09068, over 4271443.94 frames. ], batch size: 282, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:22:20,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=985218.0, ans=0.1 2023-06-21 14:22:56,815 INFO [train.py:996] (3/4) Epoch 6, batch 11750, loss[loss=0.2403, simple_loss=0.3097, pruned_loss=0.08541, over 21881.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3277, pruned_loss=0.09, over 4266609.62 frames. ], batch size: 372, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:22:58,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=985338.0, ans=0.125 2023-06-21 14:23:33,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=985458.0, ans=0.0 2023-06-21 14:23:57,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.950e+02 3.559e+02 4.361e+02 6.685e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 14:24:13,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.77 vs. limit=5.0 2023-06-21 14:24:33,552 INFO [train.py:996] (3/4) Epoch 6, batch 11800, loss[loss=0.2597, simple_loss=0.3379, pruned_loss=0.09076, over 19989.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3285, pruned_loss=0.09109, over 4259546.18 frames. 
], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:24:37,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=985638.0, ans=0.0 2023-06-21 14:25:36,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=985758.0, ans=0.125 2023-06-21 14:25:53,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=985878.0, ans=0.1 2023-06-21 14:25:56,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=985878.0, ans=0.125 2023-06-21 14:25:59,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=985878.0, ans=0.1 2023-06-21 14:26:04,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-21 14:26:14,994 INFO [train.py:996] (3/4) Epoch 6, batch 11850, loss[loss=0.2605, simple_loss=0.3237, pruned_loss=0.0986, over 21329.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3286, pruned_loss=0.09007, over 4261863.59 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:26:30,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=985998.0, ans=0.05 2023-06-21 14:27:10,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.768e+02 3.132e+02 3.956e+02 6.532e+02, threshold=6.263e+02, percent-clipped=0.0 2023-06-21 14:27:48,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-21 14:27:50,721 INFO [train.py:996] (3/4) Epoch 6, batch 11900, loss[loss=0.2577, simple_loss=0.3426, pruned_loss=0.08642, over 21835.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3278, pruned_loss=0.08766, over 4268478.64 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:28:45,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=986358.0, ans=0.04949747468305833 2023-06-21 14:29:22,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=986478.0, ans=0.125 2023-06-21 14:29:23,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=986478.0, ans=0.0 2023-06-21 14:29:24,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=986478.0, ans=0.0 2023-06-21 14:29:27,037 INFO [train.py:996] (3/4) Epoch 6, batch 11950, loss[loss=0.163, simple_loss=0.2408, pruned_loss=0.04264, over 21194.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3275, pruned_loss=0.08406, over 4270913.89 frames. 
], batch size: 143, lr: 5.15e-03, grad_scale: 16.0
2023-06-21 14:29:32,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=986538.0, ans=0.2
2023-06-21 14:29:43,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=986538.0, ans=0.09899494936611666
2023-06-21 14:30:13,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0
2023-06-21 14:30:14,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=986658.0, ans=0.125
2023-06-21 14:30:21,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0
2023-06-21 14:30:23,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.615e+02 3.254e+02 4.193e+02 8.163e+02, threshold=6.508e+02, percent-clipped=5.0
2023-06-21 14:30:56,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=986778.0, ans=0.0
2023-06-21 14:31:03,745 INFO [train.py:996] (3/4) Epoch 6, batch 12000, loss[loss=0.235, simple_loss=0.2899, pruned_loss=0.09003, over 15609.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3225, pruned_loss=0.08255, over 4262543.46 frames. ], batch size: 61, lr: 5.15e-03, grad_scale: 32.0
2023-06-21 14:31:03,745 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 14:31:23,358 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2642, simple_loss=0.3586, pruned_loss=0.08492, over 1796401.00 frames.
2023-06-21 14:31:23,359 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-21 14:31:53,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.00 vs. limit=15.0
2023-06-21 14:33:01,527 INFO [train.py:996] (3/4) Epoch 6, batch 12050, loss[loss=0.2485, simple_loss=0.3078, pruned_loss=0.09464, over 21366.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3207, pruned_loss=0.08571, over 4263241.53 frames. ], batch size: 143, lr: 5.15e-03, grad_scale: 32.0
2023-06-21 14:33:05,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=987138.0, ans=0.125
2023-06-21 14:33:24,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=987138.0, ans=0.0
2023-06-21 14:33:30,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=987198.0, ans=0.0
2023-06-21 14:33:30,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987198.0, ans=0.1
2023-06-21 14:33:42,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=987258.0, ans=0.0
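[Editor's note] The [train.py:1019]/[train.py:1028]/[train.py:1029] sequence above is the periodic validation pass: a no-grad sweep over the dev set at a fixed batch interval, a frame-weighted loss, and a peak-GPU-memory readout. A sketch of that pattern, where `model`, `dev_loader`, and the `compute_loss` helper are placeholders, not icefall's actual API:

```python
import torch

def validate(model, dev_loader, device="cuda"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            # compute_loss is a hypothetical helper returning the batch loss
            # and the number of acoustic frames it covers
            loss, num_frames = model.compute_loss(batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")
    return tot_loss / max(tot_frames, 1.0)  # the "validation: loss=..." value
```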
2023-06-21 14:33:46,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0
2023-06-21 14:33:53,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.010e+02 3.446e+02 4.017e+02 8.146e+02, threshold=6.892e+02, percent-clipped=4.0
2023-06-21 14:34:07,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=987318.0, ans=0.125
2023-06-21 14:34:13,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=987318.0, ans=0.125
2023-06-21 14:34:43,742 INFO [train.py:996] (3/4) Epoch 6, batch 12100, loss[loss=0.287, simple_loss=0.3634, pruned_loss=0.1053, over 21864.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3258, pruned_loss=0.09056, over 4269681.75 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 32.0
2023-06-21 14:34:58,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=987438.0, ans=0.125
2023-06-21 14:35:12,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=987498.0, ans=0.125
2023-06-21 14:36:15,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0
2023-06-21 14:36:27,296 INFO [train.py:996] (3/4) Epoch 6, batch 12150, loss[loss=0.2768, simple_loss=0.3996, pruned_loss=0.07704, over 19720.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3299, pruned_loss=0.0899, over 4262738.41 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 16.0
2023-06-21 14:36:57,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987798.0, ans=0.1
2023-06-21 14:37:01,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=987858.0, ans=0.2
2023-06-21 14:37:12,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.69 vs. limit=15.0
2023-06-21 14:37:19,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.020e+02 3.614e+02 4.015e+02 8.551e+02, threshold=7.228e+02, percent-clipped=5.0
2023-06-21 14:37:37,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=987918.0, ans=0.0
2023-06-21 14:37:37,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=987918.0, ans=0.2
2023-06-21 14:38:01,607 INFO [train.py:996] (3/4) Epoch 6, batch 12200, loss[loss=0.2383, simple_loss=0.293, pruned_loss=0.09176, over 21845.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3261, pruned_loss=0.08852, over 4260066.26 frames. ], batch size: 98, lr: 5.15e-03, grad_scale: 16.0
2023-06-21 14:39:16,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=988218.0, ans=0.0
2023-06-21 14:39:16,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=988218.0, ans=0.0
2023-06-21 14:39:36,219 INFO [train.py:996] (3/4) Epoch 6, batch 12250, loss[loss=0.1769, simple_loss=0.2505, pruned_loss=0.05166, over 21527.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3166, pruned_loss=0.08469, over 4265061.77 frames.
], batch size: 195, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:39:43,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=988338.0, ans=0.125 2023-06-21 14:39:45,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=988338.0, ans=0.125 2023-06-21 14:39:52,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=988398.0, ans=0.0 2023-06-21 14:39:59,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988398.0, ans=0.1 2023-06-21 14:40:09,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=988458.0, ans=0.125 2023-06-21 14:40:17,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=988458.0, ans=0.125 2023-06-21 14:40:22,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 2.454e+02 2.966e+02 3.954e+02 7.953e+02, threshold=5.931e+02, percent-clipped=3.0 2023-06-21 14:40:47,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=988518.0, ans=0.2 2023-06-21 14:40:53,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988578.0, ans=0.1 2023-06-21 14:41:10,040 INFO [train.py:996] (3/4) Epoch 6, batch 12300, loss[loss=0.2885, simple_loss=0.3697, pruned_loss=0.1036, over 21844.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3091, pruned_loss=0.07859, over 4251307.07 frames. ], batch size: 371, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:42:39,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=988878.0, ans=0.0 2023-06-21 14:42:44,766 INFO [train.py:996] (3/4) Epoch 6, batch 12350, loss[loss=0.2611, simple_loss=0.3325, pruned_loss=0.09487, over 21560.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3149, pruned_loss=0.07924, over 4254539.24 frames. ], batch size: 548, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:43:08,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-21 14:43:24,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.00 vs. limit=10.0 2023-06-21 14:43:28,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989058.0, ans=0.1 2023-06-21 14:43:36,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.741e+02 3.281e+02 4.325e+02 6.278e+02, threshold=6.562e+02, percent-clipped=1.0 2023-06-21 14:44:08,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=989178.0, ans=0.035 2023-06-21 14:44:18,042 INFO [train.py:996] (3/4) Epoch 6, batch 12400, loss[loss=0.2861, simple_loss=0.3353, pruned_loss=0.1185, over 21344.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3169, pruned_loss=0.08356, over 4263284.52 frames. 
], batch size: 176, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:44:51,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=989358.0, ans=0.0 2023-06-21 14:45:36,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=989478.0, ans=0.125 2023-06-21 14:45:38,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-21 14:45:52,839 INFO [train.py:996] (3/4) Epoch 6, batch 12450, loss[loss=0.2978, simple_loss=0.365, pruned_loss=0.1153, over 21481.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3204, pruned_loss=0.08697, over 4272044.50 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:45:55,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-21 14:46:30,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-21 14:46:55,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.846e+02 3.218e+02 3.959e+02 6.466e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 14:47:35,191 INFO [train.py:996] (3/4) Epoch 6, batch 12500, loss[loss=0.2932, simple_loss=0.3885, pruned_loss=0.09894, over 21922.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3301, pruned_loss=0.08973, over 4269284.44 frames. ], batch size: 317, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:48:53,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=990078.0, ans=0.0 2023-06-21 14:49:19,716 INFO [train.py:996] (3/4) Epoch 6, batch 12550, loss[loss=0.2699, simple_loss=0.3402, pruned_loss=0.09979, over 21818.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3337, pruned_loss=0.09228, over 4274687.23 frames. ], batch size: 118, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:49:50,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=990198.0, ans=15.0 2023-06-21 14:50:00,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-21 14:50:12,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 2.946e+02 3.555e+02 3.995e+02 6.725e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-21 14:50:12,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=990258.0, ans=0.02 2023-06-21 14:50:49,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=990378.0, ans=0.0 2023-06-21 14:50:55,508 INFO [train.py:996] (3/4) Epoch 6, batch 12600, loss[loss=0.2454, simple_loss=0.3243, pruned_loss=0.08327, over 21595.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.332, pruned_loss=0.08985, over 4267914.44 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:51:15,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=10.0 2023-06-21 14:52:12,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=990678.0, ans=0.125 2023-06-21 14:52:13,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=990678.0, ans=0.125 2023-06-21 14:52:17,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=990678.0, ans=0.0 2023-06-21 14:52:25,059 INFO [train.py:996] (3/4) Epoch 6, batch 12650, loss[loss=0.3005, simple_loss=0.4094, pruned_loss=0.09578, over 20758.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3239, pruned_loss=0.0851, over 4274759.72 frames. ], batch size: 608, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:53:13,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=990858.0, ans=0.0 2023-06-21 14:53:16,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.529e+02 3.007e+02 3.447e+02 6.549e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-21 14:53:34,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-21 14:54:11,993 INFO [train.py:996] (3/4) Epoch 6, batch 12700, loss[loss=0.2844, simple_loss=0.3485, pruned_loss=0.1101, over 21943.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3234, pruned_loss=0.08723, over 4280615.99 frames. ], batch size: 372, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:54:16,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=991038.0, ans=0.0 2023-06-21 14:54:42,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=991158.0, ans=0.0 2023-06-21 14:55:34,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-21 14:55:35,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991278.0, ans=0.125 2023-06-21 14:55:35,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=991278.0, ans=0.09899494936611666 2023-06-21 14:55:48,669 INFO [train.py:996] (3/4) Epoch 6, batch 12750, loss[loss=0.2508, simple_loss=0.3229, pruned_loss=0.0893, over 21772.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3253, pruned_loss=0.08778, over 4282596.15 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:55:52,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. 
limit=15.0 2023-06-21 14:56:07,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=991398.0, ans=10.0 2023-06-21 14:56:36,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.913e+02 3.338e+02 4.032e+02 7.736e+02, threshold=6.676e+02, percent-clipped=3.0 2023-06-21 14:56:49,105 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:57:24,150 INFO [train.py:996] (3/4) Epoch 6, batch 12800, loss[loss=0.2287, simple_loss=0.3025, pruned_loss=0.07741, over 21812.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3245, pruned_loss=0.08861, over 4287968.64 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:57:26,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991638.0, ans=0.1 2023-06-21 14:57:41,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=991698.0, ans=0.125 2023-06-21 14:57:43,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=991698.0, ans=0.02 2023-06-21 14:57:46,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.14 vs. limit=10.0 2023-06-21 14:57:47,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=991698.0, ans=0.0 2023-06-21 14:58:00,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=991758.0, ans=0.5 2023-06-21 14:58:03,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=991758.0, ans=0.2 2023-06-21 14:58:20,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=991818.0, ans=15.0 2023-06-21 14:58:59,824 INFO [train.py:996] (3/4) Epoch 6, batch 12850, loss[loss=0.2175, simple_loss=0.3065, pruned_loss=0.06426, over 21735.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3268, pruned_loss=0.09036, over 4289073.38 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:59:09,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-21 14:59:42,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=992058.0, ans=0.035 2023-06-21 14:59:53,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.801e+02 3.143e+02 3.622e+02 6.427e+02, threshold=6.286e+02, percent-clipped=0.0 2023-06-21 15:00:36,422 INFO [train.py:996] (3/4) Epoch 6, batch 12900, loss[loss=0.2085, simple_loss=0.2835, pruned_loss=0.06679, over 21412.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3238, pruned_loss=0.08605, over 4280283.14 frames. 
], batch size: 194, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:00:36,911 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:00:43,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=22.5 2023-06-21 15:01:45,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-21 15:02:12,322 INFO [train.py:996] (3/4) Epoch 6, batch 12950, loss[loss=0.2585, simple_loss=0.3275, pruned_loss=0.09469, over 21933.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3224, pruned_loss=0.08431, over 4275645.83 frames. ], batch size: 317, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:02:37,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=992598.0, ans=0.0 2023-06-21 15:03:15,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.940e+02 3.602e+02 4.409e+02 7.106e+02, threshold=7.204e+02, percent-clipped=2.0 2023-06-21 15:03:23,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=992718.0, ans=0.0 2023-06-21 15:03:45,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=992838.0, ans=0.125 2023-06-21 15:03:46,798 INFO [train.py:996] (3/4) Epoch 6, batch 13000, loss[loss=0.245, simple_loss=0.3251, pruned_loss=0.08244, over 21621.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3248, pruned_loss=0.08613, over 4257777.83 frames. ], batch size: 441, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:03:51,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992838.0, ans=0.1 2023-06-21 15:03:54,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=992838.0, ans=0.0 2023-06-21 15:04:28,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-21 15:05:05,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=993078.0, ans=0.125 2023-06-21 15:05:17,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993078.0, ans=0.1 2023-06-21 15:05:21,366 INFO [train.py:996] (3/4) Epoch 6, batch 13050, loss[loss=0.2649, simple_loss=0.3294, pruned_loss=0.1002, over 21872.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3198, pruned_loss=0.08346, over 4262212.31 frames. 
], batch size: 371, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:05:54,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=993198.0, ans=0.2 2023-06-21 15:05:54,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=993198.0, ans=0.125 2023-06-21 15:06:01,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=993198.0, ans=0.125 2023-06-21 15:06:04,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=993258.0, ans=0.2 2023-06-21 15:06:18,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=993258.0, ans=0.0 2023-06-21 15:06:21,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=993258.0, ans=0.0 2023-06-21 15:06:23,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.762e+02 3.169e+02 4.003e+02 6.766e+02, threshold=6.339e+02, percent-clipped=0.0 2023-06-21 15:06:32,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-21 15:06:54,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=993438.0, ans=0.0 2023-06-21 15:07:00,669 INFO [train.py:996] (3/4) Epoch 6, batch 13100, loss[loss=0.2839, simple_loss=0.351, pruned_loss=0.1084, over 21304.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.323, pruned_loss=0.08377, over 4265020.27 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:07:32,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=993498.0, ans=0.125 2023-06-21 15:07:46,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=993558.0, ans=0.125 2023-06-21 15:08:02,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=993618.0, ans=0.0 2023-06-21 15:08:05,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=993618.0, ans=0.2 2023-06-21 15:08:05,493 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:08:11,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=993618.0, ans=0.0 2023-06-21 15:08:42,999 INFO [train.py:996] (3/4) Epoch 6, batch 13150, loss[loss=0.2026, simple_loss=0.2797, pruned_loss=0.06271, over 21584.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3259, pruned_loss=0.08712, over 4265331.58 frames. 
], batch size: 263, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:09:05,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993738.0, ans=0.125 2023-06-21 15:09:11,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=993798.0, ans=0.125 2023-06-21 15:09:37,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.111e+02 3.957e+02 5.293e+02 1.278e+03, threshold=7.913e+02, percent-clipped=9.0 2023-06-21 15:09:38,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=993918.0, ans=0.125 2023-06-21 15:10:27,532 INFO [train.py:996] (3/4) Epoch 6, batch 13200, loss[loss=0.2888, simple_loss=0.3485, pruned_loss=0.1146, over 21271.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3228, pruned_loss=0.0864, over 4267994.76 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:10:43,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=994038.0, ans=0.125 2023-06-21 15:10:44,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=994038.0, ans=0.125 2023-06-21 15:11:04,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=994158.0, ans=0.125 2023-06-21 15:11:15,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=994158.0, ans=0.125 2023-06-21 15:11:29,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994218.0, ans=0.1 2023-06-21 15:12:03,180 INFO [train.py:996] (3/4) Epoch 6, batch 13250, loss[loss=0.2713, simple_loss=0.3301, pruned_loss=0.1063, over 21822.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3231, pruned_loss=0.08917, over 4277365.44 frames. ], batch size: 107, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:12:18,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-21 15:12:27,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=994398.0, ans=0.125 2023-06-21 15:12:39,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=994458.0, ans=0.0 2023-06-21 15:12:51,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.898e+02 3.245e+02 3.819e+02 5.517e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-21 15:13:06,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.28 vs. limit=10.0 2023-06-21 15:13:33,148 INFO [train.py:996] (3/4) Epoch 6, batch 13300, loss[loss=0.2485, simple_loss=0.3416, pruned_loss=0.07769, over 20806.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3269, pruned_loss=0.08968, over 4279554.54 frames. 
], batch size: 609, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:13:58,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=994698.0, ans=0.0 2023-06-21 15:14:41,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=994818.0, ans=0.125 2023-06-21 15:14:43,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-21 15:14:44,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=994818.0, ans=0.0 2023-06-21 15:14:46,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=994818.0, ans=0.125 2023-06-21 15:15:05,596 INFO [train.py:996] (3/4) Epoch 6, batch 13350, loss[loss=0.2767, simple_loss=0.3526, pruned_loss=0.1004, over 21733.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.332, pruned_loss=0.09272, over 4280790.89 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:15:05,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=994938.0, ans=0.1 2023-06-21 15:15:12,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=994938.0, ans=0.0 2023-06-21 15:16:00,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=995118.0, ans=0.125 2023-06-21 15:16:01,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.076e+02 3.846e+02 4.574e+02 8.350e+02, threshold=7.691e+02, percent-clipped=3.0 2023-06-21 15:16:16,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=995118.0, ans=0.025 2023-06-21 15:16:44,126 INFO [train.py:996] (3/4) Epoch 6, batch 13400, loss[loss=0.2603, simple_loss=0.3266, pruned_loss=0.09697, over 21304.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3334, pruned_loss=0.095, over 4282227.50 frames. ], batch size: 176, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:17:54,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=995418.0, ans=0.125 2023-06-21 15:18:07,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=995478.0, ans=0.2 2023-06-21 15:18:25,030 INFO [train.py:996] (3/4) Epoch 6, batch 13450, loss[loss=0.2543, simple_loss=0.313, pruned_loss=0.09777, over 21789.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3343, pruned_loss=0.0959, over 4276043.56 frames. 
], batch size: 118, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:18:57,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=995598.0, ans=0.125 2023-06-21 15:19:28,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=995718.0, ans=0.0 2023-06-21 15:19:29,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.161e+02 3.421e+02 3.980e+02 7.603e+02, threshold=6.841e+02, percent-clipped=0.0 2023-06-21 15:20:06,007 INFO [train.py:996] (3/4) Epoch 6, batch 13500, loss[loss=0.241, simple_loss=0.3118, pruned_loss=0.08513, over 21799.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3239, pruned_loss=0.0925, over 4272825.47 frames. ], batch size: 352, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:20:07,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=995838.0, ans=0.2 2023-06-21 15:20:37,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=995898.0, ans=0.0 2023-06-21 15:20:47,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=995898.0, ans=0.0 2023-06-21 15:20:58,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=995958.0, ans=0.125 2023-06-21 15:21:00,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=22.5 2023-06-21 15:21:08,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-21 15:21:14,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-21 15:21:39,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996078.0, ans=0.1 2023-06-21 15:21:43,568 INFO [train.py:996] (3/4) Epoch 6, batch 13550, loss[loss=0.34, simple_loss=0.4302, pruned_loss=0.1249, over 21645.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3282, pruned_loss=0.0919, over 4278345.82 frames. ], batch size: 441, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:21:45,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=996138.0, ans=0.125 2023-06-21 15:21:47,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=996138.0, ans=0.07 2023-06-21 15:21:55,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=996138.0, ans=0.125 2023-06-21 15:22:44,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.030e+02 3.606e+02 4.387e+02 7.560e+02, threshold=7.212e+02, percent-clipped=4.0 2023-06-21 15:23:18,604 INFO [train.py:996] (3/4) Epoch 6, batch 13600, loss[loss=0.2421, simple_loss=0.3052, pruned_loss=0.08947, over 21797.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3305, pruned_loss=0.09296, over 4283024.93 frames. 
], batch size: 247, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:23:40,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-21 15:23:52,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=996498.0, ans=0.125 2023-06-21 15:23:52,356 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:24:07,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=996558.0, ans=0.025 2023-06-21 15:24:44,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=996678.0, ans=0.07 2023-06-21 15:24:58,352 INFO [train.py:996] (3/4) Epoch 6, batch 13650, loss[loss=0.1989, simple_loss=0.2537, pruned_loss=0.07204, over 19997.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3246, pruned_loss=0.08906, over 4279663.09 frames. ], batch size: 703, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:25:03,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-21 15:25:54,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.919e+02 3.475e+02 4.506e+02 7.169e+02, threshold=6.950e+02, percent-clipped=0.0 2023-06-21 15:26:32,666 INFO [train.py:996] (3/4) Epoch 6, batch 13700, loss[loss=0.2068, simple_loss=0.2711, pruned_loss=0.07126, over 21466.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.318, pruned_loss=0.08855, over 4280593.74 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:27:14,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-21 15:27:55,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=997278.0, ans=0.0 2023-06-21 15:28:15,469 INFO [train.py:996] (3/4) Epoch 6, batch 13750, loss[loss=0.2132, simple_loss=0.2753, pruned_loss=0.0756, over 21275.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3165, pruned_loss=0.08744, over 4284698.34 frames. 
], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:28:24,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=997338.0, ans=0.09899494936611666 2023-06-21 15:28:27,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=997338.0, ans=0.125 2023-06-21 15:28:43,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=997398.0, ans=10.0 2023-06-21 15:28:43,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997398.0, ans=0.1 2023-06-21 15:29:15,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.229e+02 4.012e+02 5.672e+02 9.491e+02, threshold=8.024e+02, percent-clipped=9.0 2023-06-21 15:29:32,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=997578.0, ans=0.125 2023-06-21 15:29:40,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=997578.0, ans=0.125 2023-06-21 15:29:58,583 INFO [train.py:996] (3/4) Epoch 6, batch 13800, loss[loss=0.2876, simple_loss=0.376, pruned_loss=0.09959, over 21767.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3208, pruned_loss=0.08592, over 4278329.50 frames. ], batch size: 282, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:30:02,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=997638.0, ans=0.0 2023-06-21 15:30:10,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-21 15:30:15,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997698.0, ans=0.1 2023-06-21 15:30:17,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=997698.0, ans=0.0 2023-06-21 15:30:54,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=997818.0, ans=0.2 2023-06-21 15:31:35,476 INFO [train.py:996] (3/4) Epoch 6, batch 13850, loss[loss=0.3253, simple_loss=0.3959, pruned_loss=0.1273, over 21715.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.329, pruned_loss=0.08766, over 4286364.93 frames. 
], batch size: 441, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:32:01,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=997998.0, ans=0.0 2023-06-21 15:32:02,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=997998.0, ans=0.0 2023-06-21 15:32:10,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=998058.0, ans=0.2 2023-06-21 15:32:19,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=998058.0, ans=0.2 2023-06-21 15:32:30,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998118.0, ans=0.125 2023-06-21 15:32:43,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.953e+02 3.447e+02 4.211e+02 7.666e+02, threshold=6.893e+02, percent-clipped=0.0 2023-06-21 15:32:56,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-21 15:33:08,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=998178.0, ans=0.125 2023-06-21 15:33:10,826 INFO [train.py:996] (3/4) Epoch 6, batch 13900, loss[loss=0.2641, simple_loss=0.331, pruned_loss=0.09856, over 19943.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3324, pruned_loss=0.09156, over 4284526.23 frames. ], batch size: 702, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:33:12,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=998238.0, ans=0.125 2023-06-21 15:33:23,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=998238.0, ans=0.125 2023-06-21 15:33:32,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=998298.0, ans=0.125 2023-06-21 15:34:17,763 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:34:41,900 INFO [train.py:996] (3/4) Epoch 6, batch 13950, loss[loss=0.2797, simple_loss=0.336, pruned_loss=0.1117, over 21432.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3339, pruned_loss=0.0937, over 4285751.18 frames. 
], batch size: 211, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:35:23,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=998658.0, ans=0.0 2023-06-21 15:35:39,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998718.0, ans=0.125 2023-06-21 15:35:43,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.088e+02 3.493e+02 4.359e+02 6.535e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-21 15:35:55,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=998778.0, ans=0.125 2023-06-21 15:36:06,366 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:36:10,579 INFO [train.py:996] (3/4) Epoch 6, batch 14000, loss[loss=0.2424, simple_loss=0.3251, pruned_loss=0.07982, over 21698.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3287, pruned_loss=0.09071, over 4281440.29 frames. ], batch size: 389, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:36:48,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=998898.0, ans=0.125 2023-06-21 15:36:51,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=998958.0, ans=0.125 2023-06-21 15:36:52,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=998958.0, ans=0.125 2023-06-21 15:37:41,084 INFO [train.py:996] (3/4) Epoch 6, batch 14050, loss[loss=0.2091, simple_loss=0.2769, pruned_loss=0.07063, over 21239.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3229, pruned_loss=0.08598, over 4277676.40 frames. ], batch size: 548, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:37:44,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=999138.0, ans=0.125 2023-06-21 15:37:47,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=999138.0, ans=0.125 2023-06-21 15:38:10,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=999198.0, ans=0.0 2023-06-21 15:38:47,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-21 15:38:48,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.785e+02 3.183e+02 4.255e+02 6.746e+02, threshold=6.366e+02, percent-clipped=0.0 2023-06-21 15:39:03,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-21 15:39:12,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=999378.0, ans=0.04949747468305833 2023-06-21 15:39:16,535 INFO [train.py:996] (3/4) Epoch 6, batch 14100, loss[loss=0.2111, simple_loss=0.2793, pruned_loss=0.07148, over 21721.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3176, pruned_loss=0.08629, over 4274029.00 frames. 
], batch size: 247, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:39:55,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=999558.0, ans=0.0 2023-06-21 15:40:37,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=999678.0, ans=0.2 2023-06-21 15:40:49,622 INFO [train.py:996] (3/4) Epoch 6, batch 14150, loss[loss=0.2732, simple_loss=0.3499, pruned_loss=0.09824, over 21893.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3224, pruned_loss=0.08802, over 4281844.00 frames. ], batch size: 98, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:41:08,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=999798.0, ans=0.125 2023-06-21 15:41:20,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=999798.0, ans=0.125 2023-06-21 15:41:39,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=999858.0, ans=0.0 2023-06-21 15:41:47,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.804e+02 3.332e+02 4.334e+02 8.014e+02, threshold=6.664e+02, percent-clipped=2.0 2023-06-21 15:41:49,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-21 15:42:23,659 INFO [train.py:996] (3/4) Epoch 6, batch 14200, loss[loss=0.2741, simple_loss=0.3348, pruned_loss=0.1067, over 21720.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3199, pruned_loss=0.08658, over 4273371.56 frames. ], batch size: 441, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:43:18,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1000218.0, ans=0.1 2023-06-21 15:43:27,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1000218.0, ans=0.0 2023-06-21 15:43:58,946 INFO [train.py:996] (3/4) Epoch 6, batch 14250, loss[loss=0.2609, simple_loss=0.3157, pruned_loss=0.1031, over 20099.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3146, pruned_loss=0.08665, over 4265105.18 frames. ], batch size: 703, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:44:10,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1000338.0, ans=0.125 2023-06-21 15:44:51,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1000458.0, ans=0.0 2023-06-21 15:44:59,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.649e+02 3.041e+02 3.616e+02 7.648e+02, threshold=6.082e+02, percent-clipped=1.0 2023-06-21 15:45:35,966 INFO [train.py:996] (3/4) Epoch 6, batch 14300, loss[loss=0.3599, simple_loss=0.4393, pruned_loss=0.1402, over 21651.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.316, pruned_loss=0.08637, over 4243756.05 frames. 
], batch size: 389, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:45:36,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1000638.0, ans=0.0 2023-06-21 15:45:53,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1000698.0, ans=0.125 2023-06-21 15:45:56,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1000698.0, ans=0.125 2023-06-21 15:46:13,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1000758.0, ans=0.0 2023-06-21 15:46:24,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1000758.0, ans=0.0 2023-06-21 15:46:56,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1000878.0, ans=0.0 2023-06-21 15:47:10,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1000938.0, ans=0.1 2023-06-21 15:47:11,420 INFO [train.py:996] (3/4) Epoch 6, batch 14350, loss[loss=0.2359, simple_loss=0.3337, pruned_loss=0.06905, over 21839.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3205, pruned_loss=0.08606, over 4255737.35 frames. ], batch size: 316, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:47:16,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1000938.0, ans=0.0 2023-06-21 15:47:30,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1000998.0, ans=0.125 2023-06-21 15:48:18,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.231e+02 3.841e+02 4.769e+02 8.361e+02, threshold=7.683e+02, percent-clipped=10.0 2023-06-21 15:48:22,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1001118.0, ans=0.04949747468305833 2023-06-21 15:48:46,005 INFO [train.py:996] (3/4) Epoch 6, batch 14400, loss[loss=0.2433, simple_loss=0.3152, pruned_loss=0.08567, over 20955.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3209, pruned_loss=0.08706, over 4258502.96 frames. ], batch size: 608, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 15:49:15,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1001298.0, ans=0.0 2023-06-21 15:49:23,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1001358.0, ans=0.0 2023-06-21 15:50:20,552 INFO [train.py:996] (3/4) Epoch 6, batch 14450, loss[loss=0.2207, simple_loss=0.2851, pruned_loss=0.07812, over 21597.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3151, pruned_loss=0.08689, over 4266100.66 frames. 
], batch size: 263, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:50:25,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1001538.0, ans=0.125 2023-06-21 15:51:29,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.916e+02 3.251e+02 4.168e+02 6.765e+02, threshold=6.503e+02, percent-clipped=0.0 2023-06-21 15:51:48,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1001778.0, ans=0.025 2023-06-21 15:51:55,519 INFO [train.py:996] (3/4) Epoch 6, batch 14500, loss[loss=0.2212, simple_loss=0.2965, pruned_loss=0.07296, over 21756.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3119, pruned_loss=0.08665, over 4265961.39 frames. ], batch size: 112, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:51:57,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1001838.0, ans=0.0 2023-06-21 15:52:09,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1001898.0, ans=0.125 2023-06-21 15:52:13,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-21 15:52:27,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-21 15:52:33,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-21 15:52:39,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1001958.0, ans=0.0 2023-06-21 15:53:07,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1002018.0, ans=0.125 2023-06-21 15:53:09,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1002018.0, ans=0.2 2023-06-21 15:53:22,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-21 15:53:28,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1002078.0, ans=0.125 2023-06-21 15:53:30,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1002138.0, ans=0.125 2023-06-21 15:53:31,825 INFO [train.py:996] (3/4) Epoch 6, batch 14550, loss[loss=0.2616, simple_loss=0.3371, pruned_loss=0.09302, over 21687.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3184, pruned_loss=0.08886, over 4260228.13 frames. 
], batch size: 298, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:53:49,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1002138.0, ans=0.125 2023-06-21 15:54:29,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002258.0, ans=0.1 2023-06-21 15:54:29,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1002258.0, ans=0.0 2023-06-21 15:54:30,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1002258.0, ans=0.0 2023-06-21 15:54:41,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 3.127e+02 3.885e+02 5.234e+02 1.064e+03, threshold=7.771e+02, percent-clipped=10.0 2023-06-21 15:54:47,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002318.0, ans=0.125 2023-06-21 15:54:51,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1002378.0, ans=0.0 2023-06-21 15:54:59,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1002378.0, ans=0.125 2023-06-21 15:55:07,179 INFO [train.py:996] (3/4) Epoch 6, batch 14600, loss[loss=0.3108, simple_loss=0.3787, pruned_loss=0.1214, over 21530.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.327, pruned_loss=0.09314, over 4267568.18 frames. ], batch size: 131, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:55:36,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002498.0, ans=0.1 2023-06-21 15:55:41,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1002498.0, ans=0.0 2023-06-21 15:55:42,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1002498.0, ans=0.125 2023-06-21 15:56:05,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-21 15:56:27,040 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:56:30,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1002678.0, ans=0.0 2023-06-21 15:56:41,650 INFO [train.py:996] (3/4) Epoch 6, batch 14650, loss[loss=0.2325, simple_loss=0.3176, pruned_loss=0.07373, over 21697.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.329, pruned_loss=0.09179, over 4263623.18 frames. 
], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:56:51,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1002738.0, ans=0.2 2023-06-21 15:57:50,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.901e+02 3.727e+02 5.131e+02 9.036e+02, threshold=7.453e+02, percent-clipped=4.0 2023-06-21 15:57:54,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1002918.0, ans=0.0 2023-06-21 15:57:56,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002918.0, ans=0.125 2023-06-21 15:58:21,852 INFO [train.py:996] (3/4) Epoch 6, batch 14700, loss[loss=0.1959, simple_loss=0.281, pruned_loss=0.05546, over 21273.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3217, pruned_loss=0.08553, over 4264008.14 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:58:29,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1003038.0, ans=0.0 2023-06-21 15:59:16,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1003158.0, ans=0.125 2023-06-21 15:59:18,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1003158.0, ans=0.2 2023-06-21 15:59:22,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0 2023-06-21 15:59:49,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-21 15:59:59,095 INFO [train.py:996] (3/4) Epoch 6, batch 14750, loss[loss=0.2894, simple_loss=0.3571, pruned_loss=0.1108, over 21510.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3256, pruned_loss=0.08839, over 4264993.87 frames. ], batch size: 194, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 16:00:05,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1003338.0, ans=0.04949747468305833 2023-06-21 16:00:38,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1003398.0, ans=0.125 2023-06-21 16:01:04,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 3.041e+02 3.594e+02 4.539e+02 8.460e+02, threshold=7.189e+02, percent-clipped=3.0 2023-06-21 16:01:24,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1003578.0, ans=0.125 2023-06-21 16:01:27,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1003578.0, ans=0.2 2023-06-21 16:01:36,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1003578.0, ans=0.0 2023-06-21 16:01:39,219 INFO [train.py:996] (3/4) Epoch 6, batch 14800, loss[loss=0.2462, simple_loss=0.3092, pruned_loss=0.09156, over 21804.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.337, pruned_loss=0.09437, over 4267672.72 frames. 
], batch size: 107, lr: 5.11e-03, grad_scale: 32.0
2023-06-21 16:01:50,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0
2023-06-21 16:02:02,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1003698.0, ans=0.125
2023-06-21 16:02:10,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1003698.0, ans=0.2
2023-06-21 16:02:14,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0
2023-06-21 16:02:23,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1003758.0, ans=0.2
2023-06-21 16:02:46,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1003818.0, ans=0.2
2023-06-21 16:03:20,681 INFO [train.py:996] (3/4) Epoch 6, batch 14850, loss[loss=0.2149, simple_loss=0.2859, pruned_loss=0.07193, over 21704.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3301, pruned_loss=0.09399, over 4264232.50 frames. ], batch size: 298, lr: 5.10e-03, grad_scale: 32.0
2023-06-21 16:03:49,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1003998.0, ans=0.2
2023-06-21 16:04:09,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1004058.0, ans=0.0
2023-06-21 16:04:28,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.136e+02 3.769e+02 4.672e+02 7.258e+02, threshold=7.538e+02, percent-clipped=1.0
2023-06-21 16:05:03,044 INFO [train.py:996] (3/4) Epoch 6, batch 14900, loss[loss=0.3419, simple_loss=0.3921, pruned_loss=0.1458, over 21805.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3342, pruned_loss=0.09583, over 4266809.60 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:05:51,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0
2023-06-21 16:05:53,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004358.0, ans=0.125
2023-06-21 16:06:05,023 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:06:40,445 INFO [train.py:996] (3/4) Epoch 6, batch 14950, loss[loss=0.2576, simple_loss=0.332, pruned_loss=0.09165, over 21278.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3345, pruned_loss=0.09487, over 4265327.53 frames. ], batch size: 159, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:06:54,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1004598.0, ans=0.0
2023-06-21 16:07:27,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1004658.0, ans=0.125
2023-06-21 16:07:41,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 2.931e+02 3.491e+02 4.464e+02 7.538e+02, threshold=6.982e+02, percent-clipped=0.0
2023-06-21 16:07:51,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1004718.0, ans=0.125
2023-06-21 16:07:52,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1004718.0, ans=0.125
2023-06-21 16:08:08,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0
2023-06-21 16:08:10,578 INFO [train.py:996] (3/4) Epoch 6, batch 15000, loss[loss=0.2472, simple_loss=0.3181, pruned_loss=0.08815, over 21825.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3375, pruned_loss=0.0967, over 4268541.17 frames. ], batch size: 332, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:08:10,578 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 16:08:27,120 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.26, simple_loss=0.3558, pruned_loss=0.08209, over 1796401.00 frames.
2023-06-21 16:08:27,120 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-21 16:08:42,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1004838.0, ans=0.04949747468305833
2023-06-21 16:08:43,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1004838.0, ans=0.2
2023-06-21 16:08:59,385 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:09:32,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0
2023-06-21 16:09:36,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1005018.0, ans=0.125
2023-06-21 16:10:04,148 INFO [train.py:996] (3/4) Epoch 6, batch 15050, loss[loss=0.2745, simple_loss=0.3682, pruned_loss=0.09047, over 20744.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3374, pruned_loss=0.09685, over 4260339.91 frames. ], batch size: 607, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:10:55,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1005258.0, ans=0.125
2023-06-21 16:11:16,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 2.976e+02 3.378e+02 4.052e+02 9.524e+02, threshold=6.756e+02, percent-clipped=3.0
2023-06-21 16:11:44,043 INFO [train.py:996] (3/4) Epoch 6, batch 15100, loss[loss=0.2737, simple_loss=0.3402, pruned_loss=0.1036, over 21429.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3382, pruned_loss=0.0959, over 4258429.45 frames. ], batch size: 131, lr: 5.10e-03, grad_scale: 8.0
2023-06-21 16:11:44,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1005438.0, ans=0.125
2023-06-21 16:11:59,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0
2023-06-21 16:12:28,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5
2023-06-21 16:12:35,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1005558.0, ans=0.2
2023-06-21 16:12:53,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1005618.0, ans=0.125
2023-06-21 16:12:56,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1005618.0, ans=10.0
2023-06-21 16:12:59,373 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:13:23,666 INFO [train.py:996] (3/4) Epoch 6, batch 15150, loss[loss=0.2074, simple_loss=0.2673, pruned_loss=0.07373, over 21218.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3351, pruned_loss=0.09644, over 4261810.41 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 8.0
2023-06-21 16:13:48,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1005798.0, ans=0.125
2023-06-21 16:13:53,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. limit=12.0
2023-06-21 16:14:02,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5
2023-06-21 16:14:25,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 2.946e+02 3.329e+02 3.848e+02 7.712e+02, threshold=6.658e+02, percent-clipped=2.0
2023-06-21 16:14:51,575 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:14:57,082 INFO [train.py:996] (3/4) Epoch 6, batch 15200, loss[loss=0.169, simple_loss=0.2198, pruned_loss=0.0591, over 15710.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3264, pruned_loss=0.09209, over 4258469.69 frames. ], batch size: 60, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:15:46,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1006158.0, ans=0.2
2023-06-21 16:15:59,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1006218.0, ans=0.125
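The [optim.py:471] entries record the quartiles (min/25%/median/75%/max) of recent gradient norms; in every entry above the reported threshold equals Clipping_scale times the median (e.g. 2.0 x 3.769e+02 = 7.538e+02, and 2.0 x 3.329e+02 = 6.658e+02), and percent-clipped is the share of recent batches whose gradients were rescaled. A minimal sketch of that bookkeeping, assuming a sliding window of norms; the class name and window size are illustrative, not the icefall implementation:

# Hedged sketch of quartile-based gradient clipping consistent with the
# [optim.py:471] lines: track recent grad norms, report their quartiles,
# clip at clipping_scale * median.
from collections import deque

import torch


class QuartileGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)    # recent total grad norms
        self.clipped = deque(maxlen=window)  # 1.0 where clipping fired

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(
            torch.stack([p.grad.detach().norm() for p in params])
        ).item()
        self.norms.append(norm)
        sorted_norms = sorted(self.norms)
        n = len(sorted_norms)
        # min / 25% / median / 75% / max, as printed in the log
        quartiles = [sorted_norms[min(n - 1, (n * k) // 4)] for k in range(5)]
        threshold = self.clipping_scale * quartiles[2]  # 2.0 * median
        self.clipped.append(float(norm > threshold))
        if norm > threshold:
            for p in params:
                p.grad.detach().mul_(threshold / norm)
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{q:.3e}" for q in quartiles)
            + f", threshold={threshold:.3e}, percent-clipped="
            f"{100.0 * sum(self.clipped) / len(self.clipped):.1f}"
        )
        return norm

A median-relative threshold adapts automatically as gradient norms shrink over training, which is why the logged thresholds drift downward together with the quartiles.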
2023-06-21 16:16:30,313 INFO [train.py:996] (3/4) Epoch 6, batch 15250, loss[loss=0.2265, simple_loss=0.284, pruned_loss=0.08446, over 21596.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3211, pruned_loss=0.09039, over 4258747.94 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:17:13,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006458.0, ans=0.1
2023-06-21 16:17:31,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006518.0, ans=0.1
2023-06-21 16:17:32,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.045e+02 3.599e+02 4.449e+02 6.735e+02, threshold=7.197e+02, percent-clipped=2.0
2023-06-21 16:17:56,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1006578.0, ans=0.05
2023-06-21 16:18:15,119 INFO [train.py:996] (3/4) Epoch 6, batch 15300, loss[loss=0.2732, simple_loss=0.3525, pruned_loss=0.09699, over 21809.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3246, pruned_loss=0.09376, over 4254541.72 frames. ], batch size: 118, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:19:05,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1006818.0, ans=0.125
2023-06-21 16:19:06,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1006818.0, ans=0.125
2023-06-21 16:19:43,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006938.0, ans=0.1
2023-06-21 16:19:44,622 INFO [train.py:996] (3/4) Epoch 6, batch 15350, loss[loss=0.2461, simple_loss=0.3419, pruned_loss=0.07521, over 21672.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3289, pruned_loss=0.0958, over 4256574.52 frames. ], batch size: 414, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:19:54,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1006938.0, ans=0.125
2023-06-21 16:20:05,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0
2023-06-21 16:20:10,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1006998.0, ans=0.0
2023-06-21 16:20:29,613 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:20:40,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.864e+02 3.312e+02 3.850e+02 5.534e+02, threshold=6.625e+02, percent-clipped=0.0
2023-06-21 16:20:42,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1007118.0, ans=0.125
2023-06-21 16:20:56,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1007178.0, ans=0.125
2023-06-21 16:21:04,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5
2023-06-21 16:21:12,510 INFO [train.py:996] (3/4) Epoch 6, batch 15400, loss[loss=0.2244, simple_loss=0.2916, pruned_loss=0.07855, over 21300.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.329, pruned_loss=0.09392, over 4253166.71 frames. ], batch size: 143, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:21:19,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1007238.0, ans=0.0
2023-06-21 16:21:19,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1007238.0, ans=0.0
2023-06-21 16:21:47,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0
2023-06-21 16:21:55,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1007358.0, ans=0.125
2023-06-21 16:22:06,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1007418.0, ans=0.125
2023-06-21 16:22:34,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.10 vs. limit=10.0
2023-06-21 16:22:51,011 INFO [train.py:996] (3/4) Epoch 6, batch 15450, loss[loss=0.2234, simple_loss=0.284, pruned_loss=0.0814, over 21623.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3267, pruned_loss=0.09298, over 4252255.18 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 16.0
2023-06-21 16:23:29,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0
2023-06-21 16:23:32,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007658.0, ans=0.1
2023-06-21 16:23:39,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1007718.0, ans=0.125
2023-06-21 16:23:53,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.820e+02 3.206e+02 3.882e+02 5.798e+02, threshold=6.411e+02, percent-clipped=0.0
2023-06-21 16:24:23,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1007778.0, ans=0.125
2023-06-21 16:24:26,138 INFO [train.py:996] (3/4) Epoch 6, batch 15500, loss[loss=0.2764, simple_loss=0.3444, pruned_loss=0.1041, over 21432.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3292, pruned_loss=0.09239, over 4250237.15 frames. ], batch size: 211, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:24:33,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.72 vs. limit=22.5
2023-06-21 16:25:03,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1007958.0, ans=0.0
2023-06-21 16:25:07,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1007958.0, ans=0.2
2023-06-21 16:25:35,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1008018.0, ans=0.05
2023-06-21 16:25:43,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=15.0
2023-06-21 16:26:05,927 INFO [train.py:996] (3/4) Epoch 6, batch 15550, loss[loss=0.253, simple_loss=0.3395, pruned_loss=0.08329, over 21633.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.328, pruned_loss=0.09057, over 4259755.71 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:26:22,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1008198.0, ans=0.125
2023-06-21 16:27:08,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.789e+02 3.164e+02 3.648e+02 6.720e+02, threshold=6.328e+02, percent-clipped=2.0
2023-06-21 16:27:19,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1008378.0, ans=0.125
2023-06-21 16:27:39,799 INFO [train.py:996] (3/4) Epoch 6, batch 15600, loss[loss=0.2298, simple_loss=0.2909, pruned_loss=0.08436, over 21595.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3219, pruned_loss=0.08876, over 4264097.41 frames. ], batch size: 247, lr: 5.09e-03, grad_scale: 32.0
2023-06-21 16:27:45,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1008438.0, ans=0.125
2023-06-21 16:27:47,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008438.0, ans=0.1
2023-06-21 16:27:52,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0
2023-06-21 16:27:52,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0
2023-06-21 16:27:58,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1008498.0, ans=0.125
2023-06-21 16:27:58,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0
2023-06-21 16:28:06,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0
2023-06-21 16:28:54,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.30 vs. limit=6.0
2023-06-21 16:29:13,692 INFO [train.py:996] (3/4) Epoch 6, batch 15650, loss[loss=0.2477, simple_loss=0.3087, pruned_loss=0.09341, over 21599.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3199, pruned_loss=0.08808, over 4271476.58 frames. ], batch size: 332, lr: 5.09e-03, grad_scale: 32.0
2023-06-21 16:30:15,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 3.062e+02 3.560e+02 4.421e+02 6.753e+02, threshold=7.119e+02, percent-clipped=3.0
2023-06-21 16:30:30,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1008978.0, ans=0.125
2023-06-21 16:30:40,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008978.0, ans=0.1
2023-06-21 16:30:47,541 INFO [train.py:996] (3/4) Epoch 6, batch 15700, loss[loss=0.2087, simple_loss=0.2973, pruned_loss=0.06001, over 21623.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3151, pruned_loss=0.08639, over 4267581.63 frames. ], batch size: 247, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:30:57,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0
2023-06-21 16:31:11,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1009098.0, ans=0.0
2023-06-21 16:31:16,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1009158.0, ans=0.125
2023-06-21 16:31:32,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1009158.0, ans=0.125
2023-06-21 16:31:42,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1009218.0, ans=0.125
2023-06-21 16:32:21,335 INFO [train.py:996] (3/4) Epoch 6, batch 15750, loss[loss=0.221, simple_loss=0.2917, pruned_loss=0.0751, over 21732.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3108, pruned_loss=0.08598, over 4276278.25 frames. ], batch size: 351, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:32:29,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1009338.0, ans=0.09899494936611666
2023-06-21 16:33:05,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1009458.0, ans=0.1
2023-06-21 16:33:24,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.861e+02 3.404e+02 4.012e+02 5.531e+02, threshold=6.808e+02, percent-clipped=0.0
2023-06-21 16:33:54,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1009638.0, ans=0.2
2023-06-21 16:33:55,156 INFO [train.py:996] (3/4) Epoch 6, batch 15800, loss[loss=0.2541, simple_loss=0.292, pruned_loss=0.1081, over 21320.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3057, pruned_loss=0.0856, over 4258199.52 frames. ], batch size: 507, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:33:59,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1009638.0, ans=0.0
2023-06-21 16:34:20,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=12.0
2023-06-21 16:34:39,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1009758.0, ans=0.0
2023-06-21 16:35:29,281 INFO [train.py:996] (3/4) Epoch 6, batch 15850, loss[loss=0.252, simple_loss=0.305, pruned_loss=0.09948, over 21249.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3089, pruned_loss=0.08852, over 4269437.67 frames. ], batch size: 176, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:35:45,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1009998.0, ans=0.125
2023-06-21 16:36:09,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010058.0, ans=0.1
2023-06-21 16:36:23,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010118.0, ans=0.1
2023-06-21 16:36:26,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1010118.0, ans=0.0
2023-06-21 16:36:32,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.973e+02 3.336e+02 4.018e+02 6.867e+02, threshold=6.671e+02, percent-clipped=1.0
2023-06-21 16:36:49,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0
2023-06-21 16:36:50,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1010178.0, ans=0.025
2023-06-21 16:37:00,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1010178.0, ans=0.0
2023-06-21 16:37:02,613 INFO [train.py:996] (3/4) Epoch 6, batch 15900, loss[loss=0.2755, simple_loss=0.3471, pruned_loss=0.1019, over 21520.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3061, pruned_loss=0.08804, over 4255137.04 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:37:04,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1010238.0, ans=0.125
2023-06-21 16:37:05,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0
2023-06-21 16:37:19,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1010298.0, ans=0.0
2023-06-21 16:37:26,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1010298.0, ans=0.2
2023-06-21 16:37:37,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1010358.0, ans=0.2
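The [scaling.py:182] ScheduledFloat entries report regularizer hyperparameters (balancer probabilities, skip rates, bypass scale minima, dropout rates) whose current value `ans` depends on `batch_count`. A plausible reading, assuming a piecewise-linear schedule over batch count; the breakpoints in the example are invented for illustration, and the real schedules live in zipformer's scaling.py:

# Hedged sketch of a batch-count-driven scheduled value.
from typing import Sequence, Tuple


def scheduled_float(batch_count: float,
                    schedule: Sequence[Tuple[float, float]]) -> float:
    """Piecewise-linear interpolation of `schedule`, a sorted list of
    (batch_count, value) pairs; clamps outside the first/last breakpoint."""
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    if batch_count >= schedule[-1][0]:
        return schedule[-1][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("unreachable for sorted schedules")


# Example: a rate that decays from 0.3 to 0.1 over the first 20k batches,
# then stays flat -- purely illustrative numbers.
print(scheduled_float(1003698.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1

This is consistent with the log, where most schedules have long since reached their final values (batch_count is above one million) and the printed `ans` no longer changes.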
2023-06-21 16:38:36,307 INFO [train.py:996] (3/4) Epoch 6, batch 15950, loss[loss=0.1788, simple_loss=0.2675, pruned_loss=0.04506, over 21524.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3064, pruned_loss=0.08451, over 4246223.70 frames. ], batch size: 211, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:38:50,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1010598.0, ans=0.2
2023-06-21 16:38:53,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1010598.0, ans=0.125
2023-06-21 16:38:59,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1010598.0, ans=0.125
2023-06-21 16:39:11,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1010658.0, ans=0.2
2023-06-21 16:39:40,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.630e+02 3.032e+02 3.635e+02 5.664e+02, threshold=6.064e+02, percent-clipped=0.0
2023-06-21 16:39:41,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0
2023-06-21 16:40:09,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1010838.0, ans=0.125
2023-06-21 16:40:10,518 INFO [train.py:996] (3/4) Epoch 6, batch 16000, loss[loss=0.2461, simple_loss=0.3244, pruned_loss=0.08389, over 20656.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3084, pruned_loss=0.08232, over 4259012.45 frames. ], batch size: 607, lr: 5.09e-03, grad_scale: 32.0
2023-06-21 16:41:25,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1011078.0, ans=0.125
2023-06-21 16:41:34,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0
2023-06-21 16:41:40,562 INFO [train.py:996] (3/4) Epoch 6, batch 16050, loss[loss=0.2502, simple_loss=0.3313, pruned_loss=0.0845, over 20734.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3116, pruned_loss=0.08062, over 4261186.40 frames. ], batch size: 607, lr: 5.09e-03, grad_scale: 32.0
2023-06-21 16:42:10,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1011258.0, ans=0.125
2023-06-21 16:42:15,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5
2023-06-21 16:42:16,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1011258.0, ans=0.0
2023-06-21 16:42:19,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1011258.0, ans=0.0
2023-06-21 16:42:44,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.881e+02 3.788e+02 4.822e+02 9.882e+02, threshold=7.576e+02, percent-clipped=9.0
2023-06-21 16:42:46,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1011318.0, ans=0.0
2023-06-21 16:43:13,233 INFO [train.py:996] (3/4) Epoch 6, batch 16100, loss[loss=0.2475, simple_loss=0.3172, pruned_loss=0.08892, over 21760.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.317, pruned_loss=0.08277, over 4271106.97 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0
2023-06-21 16:43:35,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1011498.0, ans=0.125
2023-06-21 16:44:42,431 INFO [train.py:996] (3/4) Epoch 6, batch 16150, loss[loss=0.326, simple_loss=0.3658, pruned_loss=0.1431, over 21767.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3198, pruned_loss=0.08581, over 4277198.26 frames. ], batch size: 508, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 16:44:59,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1011798.0, ans=0.1
2023-06-21 16:45:18,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0
2023-06-21 16:45:47,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.047e+02 3.537e+02 4.143e+02 9.363e+02, threshold=7.074e+02, percent-clipped=2.0
2023-06-21 16:46:16,807 INFO [train.py:996] (3/4) Epoch 6, batch 16200, loss[loss=0.2974, simple_loss=0.3599, pruned_loss=0.1174, over 21322.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3235, pruned_loss=0.08754, over 4282538.81 frames. ], batch size: 159, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 16:46:45,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0
2023-06-21 16:47:44,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1012278.0, ans=0.09899494936611666
2023-06-21 16:47:51,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5
2023-06-21 16:47:51,837 INFO [train.py:996] (3/4) Epoch 6, batch 16250, loss[loss=0.1694, simple_loss=0.2337, pruned_loss=0.05255, over 21789.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3228, pruned_loss=0.08861, over 4274112.73 frames. ], batch size: 102, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 16:48:29,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5
2023-06-21 16:48:29,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012458.0, ans=0.1
2023-06-21 16:49:01,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.762e+02 3.334e+02 4.108e+02 7.386e+02, threshold=6.668e+02, percent-clipped=1.0
2023-06-21 16:49:17,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0
2023-06-21 16:49:24,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1012638.0, ans=0.125
2023-06-21 16:49:26,005 INFO [train.py:996] (3/4) Epoch 6, batch 16300, loss[loss=0.2141, simple_loss=0.3002, pruned_loss=0.06405, over 21445.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3155, pruned_loss=0.08356, over 4275917.40 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 16:49:52,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1012698.0, ans=0.0
2023-06-21 16:49:52,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1012698.0, ans=0.125
2023-06-21 16:50:17,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1012758.0, ans=0.0
2023-06-21 16:50:44,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012878.0, ans=0.1
2023-06-21 16:50:56,881 INFO [train.py:996] (3/4) Epoch 6, batch 16350, loss[loss=0.2423, simple_loss=0.3256, pruned_loss=0.07952, over 20799.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3137, pruned_loss=0.08413, over 4266306.52 frames. ], batch size: 609, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 16:51:04,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0
2023-06-21 16:51:35,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013058.0, ans=0.1
2023-06-21 16:51:35,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1013058.0, ans=0.125
2023-06-21 16:52:11,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.615e+02 3.247e+02 3.873e+02 7.213e+02, threshold=6.493e+02, percent-clipped=3.0
2023-06-21 16:52:13,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1013118.0, ans=0.0
2023-06-21 16:52:14,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1013178.0, ans=0.2
2023-06-21 16:52:30,800 INFO [train.py:996] (3/4) Epoch 6, batch 16400, loss[loss=0.2263, simple_loss=0.2983, pruned_loss=0.07709, over 21808.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3167, pruned_loss=0.08503, over 4267081.65 frames. ], batch size: 282, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 16:53:16,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1013358.0, ans=0.125
2023-06-21 16:53:40,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1013418.0, ans=0.2
2023-06-21 16:54:04,532 INFO [train.py:996] (3/4) Epoch 6, batch 16450, loss[loss=0.2245, simple_loss=0.2926, pruned_loss=0.07815, over 21747.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3175, pruned_loss=0.08607, over 4269781.51 frames. ], batch size: 247, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 16:54:32,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1013598.0, ans=0.0
2023-06-21 16:55:13,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1013718.0, ans=0.125
2023-06-21 16:55:19,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.857e+02 3.262e+02 3.717e+02 6.839e+02, threshold=6.523e+02, percent-clipped=2.0
2023-06-21 16:55:30,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1013778.0, ans=0.125
2023-06-21 16:55:35,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1013778.0, ans=0.125
2023-06-21 16:55:39,117 INFO [train.py:996] (3/4) Epoch 6, batch 16500, loss[loss=0.1885, simple_loss=0.2396, pruned_loss=0.06873, over 21821.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3166, pruned_loss=0.0861, over 4271878.72 frames. ], batch size: 124, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 16:56:26,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1013958.0, ans=0.2
2023-06-21 16:56:50,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0
2023-06-21 16:56:54,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1014018.0, ans=0.2
2023-06-21 16:57:17,893 INFO [train.py:996] (3/4) Epoch 6, batch 16550, loss[loss=0.2686, simple_loss=0.3511, pruned_loss=0.09305, over 21728.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3147, pruned_loss=0.08387, over 4273459.10 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 16:57:18,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1014138.0, ans=0.0
2023-06-21 16:57:30,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1014138.0, ans=0.125
2023-06-21 16:58:02,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014258.0, ans=0.125
2023-06-21 16:58:03,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014258.0, ans=0.1
2023-06-21 16:58:14,284 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 16:58:29,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.954e+02 3.435e+02 4.498e+02 9.143e+02, threshold=6.870e+02, percent-clipped=8.0
2023-06-21 16:58:47,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1014378.0, ans=15.0
2023-06-21 16:58:54,060 INFO [train.py:996] (3/4) Epoch 6, batch 16600, loss[loss=0.4195, simple_loss=0.5202, pruned_loss=0.1595, over 19794.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3234, pruned_loss=0.08753, over 4275689.65 frames. ], batch size: 702, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 17:00:27,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1014678.0, ans=0.125
2023-06-21 17:00:30,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014678.0, ans=0.1
2023-06-21 17:00:34,745 INFO [train.py:996] (3/4) Epoch 6, batch 16650, loss[loss=0.2839, simple_loss=0.3577, pruned_loss=0.105, over 21366.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3337, pruned_loss=0.09075, over 4280396.86 frames. ], batch size: 549, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 17:01:01,903 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 17:01:37,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0
2023-06-21 17:01:47,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1014918.0, ans=0.0
2023-06-21 17:01:52,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.123e+02 3.593e+02 4.644e+02 7.930e+02, threshold=7.186e+02, percent-clipped=2.0
2023-06-21 17:02:21,297 INFO [train.py:996] (3/4) Epoch 6, batch 16700, loss[loss=0.2634, simple_loss=0.3692, pruned_loss=0.07881, over 20764.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3359, pruned_loss=0.09113, over 4281806.86 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 17:02:32,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1015038.0, ans=0.125
2023-06-21 17:03:47,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5
2023-06-21 17:03:57,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1015338.0, ans=0.2
2023-06-21 17:03:58,827 INFO [train.py:996] (3/4) Epoch 6, batch 16750, loss[loss=0.2772, simple_loss=0.3511, pruned_loss=0.1016, over 21795.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3381, pruned_loss=0.09378, over 4279236.75 frames. ], batch size: 124, lr: 5.08e-03, grad_scale: 16.0
2023-06-21 17:04:34,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1015398.0, ans=0.05
2023-06-21 17:05:11,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1015518.0, ans=0.125
2023-06-21 17:05:12,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.506e+02 4.253e+02 6.038e+02 1.079e+03, threshold=8.506e+02, percent-clipped=10.0
2023-06-21 17:05:26,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1015578.0, ans=0.0
2023-06-21 17:05:32,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1015578.0, ans=0.035
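The [scaling.py:962] Whitening entries compare a per-module whitening metric against a limit, with a corrective penalty presumably applied only when the metric exceeds the limit. One natural metric with the right behaviour, not necessarily the exact formula in scaling.py, is the ratio mean(eig^2) / mean(eig)^2 of the channel covariance: it is >= 1, with equality exactly when the covariance is a multiple of the identity (perfectly "white" features), and it grows as channels become correlated or unequally scaled:

# Hedged sketch of a whitening metric; shapes and the formula are assumptions.
import torch


def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                  # (C, C) covariance
    mean_eig = torch.diagonal(cov).mean()           # mean eigenvalue = mean of diag
    mean_eig_sq = torch.diagonal(cov @ cov).mean()  # mean of eigenvalue^2
    return mean_eig_sq / (mean_eig ** 2 + 1e-20)


torch.manual_seed(0)
white = torch.randn(1000, 256)
print(whitening_metric(white))                    # near 1 (finite-sample noise pushes it a bit above)
print(whitening_metric(white * torch.rand(256)))  # noticeably larger once channel scales differ

Computing the metric from diagonals of cov and cov @ cov avoids an eigendecomposition, which matters when the check runs on every training batch.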
2023-06-21 17:05:34,885 INFO [train.py:996] (3/4) Epoch 6, batch 16800, loss[loss=0.2704, simple_loss=0.3449, pruned_loss=0.09794, over 21767.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3409, pruned_loss=0.09304, over 4276170.18 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0
2023-06-21 17:06:51,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015878.0, ans=0.1
2023-06-21 17:07:08,793 INFO [train.py:996] (3/4) Epoch 6, batch 16850, loss[loss=0.3086, simple_loss=0.3488, pruned_loss=0.1342, over 21844.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3372, pruned_loss=0.09305, over 4283222.64 frames. ], batch size: 508, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:07:57,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5
2023-06-21 17:07:59,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1016058.0, ans=0.0
2023-06-21 17:08:02,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1016058.0, ans=0.1
2023-06-21 17:08:21,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.931e+02 3.423e+02 4.482e+02 7.655e+02, threshold=6.845e+02, percent-clipped=0.0
2023-06-21 17:08:40,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016178.0, ans=0.1
2023-06-21 17:08:41,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1016178.0, ans=0.0
2023-06-21 17:08:43,845 INFO [train.py:996] (3/4) Epoch 6, batch 16900, loss[loss=0.1917, simple_loss=0.2765, pruned_loss=0.05348, over 21612.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3322, pruned_loss=0.09191, over 4291736.26 frames. ], batch size: 263, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:08:46,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0
2023-06-21 17:09:50,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1016418.0, ans=0.125
2023-06-21 17:09:55,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1016478.0, ans=0.2
2023-06-21 17:10:14,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1016478.0, ans=0.0
2023-06-21 17:10:16,671 INFO [train.py:996] (3/4) Epoch 6, batch 16950, loss[loss=0.206, simple_loss=0.2728, pruned_loss=0.06965, over 21184.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3249, pruned_loss=0.0905, over 4288345.09 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:10:21,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1016538.0, ans=0.125
2023-06-21 17:10:41,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1016598.0, ans=0.125
2023-06-21 17:11:11,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1016658.0, ans=0.2
2023-06-21 17:11:13,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1016718.0, ans=0.0
2023-06-21 17:11:18,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1016718.0, ans=0.125
2023-06-21 17:11:26,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.733e+02 3.000e+02 3.564e+02 5.984e+02, threshold=6.000e+02, percent-clipped=0.0
2023-06-21 17:11:31,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0
2023-06-21 17:11:39,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016778.0, ans=0.125
2023-06-21 17:11:50,091 INFO [train.py:996] (3/4) Epoch 6, batch 17000, loss[loss=0.2595, simple_loss=0.3206, pruned_loss=0.09924, over 21914.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3209, pruned_loss=0.09014, over 4286997.71 frames. ], batch size: 351, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:13:20,259 INFO [train.py:996] (3/4) Epoch 6, batch 17050, loss[loss=0.2872, simple_loss=0.3688, pruned_loss=0.1029, over 21841.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3279, pruned_loss=0.09286, over 4288801.47 frames. ], batch size: 351, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:13:23,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1017138.0, ans=0.125
2023-06-21 17:13:32,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1017138.0, ans=10.0
2023-06-21 17:13:48,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1017198.0, ans=0.125
2023-06-21 17:13:51,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017198.0, ans=0.1
2023-06-21 17:14:05,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0
2023-06-21 17:14:17,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1017318.0, ans=0.125
2023-06-21 17:14:30,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 3.084e+02 3.616e+02 4.433e+02 7.180e+02, threshold=7.232e+02, percent-clipped=5.0
2023-06-21 17:14:51,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.81 vs. limit=15.0
2023-06-21 17:14:52,389 INFO [train.py:996] (3/4) Epoch 6, batch 17100, loss[loss=0.2259, simple_loss=0.2918, pruned_loss=0.08007, over 21923.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3274, pruned_loss=0.09364, over 4296334.56 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:14:55,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017438.0, ans=0.1
2023-06-21 17:14:57,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017438.0, ans=0.1
2023-06-21 17:15:26,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1017498.0, ans=0.035
2023-06-21 17:15:47,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1017558.0, ans=0.125
2023-06-21 17:15:47,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1017558.0, ans=0.04949747468305833
2023-06-21 17:16:10,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5
2023-06-21 17:16:26,686 INFO [train.py:996] (3/4) Epoch 6, batch 17150, loss[loss=0.2443, simple_loss=0.3236, pruned_loss=0.08247, over 21781.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3238, pruned_loss=0.09365, over 4302202.14 frames. ], batch size: 351, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:16:48,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1017738.0, ans=0.125
2023-06-21 17:16:49,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0
2023-06-21 17:17:09,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.31 vs. limit=10.0
2023-06-21 17:17:35,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1017918.0, ans=0.125
2023-06-21 17:17:39,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 2.784e+02 3.036e+02 3.616e+02 5.334e+02, threshold=6.072e+02, percent-clipped=0.0
2023-06-21 17:18:05,510 INFO [train.py:996] (3/4) Epoch 6, batch 17200, loss[loss=0.2629, simple_loss=0.3345, pruned_loss=0.09565, over 21586.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3245, pruned_loss=0.09326, over 4296839.80 frames. ], batch size: 263, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:18:50,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0
2023-06-21 17:19:09,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1018278.0, ans=0.125
2023-06-21 17:19:40,271 INFO [train.py:996] (3/4) Epoch 6, batch 17250, loss[loss=0.2465, simple_loss=0.3221, pruned_loss=0.08547, over 21714.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3289, pruned_loss=0.09578, over 4295211.32 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 32.0
2023-06-21 17:19:45,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1018338.0, ans=0.0
2023-06-21 17:19:45,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1018338.0, ans=0.0
2023-06-21 17:20:08,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0
2023-06-21 17:20:54,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.506e+02 3.363e+02 4.058e+02 5.457e+02 1.011e+03, threshold=8.116e+02, percent-clipped=16.0
2023-06-21 17:21:09,798 INFO [train.py:996] (3/4) Epoch 6, batch 17300, loss[loss=0.2905, simple_loss=0.3608, pruned_loss=0.1101, over 21736.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3383, pruned_loss=0.09994, over 4293979.66 frames. ], batch size: 247, lr: 5.07e-03, grad_scale: 16.0
2023-06-21 17:21:52,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0
2023-06-21 17:21:53,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1018758.0, ans=0.125
2023-06-21 17:22:33,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1018878.0, ans=0.0
2023-06-21 17:22:40,268 INFO [train.py:996] (3/4) Epoch 6, batch 17350, loss[loss=0.2874, simple_loss=0.373, pruned_loss=0.1009, over 21481.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3388, pruned_loss=0.09877, over 4287693.76 frames. ], batch size: 471, lr: 5.07e-03, grad_scale: 16.0
2023-06-21 17:22:57,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1018938.0, ans=0.125
2023-06-21 17:23:17,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019058.0, ans=0.125
2023-06-21 17:23:45,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1019118.0, ans=0.125
2023-06-21 17:23:47,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1019118.0, ans=0.1
2023-06-21 17:23:56,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.958e+02 3.315e+02 3.844e+02 7.686e+02, threshold=6.630e+02, percent-clipped=0.0
2023-06-21 17:24:08,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1019178.0, ans=0.04949747468305833
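In the [train.py:996] entries the three reported numbers are consistently related by loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 x 0.3383 + 0.09994 = 0.2691 for the batch 17300 tot_loss above). This matches k2's pruned-transducer training, where a cheap "simple" RNN-T loss selects the pruning bounds and the exact loss is evaluated only on the pruned lattice. A sketch of just the combination; the argument names and default scales are illustrative, and the recipe's actual scales and warmup schedule live in train.py:

# Hedged sketch of the loss combination implied by the logged numbers.
import torch


def total_loss(simple_loss: torch.Tensor,
               pruned_loss: torch.Tensor,
               simple_loss_scale: float = 0.5,
               pruned_loss_scale: float = 1.0) -> torch.Tensor:
    # Reproduces the logged relationship, e.g.
    # 0.5 * 0.3383 + 0.09994 = 0.2691 for the batch 17300 tot_loss.
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

Keeping a weight on the simple loss even after warmup keeps the pruning-bound estimator trained, so the bounds it picks for the pruned loss stay accurate.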
2023-06-21 17:24:11,636 INFO [train.py:996] (3/4) Epoch 6, batch 17400, loss[loss=0.3008, simple_loss=0.3909, pruned_loss=0.1053, over 21219.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3347, pruned_loss=0.09469, over 4283683.90 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0
2023-06-21 17:24:21,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1019238.0, ans=0.125
2023-06-21 17:24:28,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1019238.0, ans=0.09899494936611666
2023-06-21 17:24:33,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1019298.0, ans=0.125
2023-06-21 17:24:34,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1019298.0, ans=0.0
2023-06-21 17:25:11,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1019418.0, ans=0.1
2023-06-21 17:25:27,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1019478.0, ans=6.0
2023-06-21 17:25:45,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1019538.0, ans=0.0
2023-06-21 17:25:46,422 INFO [train.py:996] (3/4) Epoch 6, batch 17450, loss[loss=0.2096, simple_loss=0.3033, pruned_loss=0.05795, over 21722.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3309, pruned_loss=0.09204, over 4270688.75 frames. ], batch size: 332, lr: 5.07e-03, grad_scale: 16.0
2023-06-21 17:25:48,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019538.0, ans=0.1
2023-06-21 17:25:54,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019538.0, ans=0.1
2023-06-21 17:26:46,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1019718.0, ans=0.125
2023-06-21 17:26:55,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0
2023-06-21 17:27:05,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.736e+02 3.255e+02 3.942e+02 6.172e+02, threshold=6.510e+02, percent-clipped=0.0
2023-06-21 17:27:16,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1019778.0, ans=0.125
2023-06-21 17:27:18,935 INFO [train.py:996] (3/4) Epoch 6, batch 17500, loss[loss=0.2846, simple_loss=0.3448, pruned_loss=0.1123, over 21815.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3249, pruned_loss=0.08887, over 4281710.04 frames. ], batch size: 112, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:27:42,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019898.0, ans=0.125
2023-06-21 17:28:16,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1020018.0, ans=0.125
2023-06-21 17:28:32,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1020078.0, ans=0.1
2023-06-21 17:28:44,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0
2023-06-21 17:28:50,808 INFO [train.py:996] (3/4) Epoch 6, batch 17550, loss[loss=0.2335, simple_loss=0.319, pruned_loss=0.07398, over 21786.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3249, pruned_loss=0.08757, over 4290363.36 frames. ], batch size: 332, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:28:51,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1020138.0, ans=0.125
2023-06-21 17:29:36,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1020258.0, ans=0.0
2023-06-21 17:29:58,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1020318.0, ans=0.0
2023-06-21 17:30:05,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1020378.0, ans=0.2
2023-06-21 17:30:11,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.780e+02 3.218e+02 4.154e+02 6.196e+02, threshold=6.435e+02, percent-clipped=0.0
2023-06-21 17:30:11,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1020378.0, ans=0.0
2023-06-21 17:30:24,475 INFO [train.py:996] (3/4) Epoch 6, batch 17600, loss[loss=0.2657, simple_loss=0.342, pruned_loss=0.09471, over 21737.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3282, pruned_loss=0.08816, over 4290449.08 frames. ], batch size: 298, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:30:37,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=22.5
2023-06-21 17:31:00,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1020498.0, ans=0.125
2023-06-21 17:31:00,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1020498.0, ans=0.125
2023-06-21 17:31:14,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1020558.0, ans=0.125
2023-06-21 17:31:17,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1020558.0, ans=0.05
2023-06-21 17:31:43,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1020678.0, ans=0.0
2023-06-21 17:31:59,657 INFO [train.py:996] (3/4) Epoch 6, batch 17650, loss[loss=0.1777, simple_loss=0.2051, pruned_loss=0.07513, over 15911.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3245, pruned_loss=0.08793, over 4276626.55 frames. ], batch size: 61, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:33:20,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.977e+02 3.448e+02 4.059e+02 7.958e+02, threshold=6.896e+02, percent-clipped=7.0
2023-06-21 17:33:39,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1020978.0, ans=0.125
2023-06-21 17:33:47,959 INFO [train.py:996] (3/4) Epoch 6, batch 17700, loss[loss=0.1997, simple_loss=0.2798, pruned_loss=0.05985, over 21377.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3194, pruned_loss=0.08508, over 4276148.04 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:34:01,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021098.0, ans=0.1
2023-06-21 17:34:08,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1021098.0, ans=0.0
2023-06-21 17:34:27,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1021158.0, ans=0.2
2023-06-21 17:35:13,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0
2023-06-21 17:35:18,343 INFO [train.py:996] (3/4) Epoch 6, batch 17750, loss[loss=0.2685, simple_loss=0.3477, pruned_loss=0.09462, over 21457.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3262, pruned_loss=0.08873, over 4281424.40 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:35:35,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1021398.0, ans=0.0
2023-06-21 17:35:49,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0
2023-06-21 17:35:50,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1021398.0, ans=0.125
2023-06-21 17:36:27,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5
2023-06-21 17:36:36,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.843e+02 3.336e+02 3.898e+02 5.169e+02, threshold=6.672e+02, percent-clipped=0.0
2023-06-21 17:36:39,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5
2023-06-21 17:36:49,081 INFO [train.py:996] (3/4) Epoch 6, batch 17800, loss[loss=0.2455, simple_loss=0.3166, pruned_loss=0.08718, over 21409.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.326, pruned_loss=0.08786, over 4283337.42 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:36:51,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1021638.0, ans=0.125
2023-06-21 17:37:15,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1021698.0, ans=0.125
2023-06-21 17:38:15,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1021878.0, ans=0.2
2023-06-21 17:38:19,596 INFO [train.py:996] (3/4) Epoch 6, batch 17850, loss[loss=0.2731, simple_loss=0.3352, pruned_loss=0.1056, over 20055.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.328, pruned_loss=0.08919, over 4276455.20 frames. ], batch size: 704, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:38:22,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1021938.0, ans=0.125
2023-06-21 17:38:45,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=15.0
2023-06-21 17:38:49,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.26 vs. limit=6.0
2023-06-21 17:39:37,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 3.033e+02 3.417e+02 4.325e+02 8.227e+02, threshold=6.834e+02, percent-clipped=5.0
2023-06-21 17:39:43,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0
2023-06-21 17:39:54,788 INFO [train.py:996] (3/4) Epoch 6, batch 17900, loss[loss=0.2518, simple_loss=0.3337, pruned_loss=0.08493, over 21288.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3329, pruned_loss=0.09101, over 4273647.57 frames. ], batch size: 176, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:39:55,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0
2023-06-21 17:40:15,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0
2023-06-21 17:40:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1022418.0, ans=0.2
2023-06-21 17:41:01,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1022418.0, ans=0.125
2023-06-21 17:41:29,375 INFO [train.py:996] (3/4) Epoch 6, batch 17950, loss[loss=0.2245, simple_loss=0.3046, pruned_loss=0.0722, over 21681.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3322, pruned_loss=0.08745, over 4280962.92 frames. ], batch size: 247, lr: 5.06e-03, grad_scale: 8.0
2023-06-21 17:42:14,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1022658.0, ans=0.0
2023-06-21 17:42:27,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1022718.0, ans=0.2
2023-06-21 17:42:35,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1022718.0, ans=0.1
2023-06-21 17:42:44,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1022778.0, ans=0.0
2023-06-21 17:42:45,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.768e+02 3.134e+02 4.074e+02 6.684e+02, threshold=6.268e+02, percent-clipped=0.0
2023-06-21 17:42:48,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1022778.0, ans=0.125
2023-06-21 17:42:57,280 INFO [train.py:996] (3/4) Epoch 6, batch 18000, loss[loss=0.2241, simple_loss=0.2823, pruned_loss=0.08292, over 21443.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3242, pruned_loss=0.08624, over 4278257.92 frames. ], batch size: 195, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:42:57,280 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 17:43:06,001 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.7874, 3.1066, 1.6728, 1.6694], device='cuda:3')
2023-06-21 17:43:13,463 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2661, simple_loss=0.365, pruned_loss=0.08355, over 1796401.00 frames.
2023-06-21 17:43:13,463 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-21 17:44:23,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1023018.0, ans=0.0
2023-06-21 17:44:27,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023078.0, ans=0.1
2023-06-21 17:44:42,379 INFO [train.py:996] (3/4) Epoch 6, batch 18050, loss[loss=0.2769, simple_loss=0.3344, pruned_loss=0.1096, over 21355.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3175, pruned_loss=0.08501, over 4271669.12 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:44:44,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1023138.0, ans=0.125
2023-06-21 17:45:20,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1023198.0, ans=0.125
2023-06-21 17:45:48,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1023318.0, ans=0.125
2023-06-21 17:45:52,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1023318.0, ans=0.125
2023-06-21 17:46:01,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.145e+02 3.852e+02 4.625e+02 8.498e+02, threshold=7.705e+02, percent-clipped=3.0
2023-06-21 17:46:18,320 INFO [train.py:996] (3/4) Epoch 6, batch 18100, loss[loss=0.2372, simple_loss=0.3187, pruned_loss=0.07787, over 21410.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.323, pruned_loss=0.08764, over 4263559.53 frames. ], batch size: 194, lr: 5.06e-03, grad_scale: 16.0
2023-06-21 17:46:32,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1023438.0, ans=0.2
2023-06-21 17:47:01,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1023558.0, ans=0.0
2023-06-21 17:47:06,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1023558.0, ans=0.0
2023-06-21 17:47:15,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1023618.0, ans=0.125
2023-06-21 17:47:31,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1023678.0, ans=0.0
2023-06-21 17:47:34,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1023678.0, ans=0.05
2023-06-21 17:47:52,819 INFO [train.py:996] (3/4) Epoch 6, batch 18150, loss[loss=0.2147, simple_loss=0.3029, pruned_loss=0.06324, over 21800.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3257, pruned_loss=0.08837, over 4270794.12 frames.
], batch size: 282, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:47:57,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023738.0, ans=0.1 2023-06-21 17:48:06,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1023738.0, ans=0.125 2023-06-21 17:48:07,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-21 17:48:13,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-21 17:48:18,816 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:48:22,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-21 17:48:53,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1023918.0, ans=0.2 2023-06-21 17:48:55,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-21 17:48:55,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 17:49:04,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.867e+02 3.393e+02 4.008e+02 7.433e+02, threshold=6.785e+02, percent-clipped=0.0 2023-06-21 17:49:13,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1023978.0, ans=0.125 2023-06-21 17:49:16,095 INFO [train.py:996] (3/4) Epoch 6, batch 18200, loss[loss=0.2197, simple_loss=0.2811, pruned_loss=0.07922, over 21881.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3189, pruned_loss=0.08778, over 4277409.81 frames. ], batch size: 98, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:50:17,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024218.0, ans=0.1 2023-06-21 17:50:25,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-21 17:50:39,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=22.5 2023-06-21 17:50:47,202 INFO [train.py:996] (3/4) Epoch 6, batch 18250, loss[loss=0.2068, simple_loss=0.2696, pruned_loss=0.07205, over 21482.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3098, pruned_loss=0.0844, over 4276405.93 frames. 
], batch size: 194, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:50:50,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1024338.0, ans=0.2 2023-06-21 17:50:51,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1024338.0, ans=0.125 2023-06-21 17:50:58,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1024338.0, ans=0.2 2023-06-21 17:51:12,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1024398.0, ans=0.125 2023-06-21 17:52:04,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.691e+02 3.171e+02 4.057e+02 6.181e+02, threshold=6.342e+02, percent-clipped=0.0 2023-06-21 17:52:16,382 INFO [train.py:996] (3/4) Epoch 6, batch 18300, loss[loss=0.2337, simple_loss=0.303, pruned_loss=0.08225, over 21346.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3104, pruned_loss=0.08442, over 4279669.09 frames. ], batch size: 176, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:52:44,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1024698.0, ans=0.125 2023-06-21 17:52:57,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1024698.0, ans=0.125 2023-06-21 17:53:26,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1024818.0, ans=0.5 2023-06-21 17:53:49,934 INFO [train.py:996] (3/4) Epoch 6, batch 18350, loss[loss=0.2484, simple_loss=0.3168, pruned_loss=0.09005, over 21339.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3167, pruned_loss=0.08432, over 4278298.13 frames. ], batch size: 471, lr: 5.05e-03, grad_scale: 8.0 2023-06-21 17:54:56,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1025118.0, ans=0.125 2023-06-21 17:55:12,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1025178.0, ans=0.2 2023-06-21 17:55:13,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.969e+02 3.614e+02 4.589e+02 8.811e+02, threshold=7.228e+02, percent-clipped=4.0 2023-06-21 17:55:24,021 INFO [train.py:996] (3/4) Epoch 6, batch 18400, loss[loss=0.1881, simple_loss=0.2936, pruned_loss=0.0413, over 20816.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3123, pruned_loss=0.08329, over 4277863.10 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:55:28,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1025238.0, ans=0.125 2023-06-21 17:55:30,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1025238.0, ans=0.0 2023-06-21 17:55:40,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-21 17:56:40,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1025418.0, ans=0.125 2023-06-21 17:56:57,261 INFO [train.py:996] (3/4) Epoch 6, batch 18450, loss[loss=0.2392, simple_loss=0.2981, pruned_loss=0.09017, over 22013.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3072, pruned_loss=0.07905, over 4279719.33 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:57:14,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1025538.0, ans=0.015 2023-06-21 17:57:42,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-21 17:57:59,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1025718.0, ans=0.0 2023-06-21 17:58:20,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.633e+02 3.038e+02 3.698e+02 5.788e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-21 17:58:22,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. limit=12.0 2023-06-21 17:58:30,583 INFO [train.py:996] (3/4) Epoch 6, batch 18500, loss[loss=0.2417, simple_loss=0.299, pruned_loss=0.09219, over 21974.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3025, pruned_loss=0.07815, over 4267134.62 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:58:36,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1025838.0, ans=0.125 2023-06-21 18:00:02,402 INFO [train.py:996] (3/4) Epoch 6, batch 18550, loss[loss=0.2067, simple_loss=0.2622, pruned_loss=0.07565, over 20704.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3, pruned_loss=0.07709, over 4243636.06 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:00:34,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1026198.0, ans=0.125 2023-06-21 18:00:53,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1026258.0, ans=0.125 2023-06-21 18:01:10,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1026318.0, ans=10.0 2023-06-21 18:01:10,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.49 vs. limit=15.0 2023-06-21 18:01:21,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-21 18:01:26,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.898e+02 3.436e+02 4.284e+02 7.618e+02, threshold=6.872e+02, percent-clipped=4.0 2023-06-21 18:01:34,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-21 18:01:36,729 INFO [train.py:996] (3/4) Epoch 6, batch 18600, loss[loss=0.2351, simple_loss=0.2973, pruned_loss=0.08644, over 21510.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2995, pruned_loss=0.07857, over 4244642.43 frames. 
], batch size: 230, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:02:54,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1026678.0, ans=0.125 2023-06-21 18:03:05,863 INFO [train.py:996] (3/4) Epoch 6, batch 18650, loss[loss=0.2436, simple_loss=0.3053, pruned_loss=0.09095, over 20239.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3007, pruned_loss=0.07913, over 4254130.54 frames. ], batch size: 702, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:03:15,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1026738.0, ans=0.025 2023-06-21 18:03:50,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-21 18:04:28,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.661e+02 3.021e+02 3.518e+02 6.281e+02, threshold=6.043e+02, percent-clipped=0.0 2023-06-21 18:04:38,438 INFO [train.py:996] (3/4) Epoch 6, batch 18700, loss[loss=0.2216, simple_loss=0.2844, pruned_loss=0.07943, over 21818.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2992, pruned_loss=0.08126, over 4249723.98 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:06:11,375 INFO [train.py:996] (3/4) Epoch 6, batch 18750, loss[loss=0.2661, simple_loss=0.3424, pruned_loss=0.09489, over 21654.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3018, pruned_loss=0.08403, over 4263586.11 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:06:11,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1027338.0, ans=0.0 2023-06-21 18:06:53,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1027458.0, ans=0.0 2023-06-21 18:07:34,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 2.900e+02 3.294e+02 3.931e+02 7.649e+02, threshold=6.589e+02, percent-clipped=4.0 2023-06-21 18:07:35,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1027578.0, ans=0.125 2023-06-21 18:07:36,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1027578.0, ans=0.1 2023-06-21 18:07:45,353 INFO [train.py:996] (3/4) Epoch 6, batch 18800, loss[loss=0.2098, simple_loss=0.2985, pruned_loss=0.06052, over 21657.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3099, pruned_loss=0.0859, over 4265274.13 frames. ], batch size: 247, lr: 5.05e-03, grad_scale: 32.0 2023-06-21 18:08:49,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1027818.0, ans=0.0 2023-06-21 18:09:03,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1027878.0, ans=0.5 2023-06-21 18:09:18,547 INFO [train.py:996] (3/4) Epoch 6, batch 18850, loss[loss=0.2444, simple_loss=0.3078, pruned_loss=0.09047, over 20096.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3045, pruned_loss=0.08081, over 4258024.90 frames. 
], batch size: 702, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:09:42,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1027998.0, ans=0.05 2023-06-21 18:09:45,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027998.0, ans=0.1 2023-06-21 18:10:41,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.569e+02 2.914e+02 3.331e+02 4.866e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 18:10:48,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1028178.0, ans=0.025 2023-06-21 18:10:51,727 INFO [train.py:996] (3/4) Epoch 6, batch 18900, loss[loss=0.2439, simple_loss=0.2961, pruned_loss=0.09587, over 21523.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.301, pruned_loss=0.08039, over 4265653.68 frames. ], batch size: 442, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:10:53,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1028238.0, ans=0.125 2023-06-21 18:10:58,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1028238.0, ans=0.0 2023-06-21 18:11:09,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-21 18:11:21,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1028298.0, ans=0.125 2023-06-21 18:12:24,799 INFO [train.py:996] (3/4) Epoch 6, batch 18950, loss[loss=0.2585, simple_loss=0.3201, pruned_loss=0.09849, over 21779.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3027, pruned_loss=0.08311, over 4271135.99 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:12:37,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1028538.0, ans=0.125 2023-06-21 18:12:55,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1028598.0, ans=0.125 2023-06-21 18:13:19,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-21 18:13:48,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.005e+02 3.685e+02 4.757e+02 8.623e+02, threshold=7.371e+02, percent-clipped=7.0 2023-06-21 18:14:04,080 INFO [train.py:996] (3/4) Epoch 6, batch 19000, loss[loss=0.3118, simple_loss=0.3736, pruned_loss=0.125, over 21337.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3127, pruned_loss=0.08496, over 4273190.37 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:15:06,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-21 18:15:32,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1029078.0, ans=0.0 2023-06-21 18:15:36,829 INFO [train.py:996] (3/4) Epoch 6, batch 19050, loss[loss=0.2498, simple_loss=0.3078, pruned_loss=0.09588, over 21482.00 frames. 
], tot_loss[loss=0.2462, simple_loss=0.3163, pruned_loss=0.08806, over 4273416.86 frames. ], batch size: 131, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:15:37,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1029138.0, ans=0.125 2023-06-21 18:15:53,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-06-21 18:16:06,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1029198.0, ans=0.125 2023-06-21 18:16:54,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.879e+02 3.312e+02 3.801e+02 5.598e+02, threshold=6.624e+02, percent-clipped=0.0 2023-06-21 18:17:10,666 INFO [train.py:996] (3/4) Epoch 6, batch 19100, loss[loss=0.2428, simple_loss=0.2994, pruned_loss=0.09311, over 21601.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3162, pruned_loss=0.09017, over 4282337.05 frames. ], batch size: 263, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:18:00,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-21 18:18:17,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-21 18:18:51,222 INFO [train.py:996] (3/4) Epoch 6, batch 19150, loss[loss=0.2277, simple_loss=0.3063, pruned_loss=0.07457, over 21292.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3166, pruned_loss=0.09033, over 4279015.13 frames. ], batch size: 159, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:19:20,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1029798.0, ans=0.125 2023-06-21 18:20:02,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1029918.0, ans=0.0 2023-06-21 18:20:16,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1029978.0, ans=0.0 2023-06-21 18:20:17,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.009e+02 3.594e+02 4.563e+02 7.134e+02, threshold=7.188e+02, percent-clipped=1.0 2023-06-21 18:20:31,688 INFO [train.py:996] (3/4) Epoch 6, batch 19200, loss[loss=0.2438, simple_loss=0.3379, pruned_loss=0.07484, over 21250.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3278, pruned_loss=0.0913, over 4277558.07 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:20:47,727 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-21 18:20:59,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1030098.0, ans=0.125 2023-06-21 18:21:12,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 18:21:18,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1030158.0, ans=0.0 2023-06-21 18:21:22,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=12.0 2023-06-21 18:21:43,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=22.5 2023-06-21 18:22:00,349 INFO [train.py:996] (3/4) Epoch 6, batch 19250, loss[loss=0.2129, simple_loss=0.2888, pruned_loss=0.0685, over 21296.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3265, pruned_loss=0.08536, over 4280068.52 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:23:05,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1030518.0, ans=0.125 2023-06-21 18:23:25,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.595e+02 3.039e+02 3.485e+02 4.997e+02, threshold=6.078e+02, percent-clipped=0.0 2023-06-21 18:23:30,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1030578.0, ans=0.125 2023-06-21 18:23:37,841 INFO [train.py:996] (3/4) Epoch 6, batch 19300, loss[loss=0.2134, simple_loss=0.2955, pruned_loss=0.06564, over 21801.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3234, pruned_loss=0.08539, over 4283052.27 frames. ], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:23:40,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1030638.0, ans=0.125 2023-06-21 18:23:46,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-21 18:23:49,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030638.0, ans=0.1 2023-06-21 18:24:03,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1030698.0, ans=0.125 2023-06-21 18:24:09,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1030698.0, ans=0.05 2023-06-21 18:24:12,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1030758.0, ans=0.125 2023-06-21 18:24:57,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030878.0, ans=0.1 2023-06-21 18:25:17,123 INFO [train.py:996] (3/4) Epoch 6, batch 19350, loss[loss=0.2326, simple_loss=0.3224, pruned_loss=0.07139, over 21720.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3178, pruned_loss=0.08134, over 4275216.13 frames. 
], batch size: 415, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:25:26,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:25:32,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1030998.0, ans=0.125 2023-06-21 18:25:41,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030998.0, ans=0.1 2023-06-21 18:26:05,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1031118.0, ans=0.125 2023-06-21 18:26:38,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.597e+02 3.186e+02 4.013e+02 6.947e+02, threshold=6.372e+02, percent-clipped=2.0 2023-06-21 18:26:46,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031178.0, ans=0.1 2023-06-21 18:26:50,822 INFO [train.py:996] (3/4) Epoch 6, batch 19400, loss[loss=0.2323, simple_loss=0.2924, pruned_loss=0.08607, over 21602.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3181, pruned_loss=0.08102, over 4272691.21 frames. ], batch size: 212, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:26:53,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-21 18:27:27,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1031358.0, ans=0.125 2023-06-21 18:27:32,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.12 vs. limit=12.0 2023-06-21 18:27:42,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1031418.0, ans=0.125 2023-06-21 18:28:24,307 INFO [train.py:996] (3/4) Epoch 6, batch 19450, loss[loss=0.2543, simple_loss=0.3175, pruned_loss=0.09553, over 14822.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3149, pruned_loss=0.08292, over 4272219.85 frames. ], batch size: 60, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:28:32,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1031538.0, ans=0.2 2023-06-21 18:29:12,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1031718.0, ans=0.125 2023-06-21 18:29:21,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1031718.0, ans=0.125 2023-06-21 18:29:51,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.868e+02 3.154e+02 3.556e+02 5.974e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-21 18:29:58,638 INFO [train.py:996] (3/4) Epoch 6, batch 19500, loss[loss=0.2542, simple_loss=0.339, pruned_loss=0.08472, over 21161.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3109, pruned_loss=0.08398, over 4277556.40 frames. 
], batch size: 548, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:30:05,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1031838.0, ans=0.0 2023-06-21 18:30:08,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1031838.0, ans=0.125 2023-06-21 18:31:11,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1032018.0, ans=0.2 2023-06-21 18:31:21,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-21 18:31:34,627 INFO [train.py:996] (3/4) Epoch 6, batch 19550, loss[loss=0.172, simple_loss=0.2406, pruned_loss=0.05172, over 21561.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3055, pruned_loss=0.08191, over 4281833.02 frames. ], batch size: 195, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:31:42,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1032138.0, ans=0.0 2023-06-21 18:31:53,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1032198.0, ans=0.125 2023-06-21 18:32:22,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1032318.0, ans=0.125 2023-06-21 18:33:00,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.941e+02 3.432e+02 4.314e+02 8.392e+02, threshold=6.865e+02, percent-clipped=2.0 2023-06-21 18:33:07,588 INFO [train.py:996] (3/4) Epoch 6, batch 19600, loss[loss=0.2947, simple_loss=0.3517, pruned_loss=0.1188, over 21280.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3065, pruned_loss=0.08247, over 4276278.84 frames. ], batch size: 143, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:33:07,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032438.0, ans=0.1 2023-06-21 18:34:09,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1032618.0, ans=0.125 2023-06-21 18:34:10,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032618.0, ans=0.1 2023-06-21 18:34:34,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=12.0 2023-06-21 18:34:42,383 INFO [train.py:996] (3/4) Epoch 6, batch 19650, loss[loss=0.2541, simple_loss=0.3184, pruned_loss=0.09491, over 21815.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3119, pruned_loss=0.08655, over 4278145.70 frames. 
], batch size: 351, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:34:53,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1032738.0, ans=0.2 2023-06-21 18:35:44,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1032918.0, ans=0.125 2023-06-21 18:36:03,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1032978.0, ans=0.125 2023-06-21 18:36:10,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.307e+02 3.868e+02 4.631e+02 9.125e+02, threshold=7.736e+02, percent-clipped=5.0 2023-06-21 18:36:23,457 INFO [train.py:996] (3/4) Epoch 6, batch 19700, loss[loss=0.2408, simple_loss=0.3224, pruned_loss=0.07959, over 21682.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3166, pruned_loss=0.08729, over 4274452.62 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:37:06,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1033158.0, ans=0.5 2023-06-21 18:37:57,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1033338.0, ans=0.125 2023-06-21 18:37:58,189 INFO [train.py:996] (3/4) Epoch 6, batch 19750, loss[loss=0.2406, simple_loss=0.3167, pruned_loss=0.08224, over 21759.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.326, pruned_loss=0.08888, over 4271035.41 frames. ], batch size: 124, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:38:07,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1033338.0, ans=0.125 2023-06-21 18:38:20,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1033398.0, ans=0.125 2023-06-21 18:38:23,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1033398.0, ans=0.0 2023-06-21 18:39:13,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1033578.0, ans=0.125 2023-06-21 18:39:24,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.146e+02 3.443e+02 4.036e+02 6.788e+02, threshold=6.886e+02, percent-clipped=0.0 2023-06-21 18:39:26,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1033578.0, ans=0.125 2023-06-21 18:39:31,805 INFO [train.py:996] (3/4) Epoch 6, batch 19800, loss[loss=0.2899, simple_loss=0.3626, pruned_loss=0.1086, over 21484.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3247, pruned_loss=0.08955, over 4270081.33 frames. ], batch size: 471, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:39:38,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1033638.0, ans=0.0 2023-06-21 18:40:13,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2023-06-21 18:40:51,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1033878.0, ans=10.0 2023-06-21 18:41:11,659 INFO [train.py:996] (3/4) Epoch 6, batch 19850, loss[loss=0.2208, simple_loss=0.2969, pruned_loss=0.07234, over 21717.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3169, pruned_loss=0.08464, over 4264997.55 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:41:27,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1033938.0, ans=0.125 2023-06-21 18:41:28,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1033938.0, ans=0.125 2023-06-21 18:41:28,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033938.0, ans=0.125 2023-06-21 18:42:33,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.771e+02 3.274e+02 3.883e+02 8.276e+02, threshold=6.549e+02, percent-clipped=3.0 2023-06-21 18:42:44,454 INFO [train.py:996] (3/4) Epoch 6, batch 19900, loss[loss=0.2297, simple_loss=0.2947, pruned_loss=0.08237, over 21858.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3172, pruned_loss=0.08205, over 4259307.83 frames. ], batch size: 98, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:42:49,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1034238.0, ans=0.035 2023-06-21 18:43:18,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1034298.0, ans=0.2 2023-06-21 18:43:41,512 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:44:15,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1034478.0, ans=0.2 2023-06-21 18:44:19,351 INFO [train.py:996] (3/4) Epoch 6, batch 19950, loss[loss=0.2042, simple_loss=0.2636, pruned_loss=0.07237, over 21189.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3104, pruned_loss=0.0816, over 4259993.24 frames. ], batch size: 143, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:44:22,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1034538.0, ans=0.125 2023-06-21 18:44:52,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 18:45:21,027 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:45:47,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.877e+02 3.269e+02 3.817e+02 6.552e+02, threshold=6.538e+02, percent-clipped=1.0 2023-06-21 18:45:53,428 INFO [train.py:996] (3/4) Epoch 6, batch 20000, loss[loss=0.2477, simple_loss=0.3253, pruned_loss=0.08502, over 21871.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3121, pruned_loss=0.08175, over 4259651.56 frames. 
], batch size: 351, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:46:21,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1034898.0, ans=0.125 2023-06-21 18:46:40,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034958.0, ans=0.1 2023-06-21 18:46:42,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1034958.0, ans=0.0 2023-06-21 18:46:51,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1035018.0, ans=0.2 2023-06-21 18:46:57,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1035018.0, ans=0.0 2023-06-21 18:47:05,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035078.0, ans=0.125 2023-06-21 18:47:26,471 INFO [train.py:996] (3/4) Epoch 6, batch 20050, loss[loss=0.2293, simple_loss=0.2997, pruned_loss=0.07949, over 21833.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3147, pruned_loss=0.0847, over 4269053.30 frames. ], batch size: 282, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:47:52,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1035198.0, ans=0.0 2023-06-21 18:48:41,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1035378.0, ans=0.035 2023-06-21 18:48:54,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 2.772e+02 3.158e+02 3.739e+02 6.638e+02, threshold=6.316e+02, percent-clipped=1.0 2023-06-21 18:48:56,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1035378.0, ans=0.125 2023-06-21 18:49:00,941 INFO [train.py:996] (3/4) Epoch 6, batch 20100, loss[loss=0.2637, simple_loss=0.3497, pruned_loss=0.08883, over 21830.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3174, pruned_loss=0.08714, over 4280430.74 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:49:27,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1035498.0, ans=0.2 2023-06-21 18:49:59,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1035618.0, ans=0.0 2023-06-21 18:50:25,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035678.0, ans=0.125 2023-06-21 18:50:25,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1035678.0, ans=0.125 2023-06-21 18:50:26,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.94 vs. limit=22.5 2023-06-21 18:50:45,518 INFO [train.py:996] (3/4) Epoch 6, batch 20150, loss[loss=0.2715, simple_loss=0.3372, pruned_loss=0.1029, over 21665.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.328, pruned_loss=0.09078, over 4279370.57 frames. 
], batch size: 263, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:51:00,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1035798.0, ans=0.0 2023-06-21 18:51:09,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1035798.0, ans=0.125 2023-06-21 18:51:32,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-21 18:52:17,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.447e+02 4.089e+02 4.717e+02 8.481e+02, threshold=8.179e+02, percent-clipped=7.0 2023-06-21 18:52:22,518 INFO [train.py:996] (3/4) Epoch 6, batch 20200, loss[loss=0.2492, simple_loss=0.3326, pruned_loss=0.08294, over 21786.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3337, pruned_loss=0.09356, over 4277811.05 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:53:05,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036158.0, ans=0.1 2023-06-21 18:53:11,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1036158.0, ans=0.0 2023-06-21 18:53:13,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-21 18:54:02,170 INFO [train.py:996] (3/4) Epoch 6, batch 20250, loss[loss=0.2532, simple_loss=0.3127, pruned_loss=0.09687, over 16493.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3336, pruned_loss=0.09131, over 4268655.77 frames. ], batch size: 60, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:54:20,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1036398.0, ans=0.125 2023-06-21 18:54:30,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1036398.0, ans=0.0 2023-06-21 18:55:03,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1036518.0, ans=0.125 2023-06-21 18:55:15,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1036578.0, ans=0.0 2023-06-21 18:55:19,565 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:55:22,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.562e+02 2.955e+02 3.343e+02 6.024e+02, threshold=5.910e+02, percent-clipped=0.0 2023-06-21 18:55:31,046 INFO [train.py:996] (3/4) Epoch 6, batch 20300, loss[loss=0.2426, simple_loss=0.3432, pruned_loss=0.07105, over 21267.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3301, pruned_loss=0.08757, over 4267261.26 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:56:16,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1036758.0, ans=0.2 2023-06-21 18:56:35,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
limit=22.5 2023-06-21 18:56:39,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-21 18:56:59,740 INFO [train.py:996] (3/4) Epoch 6, batch 20350, loss[loss=0.2511, simple_loss=0.3244, pruned_loss=0.08893, over 21690.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3305, pruned_loss=0.08848, over 4263296.57 frames. ], batch size: 389, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:57:13,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1036938.0, ans=0.0 2023-06-21 18:57:19,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1036998.0, ans=0.125 2023-06-21 18:57:19,270 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:57:49,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1037058.0, ans=0.125 2023-06-21 18:58:16,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037178.0, ans=0.125 2023-06-21 18:58:34,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.934e+02 3.391e+02 4.276e+02 6.956e+02, threshold=6.781e+02, percent-clipped=4.0 2023-06-21 18:58:37,444 INFO [train.py:996] (3/4) Epoch 6, batch 20400, loss[loss=0.254, simple_loss=0.3585, pruned_loss=0.07474, over 19801.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3354, pruned_loss=0.09308, over 4266525.68 frames. ], batch size: 704, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:58:39,569 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:58:48,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1037238.0, ans=0.125 2023-06-21 18:59:05,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-21 18:59:42,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037418.0, ans=0.125 2023-06-21 18:59:53,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-21 19:00:05,794 INFO [train.py:996] (3/4) Epoch 6, batch 20450, loss[loss=0.2498, simple_loss=0.3135, pruned_loss=0.09301, over 21485.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3352, pruned_loss=0.09463, over 4245698.48 frames. ], batch size: 194, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:00:22,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1037598.0, ans=0.09899494936611666 2023-06-21 19:00:36,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-21 19:00:37,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1037598.0, ans=0.125 2023-06-21 19:00:39,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-21 19:01:21,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1037778.0, ans=0.0 2023-06-21 19:01:27,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0 2023-06-21 19:01:36,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.567e+02 4.308e+02 5.221e+02 9.242e+02, threshold=8.616e+02, percent-clipped=7.0 2023-06-21 19:01:39,721 INFO [train.py:996] (3/4) Epoch 6, batch 20500, loss[loss=0.2427, simple_loss=0.3043, pruned_loss=0.09053, over 21454.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3304, pruned_loss=0.09487, over 4250506.92 frames. ], batch size: 131, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:02:16,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1037958.0, ans=0.125 2023-06-21 19:02:43,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1038018.0, ans=0.0 2023-06-21 19:02:54,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1038078.0, ans=0.0 2023-06-21 19:03:10,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1038078.0, ans=0.2 2023-06-21 19:03:14,145 INFO [train.py:996] (3/4) Epoch 6, batch 20550, loss[loss=0.2354, simple_loss=0.3218, pruned_loss=0.0745, over 21847.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3234, pruned_loss=0.09314, over 4253218.04 frames. ], batch size: 372, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:03:32,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1038198.0, ans=0.1 2023-06-21 19:03:36,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1038198.0, ans=0.05 2023-06-21 19:03:39,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1038198.0, ans=0.125 2023-06-21 19:03:43,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1038198.0, ans=0.2 2023-06-21 19:04:24,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1038318.0, ans=0.125 2023-06-21 19:04:45,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.022e+02 3.769e+02 4.529e+02 7.328e+02, threshold=7.538e+02, percent-clipped=0.0 2023-06-21 19:04:48,966 INFO [train.py:996] (3/4) Epoch 6, batch 20600, loss[loss=0.2625, simple_loss=0.3128, pruned_loss=0.1061, over 21360.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3243, pruned_loss=0.09051, over 4238422.09 frames. 
], batch size: 176, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:04:55,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1038438.0, ans=0.04949747468305833 2023-06-21 19:05:07,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1038498.0, ans=0.0 2023-06-21 19:05:23,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-21 19:05:46,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038618.0, ans=0.1 2023-06-21 19:06:12,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1038678.0, ans=0.1 2023-06-21 19:06:21,548 INFO [train.py:996] (3/4) Epoch 6, batch 20650, loss[loss=0.2351, simple_loss=0.2865, pruned_loss=0.09182, over 21223.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.321, pruned_loss=0.09126, over 4248513.33 frames. ], batch size: 176, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:06:28,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-21 19:06:41,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038798.0, ans=0.1 2023-06-21 19:06:43,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1038798.0, ans=0.09899494936611666 2023-06-21 19:07:00,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-21 19:07:20,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1038918.0, ans=0.2 2023-06-21 19:07:55,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.764e+02 3.114e+02 3.548e+02 5.059e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-21 19:07:57,288 INFO [train.py:996] (3/4) Epoch 6, batch 20700, loss[loss=0.1981, simple_loss=0.2699, pruned_loss=0.0631, over 21368.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3137, pruned_loss=0.08757, over 4248824.37 frames. ], batch size: 194, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:08:06,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.94 vs. limit=22.5 2023-06-21 19:08:29,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039098.0, ans=0.1 2023-06-21 19:08:42,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1039158.0, ans=0.125 2023-06-21 19:09:07,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1039218.0, ans=0.125 2023-06-21 19:09:37,897 INFO [train.py:996] (3/4) Epoch 6, batch 20750, loss[loss=0.2547, simple_loss=0.3517, pruned_loss=0.07882, over 21697.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.317, pruned_loss=0.08734, over 4253593.65 frames. 
], batch size: 298, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:11:11,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.291e+02 4.399e+02 5.919e+02 1.160e+03, threshold=8.798e+02, percent-clipped=22.0 2023-06-21 19:11:12,894 INFO [train.py:996] (3/4) Epoch 6, batch 20800, loss[loss=0.2419, simple_loss=0.3017, pruned_loss=0.09107, over 21664.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3198, pruned_loss=0.08835, over 4255830.98 frames. ], batch size: 333, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:11:17,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1039638.0, ans=0.025 2023-06-21 19:11:25,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1039638.0, ans=0.125 2023-06-21 19:11:37,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-21 19:12:10,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-21 19:12:33,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1039878.0, ans=0.1 2023-06-21 19:12:39,694 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:12:45,674 INFO [train.py:996] (3/4) Epoch 6, batch 20850, loss[loss=0.1847, simple_loss=0.2505, pruned_loss=0.05941, over 16609.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3108, pruned_loss=0.08525, over 4251391.33 frames. ], batch size: 60, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:13:20,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1039998.0, ans=0.125 2023-06-21 19:13:24,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1040058.0, ans=0.1 2023-06-21 19:13:37,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1040058.0, ans=0.0 2023-06-21 19:13:41,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.89 vs. limit=15.0 2023-06-21 19:14:17,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.782e+02 3.461e+02 4.341e+02 9.177e+02, threshold=6.922e+02, percent-clipped=2.0 2023-06-21 19:14:18,809 INFO [train.py:996] (3/4) Epoch 6, batch 20900, loss[loss=0.232, simple_loss=0.304, pruned_loss=0.08003, over 21669.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3123, pruned_loss=0.08664, over 4256401.47 frames. 
], batch size: 263, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:14:19,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1040238.0, ans=0.0 2023-06-21 19:14:22,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1040238.0, ans=0.0 2023-06-21 19:15:01,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1040358.0, ans=0.125 2023-06-21 19:15:33,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1040418.0, ans=0.0 2023-06-21 19:15:33,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1040418.0, ans=0.1 2023-06-21 19:15:35,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-21 19:15:44,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1040478.0, ans=0.125 2023-06-21 19:15:46,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1040478.0, ans=0.0 2023-06-21 19:15:51,577 INFO [train.py:996] (3/4) Epoch 6, batch 20950, loss[loss=0.2454, simple_loss=0.3117, pruned_loss=0.08949, over 21391.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3092, pruned_loss=0.08329, over 4263859.80 frames. ], batch size: 471, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:16:00,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1040538.0, ans=0.0 2023-06-21 19:16:13,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1040598.0, ans=0.125 2023-06-21 19:16:39,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1040718.0, ans=0.1 2023-06-21 19:16:41,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=12.0 2023-06-21 19:17:11,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1040778.0, ans=0.0 2023-06-21 19:17:17,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-21 19:17:17,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.529e+02 2.877e+02 3.319e+02 6.338e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-21 19:17:19,463 INFO [train.py:996] (3/4) Epoch 6, batch 21000, loss[loss=0.1779, simple_loss=0.2435, pruned_loss=0.05611, over 17030.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3079, pruned_loss=0.08321, over 4271528.78 frames. ], batch size: 66, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:17:19,464 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-21 19:17:33,947 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.8423, 2.5857, 1.4022, 1.5840], device='cuda:3') 2023-06-21 19:17:35,754 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2688, simple_loss=0.3681, pruned_loss=0.08473, over 1796401.00 frames. 
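A note on the `[optim.py:471]` lines above: in every one of them the printed `threshold` is exactly `Clipping_scale` (2.0) times the middle value of the grad-norm quartiles, e.g. `2.0 * 3.769e+02 = 7.538e+02` at 19:04:45 and `2.0 * 4.308e+02 = 8.616e+02` at 19:01:36, so the clipping threshold appears to track a running median of recent gradient norms. The sketch below reproduces that bookkeeping; it is an illustration inferred from the log, not the actual icefall `optim.py` implementation, and the five-value tensor is a toy stand-in for whatever window of recent per-batch gradient norms the optimizer actually keeps.

```python
import torch

def clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Summarize recent gradient norms the way the optim.py log lines do.

    Returns the (min, 25%, 50%, 75%, max) quartiles, the implied clipping
    threshold (clipping_scale times the median), and the percentage of
    batches whose norm exceeded that threshold ("percent-clipped").
    """
    q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]                      # 2.0 * median
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return q, threshold, percent_clipped

# Toy stand-in: feed the quartile values logged at 19:04:45 back in as if
# they were the recent per-batch norms.
norms = torch.tensor([240.2, 302.2, 376.9, 452.9, 732.8])
q, threshold, pct = clipping_stats(norms)
print(q)          # tensor([240.2000, 302.2000, 376.9000, 452.9000, 732.8000])
print(threshold)  # ~753.8 -> matches the logged threshold=7.538e+02
print(pct)        # 0.0    -> matches the logged percent-clipped=0.0
```

On this reading, `percent-clipped=7.0` at 19:01:36 would mean that 7% of the batches in that reporting window had gradient norms above 2.0 times the running median.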
2023-06-21 19:17:35,755 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 19:18:20,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1040958.0, ans=0.125 2023-06-21 19:18:20,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1040958.0, ans=0.125 2023-06-21 19:18:41,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1041018.0, ans=0.025 2023-06-21 19:18:57,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-21 19:19:01,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1041078.0, ans=0.0 2023-06-21 19:19:08,162 INFO [train.py:996] (3/4) Epoch 6, batch 21050, loss[loss=0.2198, simple_loss=0.2823, pruned_loss=0.07867, over 21222.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3057, pruned_loss=0.08394, over 4276895.87 frames. ], batch size: 548, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:19:26,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-21 19:19:38,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1041198.0, ans=10.0 2023-06-21 19:19:40,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-21 19:20:21,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1041378.0, ans=0.125 2023-06-21 19:20:34,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.795e+02 3.116e+02 3.832e+02 6.783e+02, threshold=6.232e+02, percent-clipped=3.0 2023-06-21 19:20:36,441 INFO [train.py:996] (3/4) Epoch 6, batch 21100, loss[loss=0.1999, simple_loss=0.27, pruned_loss=0.06495, over 21725.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3015, pruned_loss=0.08261, over 4272919.51 frames. ], batch size: 112, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:20:44,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1041438.0, ans=0.125 2023-06-21 19:20:45,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1041438.0, ans=0.025 2023-06-21 19:21:17,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-21 19:22:10,049 INFO [train.py:996] (3/4) Epoch 6, batch 21150, loss[loss=0.2106, simple_loss=0.2868, pruned_loss=0.06722, over 15237.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.297, pruned_loss=0.08274, over 4260549.58 frames. ], batch size: 60, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:22:26,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. 
limit=15.0 2023-06-21 19:22:52,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1041858.0, ans=0.0 2023-06-21 19:23:22,763 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:23:34,783 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:23:37,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1041978.0, ans=0.125 2023-06-21 19:23:41,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.856e+02 3.274e+02 4.026e+02 6.885e+02, threshold=6.548e+02, percent-clipped=2.0 2023-06-21 19:23:43,128 INFO [train.py:996] (3/4) Epoch 6, batch 21200, loss[loss=0.177, simple_loss=0.2441, pruned_loss=0.05497, over 20731.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2933, pruned_loss=0.08154, over 4251723.93 frames. ], batch size: 608, lr: 5.01e-03, grad_scale: 32.0 2023-06-21 19:23:43,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1042038.0, ans=0.125 2023-06-21 19:23:53,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=8.0 2023-06-21 19:25:12,305 INFO [train.py:996] (3/4) Epoch 6, batch 21250, loss[loss=0.2422, simple_loss=0.2939, pruned_loss=0.09531, over 21697.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2918, pruned_loss=0.08186, over 4239955.24 frames. ], batch size: 124, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:25:17,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1042338.0, ans=0.125 2023-06-21 19:26:15,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1042518.0, ans=0.125 2023-06-21 19:26:41,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.008e+02 3.484e+02 4.132e+02 8.300e+02, threshold=6.969e+02, percent-clipped=3.0 2023-06-21 19:26:41,174 INFO [train.py:996] (3/4) Epoch 6, batch 21300, loss[loss=0.244, simple_loss=0.3162, pruned_loss=0.08592, over 21867.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.2995, pruned_loss=0.08433, over 4253242.74 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:26:41,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1042638.0, ans=0.125 2023-06-21 19:27:18,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1042698.0, ans=0.2 2023-06-21 19:27:55,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1042878.0, ans=0.1 2023-06-21 19:28:14,538 INFO [train.py:996] (3/4) Epoch 6, batch 21350, loss[loss=0.2132, simple_loss=0.308, pruned_loss=0.05915, over 21831.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3054, pruned_loss=0.08544, over 4270529.03 frames. 
], batch size: 316, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:28:54,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1043058.0, ans=0.0 2023-06-21 19:29:16,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1043118.0, ans=0.1 2023-06-21 19:29:49,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.778e+02 3.087e+02 3.779e+02 5.135e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 19:29:49,119 INFO [train.py:996] (3/4) Epoch 6, batch 21400, loss[loss=0.285, simple_loss=0.3542, pruned_loss=0.1079, over 21766.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3091, pruned_loss=0.08504, over 4275838.32 frames. ], batch size: 441, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:30:08,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1043298.0, ans=0.1 2023-06-21 19:30:17,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1043298.0, ans=0.125 2023-06-21 19:31:11,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1043478.0, ans=0.125 2023-06-21 19:31:22,529 INFO [train.py:996] (3/4) Epoch 6, batch 21450, loss[loss=0.2686, simple_loss=0.3273, pruned_loss=0.1049, over 21719.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3132, pruned_loss=0.08761, over 4281269.16 frames. ], batch size: 473, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:31:32,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-21 19:31:58,911 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:32:08,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-21 19:32:44,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1043778.0, ans=0.0 2023-06-21 19:32:44,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-21 19:32:55,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.827e+02 3.165e+02 3.718e+02 5.694e+02, threshold=6.329e+02, percent-clipped=0.0 2023-06-21 19:32:55,686 INFO [train.py:996] (3/4) Epoch 6, batch 21500, loss[loss=0.2353, simple_loss=0.2949, pruned_loss=0.08784, over 21296.00 frames. ], tot_loss[loss=0.244, simple_loss=0.311, pruned_loss=0.08848, over 4281943.88 frames. ], batch size: 144, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:33:04,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=15.0 2023-06-21 19:33:51,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043958.0, ans=0.1 2023-06-21 19:34:25,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1044078.0, ans=0.0 2023-06-21 19:34:29,720 INFO [train.py:996] (3/4) Epoch 6, batch 21550, loss[loss=0.1891, simple_loss=0.25, pruned_loss=0.06408, over 21264.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3041, pruned_loss=0.08574, over 4279606.76 frames. ], batch size: 159, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:35:18,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.56 vs. limit=22.5 2023-06-21 19:35:22,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1044258.0, ans=0.125 2023-06-21 19:35:26,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1044258.0, ans=0.04949747468305833 2023-06-21 19:35:35,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044318.0, ans=0.1 2023-06-21 19:35:49,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1044378.0, ans=0.0 2023-06-21 19:35:59,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-21 19:36:04,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.859e+02 3.362e+02 4.302e+02 8.120e+02, threshold=6.725e+02, percent-clipped=2.0 2023-06-21 19:36:04,932 INFO [train.py:996] (3/4) Epoch 6, batch 21600, loss[loss=0.2158, simple_loss=0.3113, pruned_loss=0.06017, over 21897.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3005, pruned_loss=0.08415, over 4278251.99 frames. ], batch size: 372, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:36:40,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-21 19:37:21,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1044618.0, ans=0.0 2023-06-21 19:37:33,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1044678.0, ans=0.2 2023-06-21 19:37:39,253 INFO [train.py:996] (3/4) Epoch 6, batch 21650, loss[loss=0.1938, simple_loss=0.2562, pruned_loss=0.06567, over 20718.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3041, pruned_loss=0.08159, over 4273902.82 frames. ], batch size: 607, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:37:55,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-21 19:38:15,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=15.0 2023-06-21 19:38:48,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. 
limit=12.0 2023-06-21 19:38:57,961 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:39:00,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1044978.0, ans=0.95 2023-06-21 19:39:03,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1044978.0, ans=0.125 2023-06-21 19:39:13,276 INFO [train.py:996] (3/4) Epoch 6, batch 21700, loss[loss=0.2257, simple_loss=0.2813, pruned_loss=0.08501, over 21302.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3038, pruned_loss=0.07891, over 4277616.39 frames. ], batch size: 144, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:39:14,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.795e+02 3.316e+02 4.085e+02 7.380e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-21 19:40:06,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-21 19:40:27,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-21 19:40:32,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1045278.0, ans=0.125 2023-06-21 19:40:34,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1045278.0, ans=0.5 2023-06-21 19:40:39,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-21 19:40:45,985 INFO [train.py:996] (3/4) Epoch 6, batch 21750, loss[loss=0.2352, simple_loss=0.2938, pruned_loss=0.08823, over 21821.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2999, pruned_loss=0.07936, over 4273363.10 frames. ], batch size: 107, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:41:30,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1045458.0, ans=0.125 2023-06-21 19:42:00,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1045518.0, ans=0.2 2023-06-21 19:42:10,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-21 19:42:19,841 INFO [train.py:996] (3/4) Epoch 6, batch 21800, loss[loss=0.2904, simple_loss=0.3652, pruned_loss=0.1078, over 21654.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2982, pruned_loss=0.08073, over 4272737.60 frames. 
], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:42:21,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.706e+02 3.025e+02 3.711e+02 5.699e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-21 19:42:35,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1045638.0, ans=0.0 2023-06-21 19:42:57,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1045698.0, ans=0.125 2023-06-21 19:43:15,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-21 19:43:53,867 INFO [train.py:996] (3/4) Epoch 6, batch 21850, loss[loss=0.2945, simple_loss=0.3575, pruned_loss=0.1158, over 21741.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3025, pruned_loss=0.08151, over 4274281.21 frames. ], batch size: 441, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:44:44,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046058.0, ans=0.125 2023-06-21 19:45:24,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-21 19:45:26,629 INFO [train.py:996] (3/4) Epoch 6, batch 21900, loss[loss=0.2119, simple_loss=0.2786, pruned_loss=0.07265, over 21708.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3051, pruned_loss=0.0825, over 4280146.02 frames. ], batch size: 264, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:45:26,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1046238.0, ans=0.0 2023-06-21 19:45:28,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.966e+02 3.405e+02 4.071e+02 6.584e+02, threshold=6.811e+02, percent-clipped=2.0 2023-06-21 19:46:13,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-21 19:47:00,487 INFO [train.py:996] (3/4) Epoch 6, batch 21950, loss[loss=0.175, simple_loss=0.245, pruned_loss=0.05253, over 21478.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3, pruned_loss=0.08206, over 4282214.35 frames. ], batch size: 195, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:47:00,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046538.0, ans=0.1 2023-06-21 19:47:22,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1046538.0, ans=0.0 2023-06-21 19:47:40,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=22.5 2023-06-21 19:47:55,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1046658.0, ans=0.0 2023-06-21 19:47:55,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046658.0, ans=0.1 2023-06-21 19:48:13,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1046778.0, ans=0.0 2023-06-21 19:48:20,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1046778.0, ans=0.2 2023-06-21 19:48:25,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1046778.0, ans=0.05 2023-06-21 19:48:34,352 INFO [train.py:996] (3/4) Epoch 6, batch 22000, loss[loss=0.1811, simple_loss=0.2462, pruned_loss=0.058, over 21571.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2933, pruned_loss=0.07835, over 4279914.24 frames. ], batch size: 263, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:48:40,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.414e+02 2.927e+02 3.631e+02 6.492e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 19:49:05,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1046898.0, ans=0.0 2023-06-21 19:49:13,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1046898.0, ans=0.125 2023-06-21 19:49:13,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1046898.0, ans=0.125 2023-06-21 19:49:37,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:50:14,051 INFO [train.py:996] (3/4) Epoch 6, batch 22050, loss[loss=0.3041, simple_loss=0.4057, pruned_loss=0.1012, over 19947.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3, pruned_loss=0.08066, over 4269863.98 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:51:34,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1047378.0, ans=0.125 2023-06-21 19:51:43,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-21 19:51:48,060 INFO [train.py:996] (3/4) Epoch 6, batch 22100, loss[loss=0.2771, simple_loss=0.3329, pruned_loss=0.1107, over 21640.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3112, pruned_loss=0.08588, over 4257771.49 frames. 
], batch size: 263, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:51:51,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 3.410e+02 3.908e+02 4.704e+02 7.568e+02, threshold=7.817e+02, percent-clipped=7.0 2023-06-21 19:52:18,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1047498.0, ans=0.125 2023-06-21 19:52:36,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1047558.0, ans=0.2 2023-06-21 19:52:38,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1047558.0, ans=0.0 2023-06-21 19:52:53,694 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:53:07,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.58 vs. limit=22.5 2023-06-21 19:53:21,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1047738.0, ans=0.125 2023-06-21 19:53:22,054 INFO [train.py:996] (3/4) Epoch 6, batch 22150, loss[loss=0.2982, simple_loss=0.3568, pruned_loss=0.1198, over 20716.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3137, pruned_loss=0.08767, over 4264706.46 frames. ], batch size: 607, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:54:46,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 19:55:00,792 INFO [train.py:996] (3/4) Epoch 6, batch 22200, loss[loss=0.3548, simple_loss=0.4592, pruned_loss=0.1252, over 19776.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3159, pruned_loss=0.08912, over 4275050.60 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:55:08,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 3.160e+02 3.693e+02 4.495e+02 7.335e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-21 19:55:12,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-21 19:56:02,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1048218.0, ans=0.2 2023-06-21 19:56:34,370 INFO [train.py:996] (3/4) Epoch 6, batch 22250, loss[loss=0.2959, simple_loss=0.3694, pruned_loss=0.1112, over 21623.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3243, pruned_loss=0.09089, over 4279837.99 frames. ], batch size: 414, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:56:51,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-21 19:56:57,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1048398.0, ans=0.125 2023-06-21 19:57:20,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. 
limit=12.0 2023-06-21 19:57:32,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1048518.0, ans=0.95 2023-06-21 19:57:46,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1048578.0, ans=0.1 2023-06-21 19:57:57,994 INFO [train.py:996] (3/4) Epoch 6, batch 22300, loss[loss=0.252, simple_loss=0.322, pruned_loss=0.09099, over 21423.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3264, pruned_loss=0.09309, over 4274800.94 frames. ], batch size: 131, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:58:04,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1048638.0, ans=0.125 2023-06-21 19:58:05,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.056e+02 3.498e+02 3.964e+02 6.113e+02, threshold=6.996e+02, percent-clipped=0.0 2023-06-21 19:58:31,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1048758.0, ans=0.0 2023-06-21 19:59:11,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1048878.0, ans=0.1 2023-06-21 19:59:27,800 INFO [train.py:996] (3/4) Epoch 6, batch 22350, loss[loss=0.2921, simple_loss=0.3475, pruned_loss=0.1183, over 21740.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3233, pruned_loss=0.09302, over 4281495.79 frames. ], batch size: 441, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:59:36,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.62 vs. limit=15.0 2023-06-21 19:59:37,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1048938.0, ans=0.125 2023-06-21 19:59:43,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1048998.0, ans=0.125 2023-06-21 19:59:53,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1048998.0, ans=0.0 2023-06-21 20:00:01,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049058.0, ans=0.1 2023-06-21 20:00:02,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1049058.0, ans=0.0 2023-06-21 20:00:02,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1049058.0, ans=0.125 2023-06-21 20:00:43,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1049178.0, ans=0.2 2023-06-21 20:01:01,231 INFO [train.py:996] (3/4) Epoch 6, batch 22400, loss[loss=0.2357, simple_loss=0.2979, pruned_loss=0.08678, over 21391.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3185, pruned_loss=0.08915, over 4273869.11 frames. 
], batch size: 177, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:01:04,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.868e+02 3.552e+02 4.171e+02 5.869e+02, threshold=7.104e+02, percent-clipped=0.0 2023-06-21 20:01:16,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1049298.0, ans=0.015 2023-06-21 20:01:29,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-21 20:02:03,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-21 20:02:29,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1049478.0, ans=0.125 2023-06-21 20:02:32,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 20:02:34,800 INFO [train.py:996] (3/4) Epoch 6, batch 22450, loss[loss=0.2176, simple_loss=0.2816, pruned_loss=0.07682, over 21560.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3135, pruned_loss=0.08821, over 4273792.53 frames. ], batch size: 415, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:02:45,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1049538.0, ans=0.0 2023-06-21 20:02:51,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1049598.0, ans=0.2 2023-06-21 20:02:55,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-21 20:02:55,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0 2023-06-21 20:02:56,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1049598.0, ans=0.1 2023-06-21 20:03:05,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1049658.0, ans=0.125 2023-06-21 20:03:06,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1049658.0, ans=0.0 2023-06-21 20:03:08,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1049658.0, ans=0.95 2023-06-21 20:03:11,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1049658.0, ans=0.04949747468305833 2023-06-21 20:03:46,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. 
limit=15.0 2023-06-21 20:03:56,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1049778.0, ans=0.125 2023-06-21 20:04:00,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1049778.0, ans=0.0 2023-06-21 20:04:08,820 INFO [train.py:996] (3/4) Epoch 6, batch 22500, loss[loss=0.2712, simple_loss=0.3565, pruned_loss=0.09293, over 21612.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3076, pruned_loss=0.08693, over 4279130.34 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:04:11,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.833e+02 3.380e+02 4.088e+02 7.765e+02, threshold=6.760e+02, percent-clipped=2.0 2023-06-21 20:04:51,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1049958.0, ans=0.125 2023-06-21 20:05:08,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1050018.0, ans=0.2 2023-06-21 20:05:26,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1050078.0, ans=0.125 2023-06-21 20:05:42,955 INFO [train.py:996] (3/4) Epoch 6, batch 22550, loss[loss=0.2466, simple_loss=0.3238, pruned_loss=0.08474, over 21658.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3121, pruned_loss=0.0877, over 4285889.30 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:05:44,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1050138.0, ans=0.025 2023-06-21 20:05:54,248 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:06:08,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-21 20:06:28,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1050258.0, ans=0.2 2023-06-21 20:07:18,455 INFO [train.py:996] (3/4) Epoch 6, batch 22600, loss[loss=0.2149, simple_loss=0.2663, pruned_loss=0.0817, over 21271.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3157, pruned_loss=0.08932, over 4286071.52 frames. ], batch size: 176, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:07:21,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.122e+02 3.804e+02 4.633e+02 7.875e+02, threshold=7.609e+02, percent-clipped=4.0 2023-06-21 20:07:36,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1050498.0, ans=0.035 2023-06-21 20:07:36,597 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:08:26,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-21 20:08:47,120 INFO [train.py:996] (3/4) Epoch 6, batch 22650, loss[loss=0.2311, simple_loss=0.2854, pruned_loss=0.08837, over 21136.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.312, pruned_loss=0.0887, over 4280785.95 frames. 
], batch size: 159, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:08:54,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1050738.0, ans=0.125 2023-06-21 20:09:26,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1050858.0, ans=0.0 2023-06-21 20:09:57,618 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:10:07,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050978.0, ans=0.1 2023-06-21 20:10:19,381 INFO [train.py:996] (3/4) Epoch 6, batch 22700, loss[loss=0.2632, simple_loss=0.2967, pruned_loss=0.1148, over 21517.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3058, pruned_loss=0.08801, over 4269217.69 frames. ], batch size: 512, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:10:23,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.096e+02 3.667e+02 4.331e+02 7.482e+02, threshold=7.334e+02, percent-clipped=0.0 2023-06-21 20:10:59,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051158.0, ans=0.1 2023-06-21 20:11:02,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=12.0 2023-06-21 20:11:04,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1051158.0, ans=0.125 2023-06-21 20:11:14,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1051158.0, ans=0.125 2023-06-21 20:11:26,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-21 20:11:33,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=15.0 2023-06-21 20:11:35,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-21 20:11:39,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-21 20:11:53,635 INFO [train.py:996] (3/4) Epoch 6, batch 22750, loss[loss=0.267, simple_loss=0.3328, pruned_loss=0.1006, over 21758.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3072, pruned_loss=0.08966, over 4275336.60 frames. ], batch size: 113, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:11:55,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1051338.0, ans=0.125 2023-06-21 20:11:58,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051338.0, ans=0.125 2023-06-21 20:12:35,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=15.0 2023-06-21 20:13:00,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1051518.0, ans=0.125 2023-06-21 20:13:22,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1051578.0, ans=0.0 2023-06-21 20:13:26,410 INFO [train.py:996] (3/4) Epoch 6, batch 22800, loss[loss=0.257, simple_loss=0.3212, pruned_loss=0.09635, over 21482.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3106, pruned_loss=0.09222, over 4282489.40 frames. ], batch size: 177, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:13:30,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.968e+02 3.368e+02 3.965e+02 6.508e+02, threshold=6.737e+02, percent-clipped=0.0 2023-06-21 20:13:55,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1051698.0, ans=0.125 2023-06-21 20:14:49,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1051878.0, ans=0.0 2023-06-21 20:14:59,377 INFO [train.py:996] (3/4) Epoch 6, batch 22850, loss[loss=0.2468, simple_loss=0.3022, pruned_loss=0.09576, over 21746.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3058, pruned_loss=0.09043, over 4281863.25 frames. ], batch size: 351, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:14:59,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1051938.0, ans=0.0 2023-06-21 20:16:29,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-21 20:16:34,406 INFO [train.py:996] (3/4) Epoch 6, batch 22900, loss[loss=0.2581, simple_loss=0.3497, pruned_loss=0.08318, over 21611.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3103, pruned_loss=0.08954, over 4282795.18 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:16:36,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1052238.0, ans=0.125 2023-06-21 20:16:39,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.845e+02 3.273e+02 3.939e+02 6.144e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-21 20:17:26,437 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:18:15,287 INFO [train.py:996] (3/4) Epoch 6, batch 22950, loss[loss=0.236, simple_loss=0.3494, pruned_loss=0.06127, over 21374.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3247, pruned_loss=0.08692, over 4274811.63 frames. ], batch size: 211, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:18:20,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. 
limit=15.0 2023-06-21 20:18:47,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1052598.0, ans=0.0 2023-06-21 20:19:08,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1052658.0, ans=0.0 2023-06-21 20:19:13,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1052718.0, ans=0.125 2023-06-21 20:19:19,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-21 20:19:31,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-21 20:19:49,056 INFO [train.py:996] (3/4) Epoch 6, batch 23000, loss[loss=0.2471, simple_loss=0.3174, pruned_loss=0.08839, over 21640.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3231, pruned_loss=0.08551, over 4280172.70 frames. ], batch size: 263, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:19:53,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.747e+02 3.155e+02 3.821e+02 7.452e+02, threshold=6.310e+02, percent-clipped=2.0 2023-06-21 20:21:19,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-21 20:21:29,272 INFO [train.py:996] (3/4) Epoch 6, batch 23050, loss[loss=0.2295, simple_loss=0.3045, pruned_loss=0.07726, over 21465.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3244, pruned_loss=0.08755, over 4280681.48 frames. ], batch size: 194, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:22:10,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053258.0, ans=0.1 2023-06-21 20:23:02,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-21 20:23:02,818 INFO [train.py:996] (3/4) Epoch 6, batch 23100, loss[loss=0.2329, simple_loss=0.2919, pruned_loss=0.08697, over 21118.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3192, pruned_loss=0.088, over 4275647.80 frames. ], batch size: 159, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:23:07,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.234e+02 3.747e+02 4.482e+02 8.068e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-21 20:23:09,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-21 20:23:11,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=15.0
2023-06-21 20:23:16,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1053498.0, ans=0.0
2023-06-21 20:23:34,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1053558.0, ans=0.125
2023-06-21 20:23:43,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1053558.0, ans=0.125
2023-06-21 20:23:48,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0
2023-06-21 20:24:35,496 INFO [train.py:996] (3/4) Epoch 6, batch 23150, loss[loss=0.2468, simple_loss=0.3149, pruned_loss=0.08938, over 21620.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3132, pruned_loss=0.08735, over 4281993.26 frames. ], batch size: 389, lr: 4.98e-03, grad_scale: 16.0
2023-06-21 20:24:59,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0
2023-06-21 20:25:03,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1053858.0, ans=0.125
2023-06-21 20:25:15,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1053858.0, ans=22.5
2023-06-21 20:25:58,116 INFO [train.py:996] (3/4) Epoch 6, batch 23200, loss[loss=0.2707, simple_loss=0.3344, pruned_loss=0.1035, over 21631.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3119, pruned_loss=0.08809, over 4289116.45 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:26:13,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.774e+02 3.196e+02 3.706e+02 6.362e+02, threshold=6.391e+02, percent-clipped=0.0
2023-06-21 20:26:27,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1054098.0, ans=0.125
2023-06-21 20:26:30,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1054098.0, ans=0.125
2023-06-21 20:26:47,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1054158.0, ans=0.0
2023-06-21 20:27:30,816 INFO [train.py:996] (3/4) Epoch 6, batch 23250, loss[loss=0.2309, simple_loss=0.2942, pruned_loss=0.08385, over 21579.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3127, pruned_loss=0.08939, over 4284752.41 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:28:45,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0
2023-06-21 20:29:05,991 INFO [train.py:996] (3/4) Epoch 6, batch 23300, loss[loss=0.2441, simple_loss=0.2944, pruned_loss=0.09692, over 20122.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3222, pruned_loss=0.09188, over 4286333.44 frames. ], batch size: 703, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:29:12,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.961e+02 3.509e+02 4.048e+02 6.618e+02, threshold=7.018e+02, percent-clipped=1.0
2023-06-21 20:29:43,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5
2023-06-21 20:30:12,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1054818.0, ans=0.0
2023-06-21 20:30:28,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1054878.0, ans=0.1
2023-06-21 20:30:33,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1054878.0, ans=0.0
2023-06-21 20:30:40,392 INFO [train.py:996] (3/4) Epoch 6, batch 23350, loss[loss=0.247, simple_loss=0.3205, pruned_loss=0.08675, over 20000.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3258, pruned_loss=0.09033, over 4272146.09 frames. ], batch size: 702, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:30:53,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1054938.0, ans=0.2
2023-06-21 20:30:57,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1054998.0, ans=0.125
2023-06-21 20:31:03,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054998.0, ans=0.1
2023-06-21 20:31:08,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1054998.0, ans=0.0
2023-06-21 20:31:33,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.15 vs. limit=10.0
2023-06-21 20:31:53,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1055178.0, ans=0.125
2023-06-21 20:32:09,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0
2023-06-21 20:32:13,198 INFO [train.py:996] (3/4) Epoch 6, batch 23400, loss[loss=0.2298, simple_loss=0.2964, pruned_loss=0.0816, over 21682.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3191, pruned_loss=0.08716, over 4268404.85 frames. ], batch size: 263, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:32:18,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.966e+02 3.517e+02 4.346e+02 6.933e+02, threshold=7.034e+02, percent-clipped=0.0
2023-06-21 20:32:30,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1055298.0, ans=0.125
2023-06-21 20:32:31,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1055298.0, ans=0.125
2023-06-21 20:32:36,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1055298.0, ans=0.125
2023-06-21 20:32:36,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0
2023-06-21 20:33:47,364 INFO [train.py:996] (3/4) Epoch 6, batch 23450, loss[loss=0.2477, simple_loss=0.3128, pruned_loss=0.09137, over 21330.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3208, pruned_loss=0.09036, over 4278604.02 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:33:53,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1055538.0, ans=10.0
2023-06-21 20:34:02,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1055598.0, ans=0.2
2023-06-21 20:34:13,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1055598.0, ans=0.125
2023-06-21 20:34:34,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1055658.0, ans=0.125
2023-06-21 20:35:13,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0
2023-06-21 20:35:20,273 INFO [train.py:996] (3/4) Epoch 6, batch 23500, loss[loss=0.2493, simple_loss=0.3139, pruned_loss=0.09238, over 21805.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3207, pruned_loss=0.09225, over 4280831.68 frames. ], batch size: 414, lr: 4.98e-03, grad_scale: 16.0
2023-06-21 20:35:27,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 2.940e+02 3.315e+02 3.870e+02 5.953e+02, threshold=6.630e+02, percent-clipped=0.0
2023-06-21 20:36:53,698 INFO [train.py:996] (3/4) Epoch 6, batch 23550, loss[loss=0.255, simple_loss=0.2996, pruned_loss=0.1052, over 21647.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3146, pruned_loss=0.09146, over 4272615.42 frames. ], batch size: 416, lr: 4.98e-03, grad_scale: 16.0
2023-06-21 20:38:20,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1056378.0, ans=0.07
2023-06-21 20:38:27,630 INFO [train.py:996] (3/4) Epoch 6, batch 23600, loss[loss=0.2908, simple_loss=0.3526, pruned_loss=0.1145, over 21373.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3148, pruned_loss=0.09096, over 4274087.68 frames. ], batch size: 549, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:38:34,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.807e+02 3.254e+02 4.113e+02 6.430e+02, threshold=6.509e+02, percent-clipped=0.0
2023-06-21 20:38:41,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1056498.0, ans=0.125
2023-06-21 20:39:17,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1056558.0, ans=0.125
2023-06-21 20:39:23,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1056558.0, ans=0.125
2023-06-21 20:39:49,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1056678.0, ans=0.0
2023-06-21 20:40:01,261 INFO [train.py:996] (3/4) Epoch 6, batch 23650, loss[loss=0.2859, simple_loss=0.3556, pruned_loss=0.1081, over 21842.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3166, pruned_loss=0.0892, over 4274499.27 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 32.0
2023-06-21 20:41:15,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0
2023-06-21 20:41:21,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1056978.0, ans=0.125
2023-06-21 20:41:39,601 INFO [train.py:996] (3/4) Epoch 6, batch 23700, loss[loss=0.2598, simple_loss=0.3303, pruned_loss=0.09463, over 21794.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3181, pruned_loss=0.08875, over 4271565.24 frames. ], batch size: 124, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:41:51,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.889e+02 3.360e+02 4.132e+02 7.517e+02, threshold=6.720e+02, percent-clipped=1.0
2023-06-21 20:42:09,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1057098.0, ans=0.035
2023-06-21 20:42:14,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057098.0, ans=0.1
2023-06-21 20:42:42,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1057218.0, ans=0.125
2023-06-21 20:42:44,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5
2023-06-21 20:42:56,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1057278.0, ans=0.0
2023-06-21 20:43:20,234 INFO [train.py:996] (3/4) Epoch 6, batch 23750, loss[loss=0.2149, simple_loss=0.3085, pruned_loss=0.06065, over 21738.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3225, pruned_loss=0.08935, over 4271658.59 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:43:23,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1057338.0, ans=0.2
2023-06-21 20:44:12,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1057458.0, ans=0.0
2023-06-21 20:44:31,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2023-06-21 20:44:47,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1057578.0, ans=0.5
2023-06-21 20:44:55,715 INFO [train.py:996] (3/4) Epoch 6, batch 23800, loss[loss=0.2585, simple_loss=0.3301, pruned_loss=0.09343, over 21432.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3199, pruned_loss=0.08695, over 4268239.31 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:45:03,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.613e+02 2.976e+02 3.389e+02 5.789e+02, threshold=5.953e+02, percent-clipped=0.0
2023-06-21 20:45:05,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1057638.0, ans=0.125
2023-06-21 20:45:15,127 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 20:45:29,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=22.5
2023-06-21 20:46:06,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1057878.0, ans=10.0
2023-06-21 20:46:30,952 INFO [train.py:996] (3/4) Epoch 6, batch 23850, loss[loss=0.2801, simple_loss=0.3509, pruned_loss=0.1046, over 21594.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3277, pruned_loss=0.08884, over 4262009.61 frames. ], batch size: 389, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:46:37,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0
2023-06-21 20:47:41,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1058178.0, ans=0.2
2023-06-21 20:48:00,279 INFO [train.py:996] (3/4) Epoch 6, batch 23900, loss[loss=0.2624, simple_loss=0.3374, pruned_loss=0.09366, over 16320.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3352, pruned_loss=0.09161, over 4247647.94 frames. ], batch size: 60, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:48:07,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.320e+02 3.834e+02 4.673e+02 6.802e+02, threshold=7.669e+02, percent-clipped=5.0
2023-06-21 20:48:26,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1058298.0, ans=0.2
2023-06-21 20:49:33,463 INFO [train.py:996] (3/4) Epoch 6, batch 23950, loss[loss=0.258, simple_loss=0.3179, pruned_loss=0.09906, over 21744.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3275, pruned_loss=0.09078, over 4246265.98 frames. ], batch size: 282, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:49:57,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1058598.0, ans=0.0
2023-06-21 20:50:36,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058718.0, ans=0.1
2023-06-21 20:51:07,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2023-06-21 20:51:08,212 INFO [train.py:996] (3/4) Epoch 6, batch 24000, loss[loss=0.295, simple_loss=0.3803, pruned_loss=0.1049, over 21780.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3296, pruned_loss=0.09428, over 4251190.33 frames. ], batch size: 118, lr: 4.97e-03, grad_scale: 32.0
2023-06-21 20:51:08,213 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 20:51:24,743 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2687, simple_loss=0.3663, pruned_loss=0.08552, over 1796401.00 frames.
2023-06-21 20:51:24,744 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-21 20:51:32,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.640e+02 3.190e+02 3.718e+02 4.654e+02 6.990e+02, threshold=7.435e+02, percent-clipped=0.0 2023-06-21 20:51:55,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1058898.0, ans=0.0 2023-06-21 20:52:18,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1058958.0, ans=0.07 2023-06-21 20:52:21,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1058958.0, ans=0.0 2023-06-21 20:52:38,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1059018.0, ans=0.0 2023-06-21 20:52:40,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-21 20:52:49,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1059078.0, ans=0.0 2023-06-21 20:52:52,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1059078.0, ans=0.025 2023-06-21 20:52:58,908 INFO [train.py:996] (3/4) Epoch 6, batch 24050, loss[loss=0.2059, simple_loss=0.2921, pruned_loss=0.05982, over 21458.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3308, pruned_loss=0.09455, over 4255511.38 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:53:01,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-21 20:53:37,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1059198.0, ans=0.125 2023-06-21 20:54:06,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1059318.0, ans=0.0 2023-06-21 20:54:33,341 INFO [train.py:996] (3/4) Epoch 6, batch 24100, loss[loss=0.2895, simple_loss=0.3722, pruned_loss=0.1034, over 21731.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3306, pruned_loss=0.09208, over 4260923.60 frames. 
], batch size: 441, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:54:40,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.753e+02 3.093e+02 3.531e+02 5.265e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-21 20:54:46,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059438.0, ans=0.1 2023-06-21 20:55:07,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059498.0, ans=0.1 2023-06-21 20:55:33,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1059558.0, ans=0.0 2023-06-21 20:55:48,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1059618.0, ans=0.125 2023-06-21 20:55:50,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-21 20:56:07,343 INFO [train.py:996] (3/4) Epoch 6, batch 24150, loss[loss=0.3097, simple_loss=0.3553, pruned_loss=0.1321, over 21629.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3296, pruned_loss=0.09374, over 4269859.55 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:56:14,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=12.0 2023-06-21 20:56:50,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059858.0, ans=0.1 2023-06-21 20:56:52,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-21 20:57:01,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1059858.0, ans=0.2 2023-06-21 20:57:02,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1059858.0, ans=0.125 2023-06-21 20:57:26,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-21 20:57:51,194 INFO [train.py:996] (3/4) Epoch 6, batch 24200, loss[loss=0.2554, simple_loss=0.3358, pruned_loss=0.08754, over 21770.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3324, pruned_loss=0.09589, over 4272322.76 frames. ], batch size: 282, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:58:05,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.135e+02 3.604e+02 4.507e+02 8.443e+02, threshold=7.208e+02, percent-clipped=5.0 2023-06-21 20:58:36,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=15.0 2023-06-21 20:58:45,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1060218.0, ans=0.0 2023-06-21 20:59:30,804 INFO [train.py:996] (3/4) Epoch 6, batch 24250, loss[loss=0.2425, simple_loss=0.3522, pruned_loss=0.06637, over 21217.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3292, pruned_loss=0.08885, over 4272041.09 frames. 
], batch size: 548, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:00:17,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-21 21:00:29,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-21 21:00:57,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-21 21:01:03,897 INFO [train.py:996] (3/4) Epoch 6, batch 24300, loss[loss=0.2594, simple_loss=0.3305, pruned_loss=0.09414, over 21569.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3204, pruned_loss=0.08222, over 4275083.17 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:01:12,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.484e+02 3.071e+02 3.742e+02 5.232e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-21 21:01:31,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-21 21:01:56,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1060818.0, ans=0.125 2023-06-21 21:02:29,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060878.0, ans=0.125 2023-06-21 21:02:37,549 INFO [train.py:996] (3/4) Epoch 6, batch 24350, loss[loss=0.3059, simple_loss=0.3539, pruned_loss=0.1289, over 21721.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3161, pruned_loss=0.08195, over 4275348.86 frames. ], batch size: 473, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:02:50,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-21 21:03:23,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061058.0, ans=0.125 2023-06-21 21:03:25,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1061058.0, ans=0.0 2023-06-21 21:04:16,763 INFO [train.py:996] (3/4) Epoch 6, batch 24400, loss[loss=0.2752, simple_loss=0.3429, pruned_loss=0.1038, over 21557.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3222, pruned_loss=0.08661, over 4272939.23 frames. ], batch size: 389, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 21:04:25,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.143e+02 3.570e+02 4.226e+02 5.954e+02, threshold=7.140e+02, percent-clipped=0.0 2023-06-21 21:04:37,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1061298.0, ans=0.0 2023-06-21 21:04:48,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. 
limit=22.5 2023-06-21 21:05:31,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1061478.0, ans=10.0 2023-06-21 21:05:33,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1061478.0, ans=0.125 2023-06-21 21:05:51,482 INFO [train.py:996] (3/4) Epoch 6, batch 24450, loss[loss=0.3267, simple_loss=0.4056, pruned_loss=0.124, over 21642.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3266, pruned_loss=0.08885, over 4273562.33 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:06:21,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1061658.0, ans=0.0 2023-06-21 21:06:22,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-21 21:06:23,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-21 21:06:32,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061658.0, ans=0.1 2023-06-21 21:06:40,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-21 21:07:19,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1061778.0, ans=0.07 2023-06-21 21:07:20,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1061778.0, ans=0.2 2023-06-21 21:07:24,540 INFO [train.py:996] (3/4) Epoch 6, batch 24500, loss[loss=0.2581, simple_loss=0.318, pruned_loss=0.09905, over 20208.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3264, pruned_loss=0.08851, over 4281833.69 frames. ], batch size: 707, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:07:33,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.842e+02 3.184e+02 3.780e+02 5.341e+02, threshold=6.369e+02, percent-clipped=0.0 2023-06-21 21:07:35,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061838.0, ans=0.1 2023-06-21 21:07:38,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1061898.0, ans=0.125 2023-06-21 21:08:46,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062078.0, ans=0.1 2023-06-21 21:08:58,972 INFO [train.py:996] (3/4) Epoch 6, batch 24550, loss[loss=0.2889, simple_loss=0.3617, pruned_loss=0.108, over 21935.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3281, pruned_loss=0.09045, over 4284112.93 frames. ], batch size: 372, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:21,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-21 21:10:33,716 INFO [train.py:996] (3/4) Epoch 6, batch 24600, loss[loss=0.2046, simple_loss=0.2649, pruned_loss=0.07218, over 21480.00 frames. 
], tot_loss[loss=0.254, simple_loss=0.3248, pruned_loss=0.09157, over 4269576.00 frames. ], batch size: 132, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:44,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.960e+02 3.461e+02 4.086e+02 6.859e+02, threshold=6.922e+02, percent-clipped=1.0 2023-06-21 21:11:18,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1062558.0, ans=0.125 2023-06-21 21:11:45,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1062618.0, ans=0.2 2023-06-21 21:12:03,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062678.0, ans=0.1 2023-06-21 21:12:05,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1062678.0, ans=0.07 2023-06-21 21:12:08,342 INFO [train.py:996] (3/4) Epoch 6, batch 24650, loss[loss=0.2159, simple_loss=0.2762, pruned_loss=0.07784, over 21753.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3168, pruned_loss=0.08974, over 4267530.26 frames. ], batch size: 300, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:12:14,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1062738.0, ans=0.125 2023-06-21 21:13:14,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1062918.0, ans=0.125 2023-06-21 21:13:24,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1062918.0, ans=15.0 2023-06-21 21:13:26,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1062978.0, ans=0.125 2023-06-21 21:13:41,568 INFO [train.py:996] (3/4) Epoch 6, batch 24700, loss[loss=0.2406, simple_loss=0.3792, pruned_loss=0.05102, over 19894.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3149, pruned_loss=0.08766, over 4267713.74 frames. 
], batch size: 702, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:13:47,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1063038.0, ans=0.0 2023-06-21 21:13:51,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 2.793e+02 3.149e+02 3.525e+02 6.939e+02, threshold=6.298e+02, percent-clipped=1.0 2023-06-21 21:14:04,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1063098.0, ans=0.0 2023-06-21 21:14:21,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1063098.0, ans=0.125 2023-06-21 21:14:29,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1063158.0, ans=0.0 2023-06-21 21:14:31,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1063158.0, ans=0.125 2023-06-21 21:14:56,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1063218.0, ans=0.125 2023-06-21 21:15:09,956 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:15:15,545 INFO [train.py:996] (3/4) Epoch 6, batch 24750, loss[loss=0.1852, simple_loss=0.2451, pruned_loss=0.06264, over 21501.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3087, pruned_loss=0.08548, over 4270550.46 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:15:23,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1063338.0, ans=0.125 2023-06-21 21:15:47,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1063398.0, ans=0.2 2023-06-21 21:15:55,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2023-06-21 21:16:04,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1063458.0, ans=0.04949747468305833 2023-06-21 21:16:49,356 INFO [train.py:996] (3/4) Epoch 6, batch 24800, loss[loss=0.2265, simple_loss=0.2918, pruned_loss=0.08057, over 21910.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3038, pruned_loss=0.08552, over 4279983.87 frames. ], batch size: 316, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:17:07,680 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.811e+02 3.326e+02 3.870e+02 1.010e+03, threshold=6.653e+02, percent-clipped=1.0 2023-06-21 21:17:21,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1063698.0, ans=0.0 2023-06-21 21:17:44,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063758.0, ans=0.1 2023-06-21 21:18:22,756 INFO [train.py:996] (3/4) Epoch 6, batch 24850, loss[loss=0.2809, simple_loss=0.3382, pruned_loss=0.1118, over 20211.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3048, pruned_loss=0.08722, over 4277214.88 frames. 
], batch size: 702, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:18:31,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-21 21:19:03,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1063998.0, ans=0.0 2023-06-21 21:19:09,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=8.0 2023-06-21 21:19:22,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1064118.0, ans=0.125 2023-06-21 21:19:25,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1064118.0, ans=0.0 2023-06-21 21:19:51,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1064178.0, ans=0.2 2023-06-21 21:19:57,068 INFO [train.py:996] (3/4) Epoch 6, batch 24900, loss[loss=0.2459, simple_loss=0.3197, pruned_loss=0.08608, over 21949.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.309, pruned_loss=0.08828, over 4283580.53 frames. ], batch size: 316, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:19:58,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1064238.0, ans=0.0 2023-06-21 21:20:14,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1064238.0, ans=0.125 2023-06-21 21:20:15,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.136e+02 3.665e+02 4.988e+02 9.346e+02, threshold=7.330e+02, percent-clipped=11.0 2023-06-21 21:20:45,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1064358.0, ans=0.125 2023-06-21 21:21:01,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1064418.0, ans=0.125 2023-06-21 21:21:37,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1064538.0, ans=0.0 2023-06-21 21:21:38,221 INFO [train.py:996] (3/4) Epoch 6, batch 24950, loss[loss=0.3568, simple_loss=0.4015, pruned_loss=0.1561, over 21405.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3171, pruned_loss=0.09207, over 4281139.63 frames. ], batch size: 471, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:23:05,369 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:23:18,865 INFO [train.py:996] (3/4) Epoch 6, batch 25000, loss[loss=0.2483, simple_loss=0.3094, pruned_loss=0.09359, over 21282.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3236, pruned_loss=0.094, over 4271430.73 frames. 
], batch size: 549, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:23:34,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1064838.0, ans=0.0 2023-06-21 21:23:36,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.934e+02 3.469e+02 4.480e+02 7.234e+02, threshold=6.939e+02, percent-clipped=0.0 2023-06-21 21:23:55,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1064898.0, ans=0.0 2023-06-21 21:23:59,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064958.0, ans=0.1 2023-06-21 21:24:14,614 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:24:20,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1065018.0, ans=0.0 2023-06-21 21:24:41,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1065078.0, ans=0.07 2023-06-21 21:24:52,474 INFO [train.py:996] (3/4) Epoch 6, batch 25050, loss[loss=0.24, simple_loss=0.2909, pruned_loss=0.09452, over 21514.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3149, pruned_loss=0.0915, over 4271848.22 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:25:52,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-21 21:26:27,039 INFO [train.py:996] (3/4) Epoch 6, batch 25100, loss[loss=0.2268, simple_loss=0.3078, pruned_loss=0.07285, over 21264.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3078, pruned_loss=0.08913, over 4273495.53 frames. ], batch size: 176, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:26:38,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1065438.0, ans=0.0 2023-06-21 21:26:45,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.865e+02 3.430e+02 4.483e+02 9.616e+02, threshold=6.861e+02, percent-clipped=4.0 2023-06-21 21:26:53,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-21 21:28:01,885 INFO [train.py:996] (3/4) Epoch 6, batch 25150, loss[loss=0.216, simple_loss=0.3032, pruned_loss=0.06439, over 21355.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3116, pruned_loss=0.08662, over 4278548.42 frames. 
], batch size: 159, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:28:09,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065738.0, ans=0.1 2023-06-21 21:28:27,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1065798.0, ans=0.0 2023-06-21 21:29:11,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1065978.0, ans=0.09899494936611666 2023-06-21 21:29:30,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1066038.0, ans=0.0 2023-06-21 21:29:32,226 INFO [train.py:996] (3/4) Epoch 6, batch 25200, loss[loss=0.2265, simple_loss=0.316, pruned_loss=0.06851, over 21723.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3109, pruned_loss=0.0842, over 4275465.94 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:29:41,842 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:29:43,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1066038.0, ans=0.0 2023-06-21 21:29:55,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.627e+02 3.080e+02 3.902e+02 5.113e+02, threshold=6.160e+02, percent-clipped=0.0 2023-06-21 21:29:59,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1066098.0, ans=0.125 2023-06-21 21:30:49,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1066278.0, ans=0.2 2023-06-21 21:31:01,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1066278.0, ans=0.2 2023-06-21 21:31:06,402 INFO [train.py:996] (3/4) Epoch 6, batch 25250, loss[loss=0.2605, simple_loss=0.3154, pruned_loss=0.1028, over 21744.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3103, pruned_loss=0.08349, over 4281874.49 frames. ], batch size: 317, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:31:13,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066338.0, ans=0.1 2023-06-21 21:32:22,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1066578.0, ans=0.125 2023-06-21 21:32:46,513 INFO [train.py:996] (3/4) Epoch 6, batch 25300, loss[loss=0.2535, simple_loss=0.3218, pruned_loss=0.09253, over 21434.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.308, pruned_loss=0.08371, over 4280507.68 frames. ], batch size: 211, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:32:50,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.75 vs. 
limit=15.0 2023-06-21 21:33:01,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1066638.0, ans=0.125 2023-06-21 21:33:05,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.867e+02 3.250e+02 3.935e+02 6.834e+02, threshold=6.501e+02, percent-clipped=3.0 2023-06-21 21:33:56,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-21 21:34:21,341 INFO [train.py:996] (3/4) Epoch 6, batch 25350, loss[loss=0.1943, simple_loss=0.277, pruned_loss=0.05577, over 21382.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3099, pruned_loss=0.08316, over 4270207.05 frames. ], batch size: 211, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:34:38,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-21 21:34:50,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1066998.0, ans=0.125 2023-06-21 21:35:13,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1067058.0, ans=10.0 2023-06-21 21:35:49,713 INFO [train.py:996] (3/4) Epoch 6, batch 25400, loss[loss=0.2206, simple_loss=0.2942, pruned_loss=0.07343, over 21677.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3045, pruned_loss=0.08167, over 4259730.19 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:35:57,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1067238.0, ans=0.0 2023-06-21 21:36:13,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.660e+02 3.051e+02 3.605e+02 5.899e+02, threshold=6.102e+02, percent-clipped=0.0 2023-06-21 21:36:13,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1067298.0, ans=0.125 2023-06-21 21:36:33,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1067358.0, ans=0.0 2023-06-21 21:36:45,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1067358.0, ans=0.0 2023-06-21 21:36:58,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1067418.0, ans=0.125 2023-06-21 21:37:17,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-21 21:37:30,717 INFO [train.py:996] (3/4) Epoch 6, batch 25450, loss[loss=0.1859, simple_loss=0.268, pruned_loss=0.05194, over 20709.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3052, pruned_loss=0.08214, over 4253380.71 frames. 
], batch size: 607, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:37:43,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1067538.0, ans=0.2 2023-06-21 21:37:51,014 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:37:51,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1067598.0, ans=0.125 2023-06-21 21:37:57,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1067598.0, ans=0.2 2023-06-21 21:38:00,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1067598.0, ans=0.125 2023-06-21 21:38:12,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1067658.0, ans=0.125 2023-06-21 21:38:34,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1067718.0, ans=0.5 2023-06-21 21:38:56,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1067778.0, ans=0.125 2023-06-21 21:39:10,617 INFO [train.py:996] (3/4) Epoch 6, batch 25500, loss[loss=0.2269, simple_loss=0.3107, pruned_loss=0.0716, over 21732.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3078, pruned_loss=0.08045, over 4259884.78 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:39:25,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.857e+02 3.431e+02 4.303e+02 7.136e+02, threshold=6.862e+02, percent-clipped=5.0 2023-06-21 21:39:26,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1067898.0, ans=0.125 2023-06-21 21:39:31,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1067898.0, ans=0.0 2023-06-21 21:39:51,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1067958.0, ans=0.1 2023-06-21 21:40:09,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1068018.0, ans=0.125 2023-06-21 21:40:09,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068018.0, ans=0.1 2023-06-21 21:40:17,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068018.0, ans=0.1 2023-06-21 21:40:23,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068018.0, ans=0.1 2023-06-21 21:40:26,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1068078.0, ans=0.125 2023-06-21 21:40:31,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1068078.0, ans=0.0 2023-06-21 21:40:44,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068138.0, ans=0.1 2023-06-21 21:40:45,937 INFO [train.py:996] (3/4) Epoch 6, batch 
25550, loss[loss=0.2492, simple_loss=0.3412, pruned_loss=0.07858, over 21714.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3126, pruned_loss=0.07979, over 4248811.71 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:41:56,889 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:41:58,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1068318.0, ans=0.125 2023-06-21 21:42:19,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1068438.0, ans=0.0 2023-06-21 21:42:21,197 INFO [train.py:996] (3/4) Epoch 6, batch 25600, loss[loss=0.3335, simple_loss=0.3887, pruned_loss=0.1391, over 21484.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3178, pruned_loss=0.08098, over 4253023.06 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:42:23,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1068438.0, ans=0.0 2023-06-21 21:42:40,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1068498.0, ans=0.125 2023-06-21 21:42:41,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.868e+02 3.276e+02 3.835e+02 9.464e+02, threshold=6.552e+02, percent-clipped=3.0 2023-06-21 21:43:09,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1068558.0, ans=0.5 2023-06-21 21:43:22,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1068618.0, ans=0.125 2023-06-21 21:43:41,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1068678.0, ans=0.0 2023-06-21 21:43:52,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-21 21:43:56,052 INFO [train.py:996] (3/4) Epoch 6, batch 25650, loss[loss=0.2856, simple_loss=0.3698, pruned_loss=0.1007, over 19937.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3188, pruned_loss=0.08346, over 4252191.38 frames. ], batch size: 702, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:43:56,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=12.0 2023-06-21 21:44:22,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1068798.0, ans=0.015 2023-06-21 21:44:28,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068858.0, ans=0.1 2023-06-21 21:44:41,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-21 21:44:43,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1068858.0, ans=0.125 2023-06-21 21:44:48,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1068918.0, ans=0.05 2023-06-21 21:45:02,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1068918.0, ans=0.125 2023-06-21 21:45:13,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1068978.0, ans=0.0 2023-06-21 21:45:28,618 INFO [train.py:996] (3/4) Epoch 6, batch 25700, loss[loss=0.2561, simple_loss=0.3307, pruned_loss=0.09073, over 21883.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3158, pruned_loss=0.0853, over 4263703.80 frames. ], batch size: 316, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:45:48,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 2.859e+02 3.225e+02 3.794e+02 7.100e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-21 21:45:58,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1069098.0, ans=0.0 2023-06-21 21:46:48,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1069278.0, ans=0.0 2023-06-21 21:47:00,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1069278.0, ans=0.125 2023-06-21 21:47:05,090 INFO [train.py:996] (3/4) Epoch 6, batch 25750, loss[loss=0.4061, simple_loss=0.4709, pruned_loss=0.1706, over 21460.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3214, pruned_loss=0.08877, over 4249271.74 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:47:11,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1069338.0, ans=0.0 2023-06-21 21:48:50,361 INFO [train.py:996] (3/4) Epoch 6, batch 25800, loss[loss=0.2974, simple_loss=0.3636, pruned_loss=0.1156, over 21311.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3318, pruned_loss=0.09394, over 4250889.38 frames. ], batch size: 143, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:49:10,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.168e+02 3.870e+02 4.969e+02 1.145e+03, threshold=7.739e+02, percent-clipped=13.0 2023-06-21 21:49:23,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1069698.0, ans=0.125 2023-06-21 21:49:41,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1069758.0, ans=0.125 2023-06-21 21:50:25,965 INFO [train.py:996] (3/4) Epoch 6, batch 25850, loss[loss=0.2775, simple_loss=0.3461, pruned_loss=0.1045, over 21839.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3339, pruned_loss=0.09329, over 4252598.20 frames. ], batch size: 118, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:50:39,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1069938.0, ans=0.125 2023-06-21 21:51:15,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. 
limit=15.0 2023-06-21 21:51:47,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1070178.0, ans=0.0 2023-06-21 21:51:55,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1070178.0, ans=0.02 2023-06-21 21:52:10,924 INFO [train.py:996] (3/4) Epoch 6, batch 25900, loss[loss=0.2735, simple_loss=0.3605, pruned_loss=0.09321, over 21710.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3351, pruned_loss=0.09359, over 4260038.32 frames. ], batch size: 247, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:52:25,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.005e+02 3.553e+02 4.246e+02 7.646e+02, threshold=7.106e+02, percent-clipped=0.0 2023-06-21 21:52:52,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1070358.0, ans=0.1 2023-06-21 21:53:28,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-21 21:53:41,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1070478.0, ans=0.035 2023-06-21 21:53:45,639 INFO [train.py:996] (3/4) Epoch 6, batch 25950, loss[loss=0.2588, simple_loss=0.3315, pruned_loss=0.09307, over 21847.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3433, pruned_loss=0.09752, over 4258584.48 frames. ], batch size: 316, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:53:59,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.52 vs. limit=10.0 2023-06-21 21:54:15,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1070598.0, ans=0.125 2023-06-21 21:54:16,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1070598.0, ans=0.0 2023-06-21 21:54:57,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1070718.0, ans=0.0 2023-06-21 21:55:00,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1070778.0, ans=0.125 2023-06-21 21:55:09,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1070778.0, ans=0.125 2023-06-21 21:55:21,663 INFO [train.py:996] (3/4) Epoch 6, batch 26000, loss[loss=0.2498, simple_loss=0.3314, pruned_loss=0.08416, over 21293.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3417, pruned_loss=0.09442, over 4256287.87 frames. 
], batch size: 549, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:55:41,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 3.120e+02 3.589e+02 4.615e+02 8.181e+02, threshold=7.178e+02, percent-clipped=1.0 2023-06-21 21:55:46,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1070898.0, ans=0.2 2023-06-21 21:56:06,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1070958.0, ans=0.2 2023-06-21 21:56:12,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-21 21:56:28,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1071018.0, ans=0.0 2023-06-21 21:56:55,913 INFO [train.py:996] (3/4) Epoch 6, batch 26050, loss[loss=0.2681, simple_loss=0.3248, pruned_loss=0.1057, over 21920.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3418, pruned_loss=0.09665, over 4264260.92 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:56:58,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-21 21:57:01,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-21 21:57:10,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1071138.0, ans=0.0 2023-06-21 21:57:23,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1071198.0, ans=0.125 2023-06-21 21:57:27,905 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:57:44,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1071258.0, ans=0.125 2023-06-21 21:57:49,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-21 21:57:52,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1071318.0, ans=0.125 2023-06-21 21:58:01,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1071318.0, ans=0.125 2023-06-21 21:58:20,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1071378.0, ans=0.0 2023-06-21 21:58:29,346 INFO [train.py:996] (3/4) Epoch 6, batch 26100, loss[loss=0.2372, simple_loss=0.3059, pruned_loss=0.08423, over 21916.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3357, pruned_loss=0.09601, over 4269404.59 frames. ], batch size: 371, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:58:33,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.47 vs. 
limit=22.5 2023-06-21 21:58:49,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.066e+02 3.551e+02 4.321e+02 9.246e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 21:58:57,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1071498.0, ans=0.04949747468305833 2023-06-21 22:00:03,476 INFO [train.py:996] (3/4) Epoch 6, batch 26150, loss[loss=0.2754, simple_loss=0.3408, pruned_loss=0.105, over 21746.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3332, pruned_loss=0.09641, over 4281965.79 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:00:24,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071798.0, ans=0.125 2023-06-21 22:01:21,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-21 22:01:31,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1071978.0, ans=0.0 2023-06-21 22:01:40,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1071978.0, ans=0.5 2023-06-21 22:01:43,202 INFO [train.py:996] (3/4) Epoch 6, batch 26200, loss[loss=0.3081, simple_loss=0.3996, pruned_loss=0.1083, over 21697.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.335, pruned_loss=0.09442, over 4282025.78 frames. ], batch size: 414, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:01:58,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.878e+02 3.123e+02 3.619e+02 5.924e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-21 22:02:56,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1072278.0, ans=0.5 2023-06-21 22:03:08,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072278.0, ans=0.1 2023-06-21 22:03:10,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-21 22:03:17,229 INFO [train.py:996] (3/4) Epoch 6, batch 26250, loss[loss=0.2263, simple_loss=0.301, pruned_loss=0.07578, over 21831.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3387, pruned_loss=0.09442, over 4283975.63 frames. ], batch size: 282, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:03:20,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1072338.0, ans=0.125 2023-06-21 22:04:13,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072458.0, ans=0.1 2023-06-21 22:04:27,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1072518.0, ans=0.2 2023-06-21 22:04:50,862 INFO [train.py:996] (3/4) Epoch 6, batch 26300, loss[loss=0.2415, simple_loss=0.3034, pruned_loss=0.08976, over 21898.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3342, pruned_loss=0.09432, over 4285106.50 frames. 
2023-06-21 22:05:04,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1072638.0, ans=0.125
2023-06-21 22:05:10,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.912e+02 3.361e+02 4.041e+02 6.857e+02, threshold=6.722e+02, percent-clipped=1.0
2023-06-21 22:06:28,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1072938.0, ans=0.0
2023-06-21 22:06:29,488 INFO [train.py:996] (3/4) Epoch 6, batch 26350, loss[loss=0.2531, simple_loss=0.3211, pruned_loss=0.09255, over 21348.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3334, pruned_loss=0.09491, over 4295131.58 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 32.0
2023-06-21 22:06:31,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1072938.0, ans=0.0
2023-06-21 22:07:04,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=12.0
2023-06-21 22:08:02,658 INFO [train.py:996] (3/4) Epoch 6, batch 26400, loss[loss=0.2328, simple_loss=0.2845, pruned_loss=0.09055, over 21306.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3257, pruned_loss=0.09403, over 4280995.47 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 32.0
2023-06-21 22:08:20,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0
2023-06-21 22:08:22,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.009e+02 3.283e+02 3.744e+02 6.986e+02, threshold=6.566e+02, percent-clipped=1.0
2023-06-21 22:09:05,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-21 22:09:36,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1073478.0, ans=0.125
2023-06-21 22:09:43,568 INFO [train.py:996] (3/4) Epoch 6, batch 26450, loss[loss=0.3265, simple_loss=0.4209, pruned_loss=0.116, over 21519.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3292, pruned_loss=0.09457, over 4277021.62 frames. ], batch size: 471, lr: 4.94e-03, grad_scale: 32.0
2023-06-21 22:10:36,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1073658.0, ans=0.125
2023-06-21 22:10:49,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1073718.0, ans=0.125
2023-06-21 22:11:16,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1073778.0, ans=0.95
2023-06-21 22:11:17,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0
2023-06-21 22:11:23,582 INFO [train.py:996] (3/4) Epoch 6, batch 26500, loss[loss=0.1964, simple_loss=0.2592, pruned_loss=0.06683, over 21260.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3286, pruned_loss=0.09279, over 4270354.60 frames. ], batch size: 159, lr: 4.94e-03, grad_scale: 32.0
2023-06-21 22:11:28,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1073838.0, ans=0.0
2023-06-21 22:11:38,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.232e+02 3.914e+02 4.900e+02 8.574e+02, threshold=7.829e+02, percent-clipped=7.0
2023-06-21 22:11:50,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0
2023-06-21 22:12:48,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1074078.0, ans=0.125
2023-06-21 22:12:59,952 INFO [train.py:996] (3/4) Epoch 6, batch 26550, loss[loss=0.3011, simple_loss=0.3591, pruned_loss=0.1215, over 20009.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3269, pruned_loss=0.08907, over 4265660.42 frames. ], batch size: 702, lr: 4.94e-03, grad_scale: 32.0
2023-06-21 22:13:35,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1074198.0, ans=0.05
2023-06-21 22:13:40,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0
2023-06-21 22:14:06,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1074318.0, ans=0.0
2023-06-21 22:14:16,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1074318.0, ans=0.2
2023-06-21 22:14:19,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074378.0, ans=0.1
2023-06-21 22:14:24,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0
2023-06-21 22:14:25,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074378.0, ans=0.1
2023-06-21 22:14:34,293 INFO [train.py:996] (3/4) Epoch 6, batch 26600, loss[loss=0.2281, simple_loss=0.3014, pruned_loss=0.07745, over 21496.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.324, pruned_loss=0.08567, over 4258176.47 frames. ], batch size: 389, lr: 4.93e-03, grad_scale: 32.0
2023-06-21 22:15:00,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.868e+02 3.429e+02 4.174e+02 7.700e+02, threshold=6.858e+02, percent-clipped=0.0
2023-06-21 22:16:13,048 INFO [train.py:996] (3/4) Epoch 6, batch 26650, loss[loss=0.2071, simple_loss=0.2532, pruned_loss=0.08052, over 20071.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3165, pruned_loss=0.08443, over 4256583.67 frames. ], batch size: 704, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:16:42,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1074798.0, ans=0.5
2023-06-21 22:17:05,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1074858.0, ans=0.125
2023-06-21 22:17:50,850 INFO [train.py:996] (3/4) Epoch 6, batch 26700, loss[loss=0.2494, simple_loss=0.3168, pruned_loss=0.09101, over 21790.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3075, pruned_loss=0.0802, over 4254424.94 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 16.0
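The train.py:996 records break each batch's loss into simple_loss and pruned_loss. With the simple_loss_scale of 0.5 from the training configuration, the logged values satisfy loss = 0.5 * simple_loss + pruned_loss (for batch 26300 above: 0.5 x 0.3034 + 0.08976 = 0.2415), and tot_loss is a frame-weighted running average over recent batches. A minimal sketch of that accounting, leaving out the warm-up-dependent scaling icefall applies early in training:

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # matches the per-batch "loss" column in these records
    return simple_loss_scale * simple_loss + pruned_loss

class RunningLoss:
    """Frame-weighted average, like tot_loss[..., over N frames.]."""
    def __init__(self):
        self.sum = 0.0
        self.frames = 0.0
    def update(self, loss_value, num_frames):
        self.sum += loss_value * num_frames
        self.frames += num_frames
    @property
    def avg(self):
        return self.sum / max(self.frames, 1.0)

The frame counts in the log ("over 21790.00 frames" per batch, millions of frames for tot_loss) are what this weighting accumulates.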
2023-06-21 22:17:54,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.50 vs. limit=15.0
2023-06-21 22:18:03,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1075038.0, ans=0.125
2023-06-21 22:18:07,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.839e+02 3.537e+02 4.249e+02 6.809e+02, threshold=7.074e+02, percent-clipped=0.0
2023-06-21 22:19:21,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1075278.0, ans=0.1
2023-06-21 22:19:25,050 INFO [train.py:996] (3/4) Epoch 6, batch 26750, loss[loss=0.2831, simple_loss=0.3424, pruned_loss=0.1119, over 21896.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3082, pruned_loss=0.07992, over 4266300.23 frames. ], batch size: 351, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:19:30,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075338.0, ans=0.1
2023-06-21 22:20:05,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1075458.0, ans=0.125
2023-06-21 22:21:00,198 INFO [train.py:996] (3/4) Epoch 6, batch 26800, loss[loss=0.2789, simple_loss=0.343, pruned_loss=0.1074, over 21974.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3149, pruned_loss=0.08358, over 4270476.52 frames. ], batch size: 317, lr: 4.93e-03, grad_scale: 32.0
2023-06-21 22:21:04,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0
2023-06-21 22:21:25,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.803e+02 3.255e+02 3.983e+02 6.627e+02, threshold=6.510e+02, percent-clipped=0.0
2023-06-21 22:21:41,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1075758.0, ans=0.2
2023-06-21 22:21:48,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1075758.0, ans=0.125
2023-06-21 22:22:16,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5
2023-06-21 22:22:38,519 INFO [train.py:996] (3/4) Epoch 6, batch 26850, loss[loss=0.2139, simple_loss=0.2759, pruned_loss=0.07593, over 21825.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3179, pruned_loss=0.08694, over 4268015.38 frames. ], batch size: 98, lr: 4.93e-03, grad_scale: 32.0
2023-06-21 22:22:39,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0
2023-06-21 22:23:18,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1076058.0, ans=0.0
2023-06-21 22:23:38,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1076118.0, ans=0.125
2023-06-21 22:24:06,564 INFO [train.py:996] (3/4) Epoch 6, batch 26900, loss[loss=0.2163, simple_loss=0.2706, pruned_loss=0.08097, over 21604.00 frames. ], tot_loss[loss=0.241, simple_loss=0.31, pruned_loss=0.08599, over 4262226.94 frames. ], batch size: 231, lr: 4.93e-03, grad_scale: 32.0
2023-06-21 22:24:08,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1076238.0, ans=0.125
2023-06-21 22:24:09,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1076238.0, ans=0.125
2023-06-21 22:24:32,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 2.927e+02 3.403e+02 4.314e+02 6.686e+02, threshold=6.806e+02, percent-clipped=1.0
2023-06-21 22:24:47,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0
2023-06-21 22:25:00,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1076418.0, ans=0.2
2023-06-21 22:25:26,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076478.0, ans=0.1
2023-06-21 22:25:30,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1076478.0, ans=0.05
2023-06-21 22:25:35,524 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 22:25:39,971 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-21 22:25:40,908 INFO [train.py:996] (3/4) Epoch 6, batch 26950, loss[loss=0.2141, simple_loss=0.2601, pruned_loss=0.08407, over 20780.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.308, pruned_loss=0.08575, over 4267616.21 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 32.0
2023-06-21 22:25:41,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1076538.0, ans=0.125
2023-06-21 22:25:42,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1076538.0, ans=0.125
2023-06-21 22:26:07,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0
2023-06-21 22:26:26,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0
2023-06-21 22:26:49,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0
2023-06-21 22:26:50,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1076718.0, ans=0.2
2023-06-21 22:27:16,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5
2023-06-21 22:27:20,696 INFO [train.py:996] (3/4) Epoch 6, batch 27000, loss[loss=0.2164, simple_loss=0.3085, pruned_loss=0.06217, over 21750.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3085, pruned_loss=0.08399, over 4263624.47 frames. ], batch size: 316, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:27:20,696 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-21 22:27:39,472 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2469, simple_loss=0.3452, pruned_loss=0.07428, over 1796401.00 frames.
2023-06-21 22:27:39,473 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-21 22:27:57,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.871e+02 3.391e+02 3.871e+02 6.119e+02, threshold=6.783e+02, percent-clipped=0.0
2023-06-21 22:28:21,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0
2023-06-21 22:28:23,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076958.0, ans=0.1
2023-06-21 22:28:37,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077018.0, ans=0.1
2023-06-21 22:29:09,015 INFO [train.py:996] (3/4) Epoch 6, batch 27050, loss[loss=0.2226, simple_loss=0.3088, pruned_loss=0.06817, over 21685.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3116, pruned_loss=0.08048, over 4265919.32 frames. ], batch size: 247, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:29:24,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1077198.0, ans=0.0
2023-06-21 22:30:02,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0
2023-06-21 22:30:33,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077378.0, ans=0.1
2023-06-21 22:30:38,685 INFO [train.py:996] (3/4) Epoch 6, batch 27100, loss[loss=0.2386, simple_loss=0.3188, pruned_loss=0.07918, over 21805.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.314, pruned_loss=0.08241, over 4280393.80 frames. ], batch size: 247, lr: 4.93e-03, grad_scale: 8.0
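The "Computing validation loss" / "Epoch 6, validation: ..." pair above comes from a periodic pass over the dev set (every valid_interval batches, per the training configuration), with the peak CUDA memory reported alongside. A rough stand-in for that step; model, criterion and dev_loader are placeholder names, not the real train.py signature:

import torch

def compute_validation_loss(model, criterion, dev_loader, device):
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            # criterion is assumed to return (loss, num_frames) for the batch
            loss, num_frames = criterion(model, batch, device)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    return loss_sum / max(frames, 1.0)  # the "validation: loss=..." value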
2023-06-21 22:31:08,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.363e+02 4.112e+02 5.749e+02, threshold=6.726e+02, percent-clipped=0.0
2023-06-21 22:31:08,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1077498.0, ans=0.125
2023-06-21 22:31:12,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1077498.0, ans=0.125
2023-06-21 22:31:54,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1077618.0, ans=0.07
2023-06-21 22:32:13,369 INFO [train.py:996] (3/4) Epoch 6, batch 27150, loss[loss=0.2511, simple_loss=0.3403, pruned_loss=0.08099, over 21443.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3257, pruned_loss=0.08604, over 4287080.75 frames. ], batch size: 211, lr: 4.93e-03, grad_scale: 8.0
2023-06-21 22:32:43,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1077798.0, ans=0.0
2023-06-21 22:32:59,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1077858.0, ans=0.125
2023-06-21 22:32:59,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5
2023-06-21 22:33:05,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1077858.0, ans=0.125
2023-06-21 22:33:25,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0
2023-06-21 22:33:47,060 INFO [train.py:996] (3/4) Epoch 6, batch 27200, loss[loss=0.3849, simple_loss=0.4265, pruned_loss=0.1717, over 21392.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3343, pruned_loss=0.0892, over 4286863.68 frames. ], batch size: 508, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:34:15,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 3.235e+02 3.777e+02 4.284e+02 9.441e+02, threshold=7.555e+02, percent-clipped=8.0
2023-06-21 22:35:30,909 INFO [train.py:996] (3/4) Epoch 6, batch 27250, loss[loss=0.2651, simple_loss=0.328, pruned_loss=0.1011, over 21373.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3378, pruned_loss=0.09367, over 4287817.12 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 16.0
2023-06-21 22:36:05,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1078398.0, ans=0.125
2023-06-21 22:36:28,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0
2023-06-21 22:36:29,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1078518.0, ans=0.0
2023-06-21 22:37:06,680 INFO [train.py:996] (3/4) Epoch 6, batch 27300, loss[loss=0.2565, simple_loss=0.3453, pruned_loss=0.08383, over 21913.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3398, pruned_loss=0.09509, over 4286609.69 frames. ], batch size: 372, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:37:10,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1078638.0, ans=0.2
2023-06-21 22:37:30,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1078698.0, ans=0.125
2023-06-21 22:37:36,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.091e+02 3.407e+02 3.961e+02 5.625e+02, threshold=6.815e+02, percent-clipped=0.0
2023-06-21 22:38:16,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1078818.0, ans=0.125
2023-06-21 22:38:16,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1078818.0, ans=0.2
2023-06-21 22:38:18,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1078818.0, ans=0.0
2023-06-21 22:38:19,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0
2023-06-21 22:38:24,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1078878.0, ans=0.125
2023-06-21 22:38:32,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1078878.0, ans=0.0
2023-06-21 22:38:45,655 INFO [train.py:996] (3/4) Epoch 6, batch 27350, loss[loss=0.2529, simple_loss=0.3369, pruned_loss=0.08441, over 21780.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3428, pruned_loss=0.09643, over 4285247.38 frames. ], batch size: 332, lr: 4.92e-03, grad_scale: 8.0
2023-06-21 22:39:12,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1078998.0, ans=0.125
2023-06-21 22:39:41,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1079118.0, ans=0.2
2023-06-21 22:39:57,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1079178.0, ans=0.5
2023-06-21 22:40:00,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0
2023-06-21 22:40:16,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1079238.0, ans=0.2
2023-06-21 22:40:17,994 INFO [train.py:996] (3/4) Epoch 6, batch 27400, loss[loss=0.2128, simple_loss=0.2777, pruned_loss=0.07397, over 21659.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3358, pruned_loss=0.09439, over 4286076.34 frames. ], batch size: 247, lr: 4.92e-03, grad_scale: 8.0
2023-06-21 22:40:41,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079298.0, ans=0.1
2023-06-21 22:40:43,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 2.934e+02 3.230e+02 3.710e+02 5.363e+02, threshold=6.461e+02, percent-clipped=0.0
2023-06-21 22:40:47,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1079298.0, ans=0.125
2023-06-21 22:40:51,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1079358.0, ans=0.0
2023-06-21 22:40:54,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1079358.0, ans=0.0
2023-06-21 22:41:51,778 INFO [train.py:996] (3/4) Epoch 6, batch 27450, loss[loss=0.2435, simple_loss=0.3243, pruned_loss=0.08136, over 21600.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3297, pruned_loss=0.09208, over 4289797.50 frames. ], batch size: 263, lr: 4.92e-03, grad_scale: 8.0
2023-06-21 22:42:01,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0
2023-06-21 22:42:38,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1079658.0, ans=0.0
2023-06-21 22:43:20,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079778.0, ans=0.1
2023-06-21 22:43:24,726 INFO [train.py:996] (3/4) Epoch 6, batch 27500, loss[loss=0.23, simple_loss=0.302, pruned_loss=0.07903, over 21503.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3282, pruned_loss=0.09278, over 4294654.59 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 8.0
2023-06-21 22:43:44,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1079898.0, ans=0.2
2023-06-21 22:43:50,266 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 2.999e+02 3.729e+02 4.399e+02 9.645e+02, threshold=7.458e+02, percent-clipped=3.0
2023-06-21 22:44:05,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1079958.0, ans=0.015
2023-06-21 22:44:06,833 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 22:44:06,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1079958.0, ans=0.0
2023-06-21 22:44:52,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080078.0, ans=0.125
2023-06-21 22:44:59,245 INFO [train.py:996] (3/4) Epoch 6, batch 27550, loss[loss=0.2228, simple_loss=0.285, pruned_loss=0.08029, over 21502.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3232, pruned_loss=0.08927, over 4285659.88 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 8.0
2023-06-21 22:45:30,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1080198.0, ans=0.0
2023-06-21 22:45:56,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1080258.0, ans=0.125
2023-06-21 22:46:12,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1080318.0, ans=0.125
2023-06-21 22:46:14,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080318.0, ans=0.1
2023-06-21 22:46:37,806 INFO [train.py:996] (3/4) Epoch 6, batch 27600, loss[loss=0.1985, simple_loss=0.2675, pruned_loss=0.06477, over 21609.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3158, pruned_loss=0.08758, over 4279771.92 frames. ], batch size: 263, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:46:43,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5
2023-06-21 22:46:58,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.844e+02 3.346e+02 3.964e+02 7.072e+02, threshold=6.692e+02, percent-clipped=0.0
2023-06-21 22:47:12,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1080558.0, ans=0.125
2023-06-21 22:47:52,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0
2023-06-21 22:48:00,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1080678.0, ans=0.125
2023-06-21 22:48:05,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1080738.0, ans=0.0
2023-06-21 22:48:06,203 INFO [train.py:996] (3/4) Epoch 6, batch 27650, loss[loss=0.2104, simple_loss=0.2643, pruned_loss=0.07828, over 21406.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3102, pruned_loss=0.08699, over 4274676.11 frames. ], batch size: 160, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:48:12,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1080738.0, ans=0.125
2023-06-21 22:49:44,270 INFO [train.py:996] (3/4) Epoch 6, batch 27700, loss[loss=0.267, simple_loss=0.3456, pruned_loss=0.09422, over 21713.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3106, pruned_loss=0.08637, over 4280450.24 frames. ], batch size: 298, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:50:01,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1081098.0, ans=0.125
2023-06-21 22:50:05,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.845e+02 3.268e+02 3.924e+02 7.341e+02, threshold=6.535e+02, percent-clipped=1.0
2023-06-21 22:50:38,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0
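The scaling.py:962 records each compare a measured whitening metric against a limit for some submodule's activations, over num_groups groups of num_channels channels. One way to define such a metric is the ratio of the arithmetic to the geometric mean of the eigenvalues of the activation covariance: it is 1.0 for perfectly white (isotropic) features and grows as a few directions dominate. The sketch below illustrates that idea under those assumptions and is not icefall's exact formula:

import torch

def whitening_metric(x, num_groups=1, eps=1e-4):
    # x: (num_frames, num_channels); channels are split into groups,
    # mirroring the num_groups/num_channels fields in the log
    metrics = []
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0, keepdim=True)
        cov = (g.t() @ g) / g.shape[0]
        eigs = torch.linalg.eigvalsh(cov).clamp(min=eps)
        # arithmetic mean / geometric mean of the eigenvalues
        metrics.append((eigs.mean() / eigs.log().mean().exp()).item())
    return max(metrics)  # a module would compare this against its limit

A module built around such a metric would add a corrective gradient (or scale one up) whenever the measured value exceeds the configured limit, which is the comparison these log lines print.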
2023-06-21 22:50:58,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1081278.0, ans=0.125
2023-06-21 22:51:02,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0
2023-06-21 22:51:10,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0
2023-06-21 22:51:18,672 INFO [train.py:996] (3/4) Epoch 6, batch 27750, loss[loss=0.2272, simple_loss=0.3076, pruned_loss=0.07344, over 21431.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3133, pruned_loss=0.08535, over 4280415.43 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:51:49,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1081398.0, ans=0.125
2023-06-21 22:52:11,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1081518.0, ans=0.125
2023-06-21 22:52:51,594 INFO [train.py:996] (3/4) Epoch 6, batch 27800, loss[loss=0.2284, simple_loss=0.2923, pruned_loss=0.08223, over 20040.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3132, pruned_loss=0.08559, over 4283590.52 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:53:06,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0
2023-06-21 22:53:12,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.907e+02 3.249e+02 3.877e+02 6.679e+02, threshold=6.497e+02, percent-clipped=1.0
2023-06-21 22:54:25,871 INFO [train.py:996] (3/4) Epoch 6, batch 27850, loss[loss=0.2696, simple_loss=0.3321, pruned_loss=0.1035, over 21808.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3142, pruned_loss=0.08747, over 4291811.00 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:55:28,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1082118.0, ans=0.125
2023-06-21 22:55:31,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1082118.0, ans=0.2
2023-06-21 22:55:46,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1082178.0, ans=0.125
2023-06-21 22:55:52,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1082178.0, ans=0.0
2023-06-21 22:56:01,480 INFO [train.py:996] (3/4) Epoch 6, batch 27900, loss[loss=0.2607, simple_loss=0.3446, pruned_loss=0.08842, over 20042.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3247, pruned_loss=0.08946, over 4287492.03 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:56:17,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1082238.0, ans=0.2
2023-06-21 22:56:27,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.965e+02 3.401e+02 4.272e+02 8.717e+02, threshold=6.802e+02, percent-clipped=4.0
2023-06-21 22:56:55,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1082358.0, ans=0.125
2023-06-21 22:57:42,211 INFO [train.py:996] (3/4) Epoch 6, batch 27950, loss[loss=0.1985, simple_loss=0.281, pruned_loss=0.05797, over 21411.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3222, pruned_loss=0.0857, over 4275432.63 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:57:45,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1082538.0, ans=0.0
2023-06-21 22:57:45,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1082538.0, ans=0.07
2023-06-21 22:57:50,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1082538.0, ans=15.0
2023-06-21 22:58:37,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082658.0, ans=0.125
2023-06-21 22:58:37,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1082658.0, ans=0.0
2023-06-21 22:59:05,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082778.0, ans=0.125
2023-06-21 22:59:15,427 INFO [train.py:996] (3/4) Epoch 6, batch 28000, loss[loss=0.2505, simple_loss=0.3065, pruned_loss=0.09726, over 21433.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3202, pruned_loss=0.08422, over 4275808.48 frames. ], batch size: 144, lr: 4.92e-03, grad_scale: 16.0
2023-06-21 22:59:22,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1082838.0, ans=0.0
2023-06-21 22:59:43,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.927e+02 3.364e+02 4.265e+02 7.771e+02, threshold=6.727e+02, percent-clipped=2.0
2023-06-21 23:00:14,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1083018.0, ans=0.2
2023-06-21 23:00:50,682 INFO [train.py:996] (3/4) Epoch 6, batch 28050, loss[loss=0.2324, simple_loss=0.2831, pruned_loss=0.09088, over 21818.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3168, pruned_loss=0.0845, over 4273731.98 frames. ], batch size: 118, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:00:55,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1083138.0, ans=0.2
2023-06-21 23:01:41,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1083258.0, ans=0.0
2023-06-21 23:01:47,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083258.0, ans=0.1
2023-06-21 23:02:14,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1083378.0, ans=0.0
2023-06-21 23:02:29,345 INFO [train.py:996] (3/4) Epoch 6, batch 28100, loss[loss=0.2224, simple_loss=0.2822, pruned_loss=0.08132, over 21728.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.317, pruned_loss=0.08482, over 4275096.08 frames. ], batch size: 371, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:02:53,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.19 vs. limit=15.0
2023-06-21 23:03:00,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.119e+02 3.918e+02 4.692e+02 8.833e+02, threshold=7.836e+02, percent-clipped=5.0
2023-06-21 23:04:02,318 INFO [train.py:996] (3/4) Epoch 6, batch 28150, loss[loss=0.239, simple_loss=0.2913, pruned_loss=0.09338, over 21575.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3154, pruned_loss=0.0848, over 4262918.69 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:04:33,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1083798.0, ans=0.125
2023-06-21 23:04:38,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1083798.0, ans=0.125
2023-06-21 23:05:20,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1083978.0, ans=0.05
2023-06-21 23:05:40,579 INFO [train.py:996] (3/4) Epoch 6, batch 28200, loss[loss=0.2386, simple_loss=0.2876, pruned_loss=0.0948, over 21124.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3126, pruned_loss=0.08641, over 4271292.67 frames. ], batch size: 176, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:05:54,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1084038.0, ans=0.125
2023-06-21 23:06:07,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.112e+02 3.798e+02 4.464e+02 8.953e+02, threshold=7.596e+02, percent-clipped=1.0
2023-06-21 23:06:14,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.35 vs. limit=15.0
2023-06-21 23:07:13,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1084338.0, ans=0.0
2023-06-21 23:07:14,367 INFO [train.py:996] (3/4) Epoch 6, batch 28250, loss[loss=0.321, simple_loss=0.3566, pruned_loss=0.1427, over 21431.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3142, pruned_loss=0.08884, over 4264534.00 frames. ], batch size: 510, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:07:50,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0
2023-06-21 23:08:24,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1084578.0, ans=0.0
2023-06-21 23:08:50,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1084578.0, ans=0.125
2023-06-21 23:08:53,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084638.0, ans=0.1
2023-06-21 23:08:54,449 INFO [train.py:996] (3/4) Epoch 6, batch 28300, loss[loss=0.2173, simple_loss=0.287, pruned_loss=0.07377, over 21382.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3116, pruned_loss=0.08555, over 4263244.44 frames. ], batch size: 160, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:09:02,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1084638.0, ans=0.0
2023-06-21 23:09:17,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.236e+02 3.708e+02 8.201e+02, threshold=6.472e+02, percent-clipped=2.0
2023-06-21 23:10:10,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084878.0, ans=0.1
2023-06-21 23:10:18,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1084878.0, ans=0.125
2023-06-21 23:10:19,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0
2023-06-21 23:10:28,257 INFO [train.py:996] (3/4) Epoch 6, batch 28350, loss[loss=0.2271, simple_loss=0.2791, pruned_loss=0.08759, over 22004.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3045, pruned_loss=0.07933, over 4266679.51 frames. ], batch size: 103, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:10:37,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1084938.0, ans=0.0
2023-06-21 23:12:02,365 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 23:12:03,515 INFO [train.py:996] (3/4) Epoch 6, batch 28400, loss[loss=0.2518, simple_loss=0.3088, pruned_loss=0.09738, over 21688.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3021, pruned_loss=0.0797, over 4274672.44 frames. ], batch size: 112, lr: 4.91e-03, grad_scale: 32.0
2023-06-21 23:12:19,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1085298.0, ans=0.95
2023-06-21 23:12:25,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0
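The scaling.py:182 ScheduledFloat records show module hyperparameters (dropout rates, balancer probabilities, skip rates, bypass scale minima) re-evaluated as a function of batch_count. A sketch of the underlying idea, a float that interpolates piecewise-linearly between (batch_count, value) breakpoints; the class below is invented for illustration and omits the extra machinery of icefall's ScheduledFloat:

class PiecewiseLinearFloat:
    def __init__(self, *points):
        # points: (batch_count, value) pairs
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches:
dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))
assert abs(dropout_p(10000.0) - 0.2) < 1e-9

At batch_count around 1.08 million, as in these records, most schedules have long since reached their final values, which is why the logged ans fields are stable.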
2023-06-21 23:12:26,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.685e+02 3.251e+02 3.858e+02 5.974e+02, threshold=6.502e+02, percent-clipped=0.0
2023-06-21 23:12:26,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1085298.0, ans=0.125
2023-06-21 23:13:01,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1085418.0, ans=0.125
2023-06-21 23:13:27,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1085478.0, ans=0.125
2023-06-21 23:13:33,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085478.0, ans=0.1
2023-06-21 23:13:34,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1085478.0, ans=0.125
2023-06-21 23:13:37,438 INFO [train.py:996] (3/4) Epoch 6, batch 28450, loss[loss=0.202, simple_loss=0.2502, pruned_loss=0.07688, over 20733.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3071, pruned_loss=0.08391, over 4280718.49 frames. ], batch size: 607, lr: 4.91e-03, grad_scale: 32.0
2023-06-21 23:13:43,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1085538.0, ans=0.0
2023-06-21 23:13:51,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0
2023-06-21 23:14:59,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1085778.0, ans=0.125
2023-06-21 23:15:10,660 INFO [train.py:996] (3/4) Epoch 6, batch 28500, loss[loss=0.2225, simple_loss=0.2959, pruned_loss=0.07455, over 21879.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3104, pruned_loss=0.08656, over 4286802.23 frames. ], batch size: 124, lr: 4.91e-03, grad_scale: 32.0
2023-06-21 23:15:38,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.148e+02 3.464e+02 4.022e+02 7.400e+02, threshold=6.927e+02, percent-clipped=1.0
2023-06-21 23:15:49,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1085958.0, ans=0.125
2023-06-21 23:16:26,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1086018.0, ans=0.125
2023-06-21 23:16:45,765 INFO [train.py:996] (3/4) Epoch 6, batch 28550, loss[loss=0.2718, simple_loss=0.3466, pruned_loss=0.09845, over 21282.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3182, pruned_loss=0.08926, over 4286981.65 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 32.0
2023-06-21 23:17:06,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0
2023-06-21 23:17:27,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086198.0, ans=0.125
2023-06-21 23:18:20,980 INFO [train.py:996] (3/4) Epoch 6, batch 28600, loss[loss=0.2211, simple_loss=0.3007, pruned_loss=0.07079, over 21734.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3253, pruned_loss=0.09161, over 4282975.76 frames. ], batch size: 247, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:18:24,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1086438.0, ans=0.125
2023-06-21 23:18:57,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086498.0, ans=0.1
2023-06-21 23:18:58,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.626e+02 3.164e+02 3.571e+02 4.573e+02 8.343e+02, threshold=7.141e+02, percent-clipped=3.0
2023-06-21 23:19:45,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1086678.0, ans=0.2
2023-06-21 23:19:59,185 INFO [train.py:996] (3/4) Epoch 6, batch 28650, loss[loss=0.2259, simple_loss=0.2814, pruned_loss=0.08516, over 21538.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3196, pruned_loss=0.09066, over 4277399.36 frames. ], batch size: 196, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:20:07,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0
2023-06-21 23:20:12,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1086738.0, ans=0.0
2023-06-21 23:20:31,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0
2023-06-21 23:20:47,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1086858.0, ans=0.0
2023-06-21 23:21:01,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1086918.0, ans=0.0
2023-06-21 23:21:01,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1086918.0, ans=0.07
2023-06-21 23:21:38,616 INFO [train.py:996] (3/4) Epoch 6, batch 28700, loss[loss=0.2943, simple_loss=0.3559, pruned_loss=0.1163, over 21650.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3185, pruned_loss=0.09181, over 4280488.51 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:21:56,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1087038.0, ans=0.125
2023-06-21 23:21:58,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0
2023-06-21 23:22:07,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.119e+02 3.496e+02 4.060e+02 9.079e+02, threshold=6.992e+02, percent-clipped=1.0
2023-06-21 23:23:09,090 INFO [train.py:996] (3/4) Epoch 6, batch 28750, loss[loss=0.2319, simple_loss=0.2973, pruned_loss=0.08328, over 21447.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3193, pruned_loss=0.09259, over 4284143.09 frames. ], batch size: 144, lr: 4.91e-03, grad_scale: 16.0
2023-06-21 23:23:19,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1087338.0, ans=0.125
2023-06-21 23:23:22,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087338.0, ans=0.1
2023-06-21 23:23:23,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1087338.0, ans=0.2
2023-06-21 23:23:45,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1087398.0, ans=0.5
2023-06-21 23:24:43,702 INFO [train.py:996] (3/4) Epoch 6, batch 28800, loss[loss=0.2925, simple_loss=0.3535, pruned_loss=0.1158, over 21840.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3212, pruned_loss=0.09219, over 4275594.28 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:25:08,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1087698.0, ans=0.125
2023-06-21 23:25:16,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.957e+02 3.291e+02 3.824e+02 6.486e+02, threshold=6.582e+02, percent-clipped=0.0
2023-06-21 23:26:21,870 INFO [train.py:996] (3/4) Epoch 6, batch 28850, loss[loss=0.2475, simple_loss=0.3153, pruned_loss=0.08986, over 21688.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3228, pruned_loss=0.09333, over 4280650.07 frames. ], batch size: 263, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:26:40,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1087998.0, ans=0.2
2023-06-21 23:26:46,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1087998.0, ans=0.2
2023-06-21 23:28:01,562 INFO [train.py:996] (3/4) Epoch 6, batch 28900, loss[loss=0.2516, simple_loss=0.3173, pruned_loss=0.09291, over 21893.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3254, pruned_loss=0.09467, over 4277787.74 frames. ], batch size: 316, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:28:26,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.260e+02 3.561e+02 4.329e+02 7.781e+02, threshold=7.122e+02, percent-clipped=1.0
2023-06-21 23:28:44,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0
2023-06-21 23:29:28,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088478.0, ans=0.1
2023-06-21 23:29:38,521 INFO [train.py:996] (3/4) Epoch 6, batch 28950, loss[loss=0.2247, simple_loss=0.3232, pruned_loss=0.06306, over 21820.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3277, pruned_loss=0.09405, over 4274182.95 frames. ], batch size: 316, lr: 4.90e-03, grad_scale: 16.0
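The grad_scale column in the train.py:996 records moves between 32.0, 16.0 and 8.0 because fp16 training (use_fp16=True in the configuration) keeps a dynamic loss scale: it is cut when a scaled gradient overflows and grown back after a run of clean steps. A toy version of that behaviour, in the spirit of torch.cuda.amp.GradScaler, with illustrative constants:

class ToyGradScaler:
    def __init__(self, scale=32.0, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= 0.5          # back off after an overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                self.scale *= 2.0      # cautiously grow back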
2023-06-21 23:30:08,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1088598.0, ans=0.04949747468305833
2023-06-21 23:30:16,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1088658.0, ans=0.125
2023-06-21 23:30:56,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088778.0, ans=0.125
2023-06-21 23:31:09,138 INFO [train.py:996] (3/4) Epoch 6, batch 29000, loss[loss=0.2797, simple_loss=0.3507, pruned_loss=0.1043, over 21580.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.33, pruned_loss=0.09358, over 4272055.68 frames. ], batch size: 414, lr: 4.90e-03, grad_scale: 16.0
2023-06-21 23:31:40,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1088898.0, ans=0.0
2023-06-21 23:31:43,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.241e+02 3.719e+02 4.877e+02 7.775e+02, threshold=7.438e+02, percent-clipped=3.0
2023-06-21 23:31:47,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1088958.0, ans=22.5
2023-06-21 23:32:31,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1089078.0, ans=0.1
2023-06-21 23:32:42,124 INFO [train.py:996] (3/4) Epoch 6, batch 29050, loss[loss=0.2303, simple_loss=0.2947, pruned_loss=0.08292, over 21528.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3293, pruned_loss=0.09403, over 4281193.83 frames. ], batch size: 194, lr: 4.90e-03, grad_scale: 16.0
2023-06-21 23:33:20,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1089258.0, ans=0.0
2023-06-21 23:34:15,073 INFO [train.py:996] (3/4) Epoch 6, batch 29100, loss[loss=0.1934, simple_loss=0.2577, pruned_loss=0.0646, over 21606.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3213, pruned_loss=0.09184, over 4283005.15 frames. ], batch size: 231, lr: 4.90e-03, grad_scale: 16.0
2023-06-21 23:34:41,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1089498.0, ans=0.0
2023-06-21 23:34:49,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.936e+02 3.266e+02 3.955e+02 6.605e+02, threshold=6.533e+02, percent-clipped=0.0
2023-06-21 23:35:48,085 INFO [train.py:996] (3/4) Epoch 6, batch 29150, loss[loss=0.241, simple_loss=0.3107, pruned_loss=0.08568, over 21772.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3203, pruned_loss=0.09045, over 4285413.09 frames. ], batch size: 371, lr: 4.90e-03, grad_scale: 16.0
2023-06-21 23:35:53,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5
2023-06-21 23:36:37,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1089858.0, ans=0.125
2023-06-21 23:37:12,323 INFO [train.py:996] (3/4) Epoch 6, batch 29200, loss[loss=0.3022, simple_loss=0.3541, pruned_loss=0.1252, over 21438.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3166, pruned_loss=0.08985, over 4276511.58 frames. ], batch size: 509, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:37:46,311 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-21 23:37:47,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.885e+02 3.375e+02 4.210e+02 7.193e+02, threshold=6.750e+02, percent-clipped=2.0
2023-06-21 23:38:50,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1090338.0, ans=0.125
2023-06-21 23:38:55,720 INFO [train.py:996] (3/4) Epoch 6, batch 29250, loss[loss=0.2238, simple_loss=0.3142, pruned_loss=0.06668, over 21763.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3152, pruned_loss=0.08793, over 4275250.18 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:39:26,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1090398.0, ans=0.07
2023-06-21 23:39:34,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1090458.0, ans=0.07
2023-06-21 23:39:51,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0
2023-06-21 23:40:29,844 INFO [train.py:996] (3/4) Epoch 6, batch 29300, loss[loss=0.2159, simple_loss=0.2798, pruned_loss=0.07603, over 21364.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3159, pruned_loss=0.08699, over 4263944.36 frames. ], batch size: 194, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:40:59,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1090698.0, ans=0.2
2023-06-21 23:41:00,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.989e+02 3.718e+02 4.652e+02 8.892e+02, threshold=7.436e+02, percent-clipped=6.0
2023-06-21 23:41:01,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1090698.0, ans=0.5
2023-06-21 23:41:03,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5
2023-06-21 23:42:00,615 INFO [train.py:996] (3/4) Epoch 6, batch 29350, loss[loss=0.221, simple_loss=0.3036, pruned_loss=0.06916, over 21248.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.311, pruned_loss=0.08602, over 4257897.93 frames. ], batch size: 176, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:42:06,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1090938.0, ans=0.0
2023-06-21 23:42:34,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1090998.0, ans=0.0
2023-06-21 23:42:35,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1090998.0, ans=0.0
2023-06-21 23:42:39,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1090998.0, ans=0.125
2023-06-21 23:42:53,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=22.5
2023-06-21 23:43:21,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0
2023-06-21 23:43:32,542 INFO [train.py:996] (3/4) Epoch 6, batch 29400, loss[loss=0.236, simple_loss=0.3298, pruned_loss=0.07104, over 21719.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3109, pruned_loss=0.08382, over 4251227.40 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:43:32,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1091238.0, ans=0.2
2023-06-21 23:43:40,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1091238.0, ans=0.125
2023-06-21 23:44:03,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.776e+02 3.211e+02 3.938e+02 7.454e+02, threshold=6.422e+02, percent-clipped=1.0
2023-06-21 23:44:27,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1091418.0, ans=0.0
2023-06-21 23:44:35,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1091418.0, ans=0.0
2023-06-21 23:44:36,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1091418.0, ans=0.125
2023-06-21 23:44:57,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091538.0, ans=0.1
2023-06-21 23:44:58,835 INFO [train.py:996] (3/4) Epoch 6, batch 29450, loss[loss=0.2476, simple_loss=0.3238, pruned_loss=0.08571, over 21722.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3105, pruned_loss=0.08278, over 4259180.34 frames. ], batch size: 332, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:45:13,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1091538.0, ans=0.125
2023-06-21 23:45:16,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1091538.0, ans=0.125
2023-06-21 23:46:00,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1091718.0, ans=0.0
2023-06-21 23:46:01,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1091718.0, ans=0.125
2023-06-21 23:46:26,835 INFO [train.py:996] (3/4) Epoch 6, batch 29500, loss[loss=0.2901, simple_loss=0.3387, pruned_loss=0.1208, over 21743.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3144, pruned_loss=0.08601, over 4262436.01 frames. ], batch size: 508, lr: 4.90e-03, grad_scale: 32.0
2023-06-21 23:47:01,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 2.999e+02 3.395e+02 3.971e+02 6.244e+02, threshold=6.790e+02, percent-clipped=0.0
2023-06-21 23:47:08,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1091958.0, ans=0.0
2023-06-21 23:47:43,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1092018.0, ans=0.125
2023-06-21 23:47:52,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1092078.0, ans=0.0
2023-06-21 23:48:05,504 INFO [train.py:996] (3/4) Epoch 6, batch 29550, loss[loss=0.2323, simple_loss=0.3041, pruned_loss=0.08031, over 21839.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3146, pruned_loss=0.0883, over 4274664.82 frames. ], batch size: 332, lr: 4.89e-03, grad_scale: 32.0
2023-06-21 23:48:38,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1092198.0, ans=0.125
2023-06-21 23:48:54,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092318.0, ans=0.1
2023-06-21 23:49:18,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1092378.0, ans=0.125
2023-06-21 23:49:20,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1092378.0, ans=0.0
2023-06-21 23:49:34,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092378.0, ans=0.1
2023-06-21 23:49:44,843 INFO [train.py:996] (3/4) Epoch 6, batch 29600, loss[loss=0.2359, simple_loss=0.317, pruned_loss=0.07745, over 21210.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3197, pruned_loss=0.09014, over 4278162.50 frames. ], batch size: 143, lr: 4.89e-03, grad_scale: 32.0
2023-06-21 23:49:51,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1092438.0, ans=0.125
2023-06-21 23:50:11,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.056e+02 3.521e+02 4.458e+02 7.696e+02, threshold=7.042e+02, percent-clipped=3.0
2023-06-21 23:50:34,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1092618.0, ans=0.125
2023-06-21 23:50:51,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1092678.0, ans=0.125
2023-06-21 23:51:17,478 INFO [train.py:996] (3/4) Epoch 6, batch 29650, loss[loss=0.2991, simple_loss=0.4176, pruned_loss=0.09025, over 19829.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3165, pruned_loss=0.08648, over 4274689.70 frames.
], batch size: 702, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:51:29,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1092738.0, ans=0.125 2023-06-21 23:51:53,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1092858.0, ans=0.125 2023-06-21 23:52:19,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1092918.0, ans=0.05 2023-06-21 23:52:45,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. limit=10.0 2023-06-21 23:52:50,736 INFO [train.py:996] (3/4) Epoch 6, batch 29700, loss[loss=0.3352, simple_loss=0.4328, pruned_loss=0.1188, over 21528.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3214, pruned_loss=0.08743, over 4272124.34 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:53:07,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1093098.0, ans=0.95 2023-06-21 23:53:07,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1093098.0, ans=0.0 2023-06-21 23:53:19,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.737e+02 3.024e+02 3.717e+02 5.941e+02, threshold=6.048e+02, percent-clipped=0.0 2023-06-21 23:54:09,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093278.0, ans=0.1 2023-06-21 23:54:24,077 INFO [train.py:996] (3/4) Epoch 6, batch 29750, loss[loss=0.2841, simple_loss=0.3734, pruned_loss=0.09744, over 21681.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3255, pruned_loss=0.08731, over 4269951.70 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:54:41,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093398.0, ans=0.1 2023-06-21 23:55:10,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1093458.0, ans=0.07 2023-06-21 23:55:19,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1093518.0, ans=0.125 2023-06-21 23:55:30,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1093518.0, ans=0.125 2023-06-21 23:55:56,888 INFO [train.py:996] (3/4) Epoch 6, batch 29800, loss[loss=0.229, simple_loss=0.3053, pruned_loss=0.07639, over 21194.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3261, pruned_loss=0.08791, over 4275308.94 frames. ], batch size: 143, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:56:01,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1093638.0, ans=0.0 2023-06-21 23:56:02,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-21 23:56:09,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. 
limit=15.0 2023-06-21 23:56:16,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1093698.0, ans=10.0 2023-06-21 23:56:25,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.711e+02 3.022e+02 3.723e+02 5.120e+02, threshold=6.044e+02, percent-clipped=0.0 2023-06-21 23:56:51,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1093818.0, ans=0.0 2023-06-21 23:57:09,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1093878.0, ans=0.0 2023-06-21 23:57:28,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093938.0, ans=0.1 2023-06-21 23:57:29,963 INFO [train.py:996] (3/4) Epoch 6, batch 29850, loss[loss=0.274, simple_loss=0.3276, pruned_loss=0.1102, over 21764.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.321, pruned_loss=0.08559, over 4275551.33 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:57:56,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-21 23:57:59,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-21 23:58:21,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1094058.0, ans=0.125 2023-06-21 23:58:37,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1094118.0, ans=0.0 2023-06-21 23:58:59,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1094178.0, ans=0.125 2023-06-21 23:59:02,887 INFO [train.py:996] (3/4) Epoch 6, batch 29900, loss[loss=0.3137, simple_loss=0.3642, pruned_loss=0.1316, over 21498.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3202, pruned_loss=0.08727, over 4282914.74 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:59:36,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 3.118e+02 4.055e+02 5.716e+02 1.068e+03, threshold=8.110e+02, percent-clipped=21.0 2023-06-22 00:00:04,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1094418.0, ans=0.125 2023-06-22 00:00:20,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-22 00:00:33,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1094478.0, ans=0.02 2023-06-22 00:00:37,267 INFO [train.py:996] (3/4) Epoch 6, batch 29950, loss[loss=0.3024, simple_loss=0.366, pruned_loss=0.1194, over 21795.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3256, pruned_loss=0.09174, over 4288576.28 frames. 
], batch size: 441, lr: 4.89e-03, grad_scale: 8.0 2023-06-22 00:00:39,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1094538.0, ans=0.125 2023-06-22 00:01:19,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1094658.0, ans=0.125 2023-06-22 00:01:34,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.24 vs. limit=10.0 2023-06-22 00:01:36,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-22 00:02:11,810 INFO [train.py:996] (3/4) Epoch 6, batch 30000, loss[loss=0.2514, simple_loss=0.3335, pruned_loss=0.0846, over 21647.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3269, pruned_loss=0.09176, over 4288144.86 frames. ], batch size: 230, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:02:11,811 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 00:02:30,093 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2467, simple_loss=0.3478, pruned_loss=0.07276, over 1796401.00 frames. 2023-06-22 00:02:30,094 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 00:03:00,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1094898.0, ans=0.0 2023-06-22 00:03:13,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-22 00:03:14,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.660e+02 3.036e+02 3.460e+02 6.733e+02, threshold=6.073e+02, percent-clipped=0.0 2023-06-22 00:03:19,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-22 00:04:17,401 INFO [train.py:996] (3/4) Epoch 6, batch 30050, loss[loss=0.3298, simple_loss=0.4257, pruned_loss=0.117, over 21525.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.329, pruned_loss=0.0874, over 4276344.07 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:05:30,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095378.0, ans=0.1 2023-06-22 00:05:39,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1095378.0, ans=0.0 2023-06-22 00:05:50,549 INFO [train.py:996] (3/4) Epoch 6, batch 30100, loss[loss=0.2433, simple_loss=0.3029, pruned_loss=0.09183, over 21631.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3301, pruned_loss=0.08764, over 4266999.72 frames. ], batch size: 333, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:06:13,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=8.0 2023-06-22 00:06:18,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. 
limit=10.0 2023-06-22 00:06:24,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.037e+02 3.799e+02 4.739e+02 8.498e+02, threshold=7.598e+02, percent-clipped=11.0 2023-06-22 00:06:24,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1095558.0, ans=0.0 2023-06-22 00:06:44,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1095618.0, ans=0.0 2023-06-22 00:07:25,729 INFO [train.py:996] (3/4) Epoch 6, batch 30150, loss[loss=0.3513, simple_loss=0.3871, pruned_loss=0.1577, over 21319.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3275, pruned_loss=0.08998, over 4274377.46 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:07:53,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-22 00:08:08,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1095858.0, ans=0.125 2023-06-22 00:08:10,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1095858.0, ans=0.2 2023-06-22 00:08:26,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1095918.0, ans=10.0 2023-06-22 00:08:37,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-22 00:09:02,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1095978.0, ans=0.025 2023-06-22 00:09:07,436 INFO [train.py:996] (3/4) Epoch 6, batch 30200, loss[loss=0.2591, simple_loss=0.3447, pruned_loss=0.08669, over 21748.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3284, pruned_loss=0.0881, over 4272836.01 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:09:12,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1096038.0, ans=0.125 2023-06-22 00:09:46,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 3.001e+02 3.567e+02 4.107e+02 7.558e+02, threshold=7.134e+02, percent-clipped=0.0 2023-06-22 00:09:46,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1096158.0, ans=0.2 2023-06-22 00:10:03,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1096158.0, ans=0.1 2023-06-22 00:10:28,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1096278.0, ans=0.125 2023-06-22 00:10:37,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1096278.0, ans=0.125 2023-06-22 00:10:37,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1096278.0, ans=0.0 2023-06-22 00:10:42,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=22.5 2023-06-22 00:10:43,115 INFO [train.py:996] (3/4) Epoch 6, batch 30250, loss[loss=0.3068, simple_loss=0.395, pruned_loss=0.1093, over 21313.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3352, pruned_loss=0.08999, over 4273666.98 frames. ], batch size: 549, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:11:16,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.65 vs. limit=15.0 2023-06-22 00:11:28,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096458.0, ans=0.1 2023-06-22 00:11:30,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1096458.0, ans=0.125 2023-06-22 00:11:36,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1096458.0, ans=0.05 2023-06-22 00:12:10,094 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:12:17,395 INFO [train.py:996] (3/4) Epoch 6, batch 30300, loss[loss=0.2097, simple_loss=0.2753, pruned_loss=0.07201, over 21924.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3309, pruned_loss=0.08982, over 4279761.87 frames. ], batch size: 119, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:12:40,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096698.0, ans=0.1 2023-06-22 00:12:58,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1096698.0, ans=0.125 2023-06-22 00:13:00,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.155e+02 3.767e+02 4.351e+02 8.059e+02, threshold=7.534e+02, percent-clipped=2.0 2023-06-22 00:13:02,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1096758.0, ans=0.2 2023-06-22 00:13:26,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1096818.0, ans=0.125 2023-06-22 00:13:36,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1096818.0, ans=0.125 2023-06-22 00:13:40,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1096878.0, ans=0.05 2023-06-22 00:13:57,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1096938.0, ans=0.0 2023-06-22 00:14:03,142 INFO [train.py:996] (3/4) Epoch 6, batch 30350, loss[loss=0.3022, simple_loss=0.3794, pruned_loss=0.1125, over 21547.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3324, pruned_loss=0.09179, over 4282389.49 frames. ], batch size: 441, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:14:49,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1097118.0, ans=0.125 2023-06-22 00:15:21,231 INFO [train.py:996] (3/4) Epoch 6, batch 30400, loss[loss=0.2328, simple_loss=0.284, pruned_loss=0.0908, over 20118.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3263, pruned_loss=0.08998, over 4271425.90 frames. 
], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-22 00:15:21,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1097238.0, ans=0.125 2023-06-22 00:15:39,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-22 00:15:50,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.413e+02 4.097e+02 5.278e+02 1.616e+03, threshold=8.194e+02, percent-clipped=3.0 2023-06-22 00:16:09,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1097418.0, ans=0.07 2023-06-22 00:16:39,256 INFO [train.py:996] (3/4) Epoch 6, batch 30450, loss[loss=0.3281, simple_loss=0.4389, pruned_loss=0.1087, over 19755.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3286, pruned_loss=0.09012, over 4209705.33 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:16:40,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1097538.0, ans=0.125 2023-06-22 00:17:28,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.53 vs. limit=15.0 2023-06-22 00:19:20,711 INFO [train.py:996] (3/4) Epoch 7, batch 0, loss[loss=0.2211, simple_loss=0.288, pruned_loss=0.07711, over 21277.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.288, pruned_loss=0.07711, over 21277.00 frames. ], batch size: 551, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:19:20,712 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 00:19:38,896 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2422, simple_loss=0.3486, pruned_loss=0.06787, over 1796401.00 frames. 2023-06-22 00:19:38,897 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 00:20:03,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-22 00:20:13,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1097862.0, ans=0.0 2023-06-22 00:20:26,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 4.648e+02 5.934e+02 9.527e+02 2.892e+03, threshold=1.187e+03, percent-clipped=31.0 2023-06-22 00:20:46,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1098042.0, ans=0.2 2023-06-22 00:21:02,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1098042.0, ans=0.0 2023-06-22 00:21:07,550 INFO [train.py:996] (3/4) Epoch 7, batch 50, loss[loss=0.2874, simple_loss=0.3586, pruned_loss=0.1081, over 21388.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.326, pruned_loss=0.08982, over 965558.97 frames. ], batch size: 471, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:21:48,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1098222.0, ans=0.125 2023-06-22 00:22:43,747 INFO [train.py:996] (3/4) Epoch 7, batch 100, loss[loss=0.2604, simple_loss=0.3478, pruned_loss=0.08652, over 19926.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3439, pruned_loss=0.09256, over 1699499.32 frames. 
], batch size: 702, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:22:54,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1098402.0, ans=0.035 2023-06-22 00:23:08,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1098462.0, ans=0.125 2023-06-22 00:23:37,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.827e+02 3.336e+02 3.937e+02 6.913e+02, threshold=6.673e+02, percent-clipped=0.0 2023-06-22 00:23:58,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-22 00:24:19,886 INFO [train.py:996] (3/4) Epoch 7, batch 150, loss[loss=0.2687, simple_loss=0.3528, pruned_loss=0.09228, over 21744.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3469, pruned_loss=0.09279, over 2263912.74 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:24:31,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1098702.0, ans=0.1 2023-06-22 00:25:10,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1098822.0, ans=0.125 2023-06-22 00:25:29,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1098882.0, ans=0.1 2023-06-22 00:25:34,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1098882.0, ans=0.125 2023-06-22 00:25:39,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-22 00:25:58,211 INFO [train.py:996] (3/4) Epoch 7, batch 200, loss[loss=0.2822, simple_loss=0.3737, pruned_loss=0.09529, over 21674.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.343, pruned_loss=0.0919, over 2700777.93 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:26:11,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099002.0, ans=0.1 2023-06-22 00:26:13,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-22 00:26:31,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1099062.0, ans=0.07 2023-06-22 00:26:52,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1099122.0, ans=0.0 2023-06-22 00:26:56,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.932e+02 3.409e+02 3.929e+02 8.481e+02, threshold=6.818e+02, percent-clipped=3.0 2023-06-22 00:27:07,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1099182.0, ans=0.125 2023-06-22 00:27:19,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1099242.0, ans=0.125 2023-06-22 00:27:36,476 INFO [train.py:996] (3/4) Epoch 7, batch 250, loss[loss=0.317, simple_loss=0.3511, pruned_loss=0.1415, over 21683.00 frames. 
], tot_loss[loss=0.2627, simple_loss=0.3396, pruned_loss=0.09291, over 3045991.44 frames. ], batch size: 507, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:27:38,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1099302.0, ans=0.0 2023-06-22 00:27:53,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1099302.0, ans=0.2 2023-06-22 00:29:14,500 INFO [train.py:996] (3/4) Epoch 7, batch 300, loss[loss=0.2271, simple_loss=0.2894, pruned_loss=0.08242, over 21618.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3332, pruned_loss=0.0908, over 3315920.38 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:29:27,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099602.0, ans=0.1 2023-06-22 00:30:11,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.968e+02 3.407e+02 3.987e+02 5.179e+02, threshold=6.813e+02, percent-clipped=0.0 2023-06-22 00:30:11,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1099722.0, ans=0.125 2023-06-22 00:30:23,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1099782.0, ans=0.07 2023-06-22 00:30:52,937 INFO [train.py:996] (3/4) Epoch 7, batch 350, loss[loss=0.3119, simple_loss=0.3828, pruned_loss=0.1204, over 21724.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3268, pruned_loss=0.08963, over 3528978.94 frames. ], batch size: 441, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:30:57,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-22 00:31:17,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1099962.0, ans=0.2 2023-06-22 00:31:25,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1099962.0, ans=0.125 2023-06-22 00:31:38,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.96 vs. limit=10.0 2023-06-22 00:31:45,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-22 00:32:16,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1100142.0, ans=0.125 2023-06-22 00:32:36,677 INFO [train.py:996] (3/4) Epoch 7, batch 400, loss[loss=0.2352, simple_loss=0.3438, pruned_loss=0.06326, over 19848.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3218, pruned_loss=0.08737, over 3694697.33 frames. 
], batch size: 703, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:33:29,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.211e+02 3.756e+02 4.853e+02 8.203e+02, threshold=7.513e+02, percent-clipped=4.0 2023-06-22 00:34:11,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1100442.0, ans=0.125 2023-06-22 00:34:15,907 INFO [train.py:996] (3/4) Epoch 7, batch 450, loss[loss=0.2877, simple_loss=0.3913, pruned_loss=0.09207, over 21738.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3206, pruned_loss=0.08663, over 3827241.21 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:35:02,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-22 00:35:14,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1100682.0, ans=0.125 2023-06-22 00:35:19,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1100682.0, ans=0.07 2023-06-22 00:35:29,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1100682.0, ans=0.125 2023-06-22 00:35:33,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100742.0, ans=0.1 2023-06-22 00:36:00,496 INFO [train.py:996] (3/4) Epoch 7, batch 500, loss[loss=0.2093, simple_loss=0.2747, pruned_loss=0.07199, over 21730.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3184, pruned_loss=0.08533, over 3933220.92 frames. ], batch size: 112, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:36:11,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 00:36:38,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1100922.0, ans=0.125 2023-06-22 00:36:41,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1100922.0, ans=0.125 2023-06-22 00:36:48,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-22 00:36:49,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1100922.0, ans=0.0 2023-06-22 00:36:50,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.984e+02 3.762e+02 4.525e+02 7.787e+02, threshold=7.525e+02, percent-clipped=1.0 2023-06-22 00:37:12,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1100982.0, ans=0.0 2023-06-22 00:37:41,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1101102.0, ans=0.0 2023-06-22 00:37:42,354 INFO [train.py:996] (3/4) Epoch 7, batch 550, loss[loss=0.2665, simple_loss=0.3221, pruned_loss=0.1055, over 21905.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3175, pruned_loss=0.08387, over 4007513.59 frames. 
], batch size: 107, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:38:12,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1101162.0, ans=0.125 2023-06-22 00:38:45,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1101282.0, ans=0.025 2023-06-22 00:39:09,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=22.5 2023-06-22 00:39:19,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1101402.0, ans=0.2 2023-06-22 00:39:20,769 INFO [train.py:996] (3/4) Epoch 7, batch 600, loss[loss=0.2462, simple_loss=0.307, pruned_loss=0.09266, over 21511.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3227, pruned_loss=0.08481, over 4071251.19 frames. ], batch size: 194, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:39:27,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1101402.0, ans=0.2 2023-06-22 00:39:29,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1101402.0, ans=0.0 2023-06-22 00:39:37,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1101462.0, ans=10.0 2023-06-22 00:39:56,827 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:40:02,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-22 00:40:10,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.944e+02 3.479e+02 4.173e+02 5.834e+02, threshold=6.959e+02, percent-clipped=0.0 2023-06-22 00:40:10,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1101522.0, ans=0.0 2023-06-22 00:40:50,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101642.0, ans=0.1 2023-06-22 00:41:00,139 INFO [train.py:996] (3/4) Epoch 7, batch 650, loss[loss=0.2433, simple_loss=0.3308, pruned_loss=0.07794, over 21356.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3259, pruned_loss=0.0848, over 4124257.53 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:41:31,629 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:41:33,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1101762.0, ans=0.125 2023-06-22 00:41:37,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1101822.0, ans=0.0 2023-06-22 00:42:25,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1101942.0, ans=0.0 2023-06-22 00:42:38,116 INFO [train.py:996] (3/4) Epoch 7, batch 700, loss[loss=0.292, simple_loss=0.3572, pruned_loss=0.1134, over 21806.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.327, pruned_loss=0.08603, over 4151245.32 frames. 
], batch size: 112, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:42:56,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1102002.0, ans=0.2 2023-06-22 00:43:27,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.306e+02 4.325e+02 5.540e+02 9.236e+02, threshold=8.651e+02, percent-clipped=10.0 2023-06-22 00:43:28,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1102122.0, ans=0.04949747468305833 2023-06-22 00:44:04,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1102242.0, ans=6.0 2023-06-22 00:44:16,054 INFO [train.py:996] (3/4) Epoch 7, batch 750, loss[loss=0.2553, simple_loss=0.3809, pruned_loss=0.06486, over 19782.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3266, pruned_loss=0.08648, over 4180787.20 frames. ], batch size: 702, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:44:16,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1102302.0, ans=0.0 2023-06-22 00:44:19,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102302.0, ans=0.1 2023-06-22 00:44:54,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1102422.0, ans=0.04949747468305833 2023-06-22 00:45:25,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-22 00:45:32,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102542.0, ans=0.1 2023-06-22 00:45:53,745 INFO [train.py:996] (3/4) Epoch 7, batch 800, loss[loss=0.228, simple_loss=0.3002, pruned_loss=0.07789, over 21314.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3246, pruned_loss=0.08738, over 4209308.79 frames. ], batch size: 131, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:46:07,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-22 00:46:34,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-22 00:46:42,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.353e+02 3.931e+02 5.238e+02 1.056e+03, threshold=7.862e+02, percent-clipped=1.0 2023-06-22 00:46:47,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1102782.0, ans=0.125 2023-06-22 00:47:31,038 INFO [train.py:996] (3/4) Epoch 7, batch 850, loss[loss=0.2314, simple_loss=0.2821, pruned_loss=0.09038, over 21533.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3225, pruned_loss=0.08749, over 4229851.23 frames. 
], batch size: 263, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:47:50,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1102962.0, ans=0.125 2023-06-22 00:49:04,177 INFO [train.py:996] (3/4) Epoch 7, batch 900, loss[loss=0.2115, simple_loss=0.2902, pruned_loss=0.06638, over 21164.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3156, pruned_loss=0.08545, over 4234873.52 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:49:33,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-06-22 00:49:42,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1103322.0, ans=0.125 2023-06-22 00:49:53,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.788e+02 3.273e+02 3.960e+02 6.263e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-22 00:50:48,000 INFO [train.py:996] (3/4) Epoch 7, batch 950, loss[loss=0.2526, simple_loss=0.3092, pruned_loss=0.09797, over 21660.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3149, pruned_loss=0.08547, over 4248435.03 frames. ], batch size: 333, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:51:01,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1103502.0, ans=0.125 2023-06-22 00:51:02,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1103562.0, ans=0.5 2023-06-22 00:51:55,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-22 00:52:26,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-22 00:52:26,736 INFO [train.py:996] (3/4) Epoch 7, batch 1000, loss[loss=0.1699, simple_loss=0.2557, pruned_loss=0.04202, over 21340.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.0861, over 4257394.30 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:53:25,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 2.971e+02 3.488e+02 4.258e+02 7.403e+02, threshold=6.977e+02, percent-clipped=1.0 2023-06-22 00:53:27,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1103982.0, ans=0.04949747468305833 2023-06-22 00:54:03,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-22 00:54:07,777 INFO [train.py:996] (3/4) Epoch 7, batch 1050, loss[loss=0.2499, simple_loss=0.3077, pruned_loss=0.09601, over 21313.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3153, pruned_loss=0.08672, over 4267516.25 frames. 
], batch size: 176, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:54:08,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1104102.0, ans=0.125 2023-06-22 00:54:11,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1104102.0, ans=0.125 2023-06-22 00:55:09,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1104282.0, ans=0.2 2023-06-22 00:55:14,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1104282.0, ans=0.0 2023-06-22 00:55:47,352 INFO [train.py:996] (3/4) Epoch 7, batch 1100, loss[loss=0.2339, simple_loss=0.3193, pruned_loss=0.07425, over 21744.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3136, pruned_loss=0.08566, over 4269074.76 frames. ], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:55:55,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1104402.0, ans=0.5 2023-06-22 00:56:33,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-22 00:56:40,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1104522.0, ans=0.125 2023-06-22 00:56:48,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.865e+02 3.379e+02 3.962e+02 8.205e+02, threshold=6.758e+02, percent-clipped=2.0 2023-06-22 00:56:50,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1104582.0, ans=0.09899494936611666 2023-06-22 00:56:50,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104582.0, ans=0.1 2023-06-22 00:56:53,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1104582.0, ans=0.125 2023-06-22 00:56:55,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-22 00:56:58,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1104582.0, ans=0.125 2023-06-22 00:56:58,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104582.0, ans=0.1 2023-06-22 00:57:20,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-22 00:57:27,374 INFO [train.py:996] (3/4) Epoch 7, batch 1150, loss[loss=0.2396, simple_loss=0.3168, pruned_loss=0.08125, over 21573.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3128, pruned_loss=0.08464, over 4269430.19 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:58:01,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.25 vs. 
limit=15.0 2023-06-22 00:58:16,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1104822.0, ans=0.125 2023-06-22 00:59:05,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1104942.0, ans=0.125 2023-06-22 00:59:08,527 INFO [train.py:996] (3/4) Epoch 7, batch 1200, loss[loss=0.2612, simple_loss=0.3305, pruned_loss=0.09596, over 21768.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3146, pruned_loss=0.0856, over 4272863.10 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:59:39,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1105062.0, ans=0.125 2023-06-22 00:59:49,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1105062.0, ans=0.125 2023-06-22 01:00:07,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1105122.0, ans=0.125 2023-06-22 01:00:10,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.150e+02 3.657e+02 4.212e+02 7.667e+02, threshold=7.313e+02, percent-clipped=2.0 2023-06-22 01:00:48,902 INFO [train.py:996] (3/4) Epoch 7, batch 1250, loss[loss=0.2525, simple_loss=0.3225, pruned_loss=0.09128, over 21835.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3165, pruned_loss=0.08689, over 4276049.62 frames. ], batch size: 351, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:01:35,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1105422.0, ans=0.125 2023-06-22 01:02:28,453 INFO [train.py:996] (3/4) Epoch 7, batch 1300, loss[loss=0.2585, simple_loss=0.3201, pruned_loss=0.09847, over 21928.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3181, pruned_loss=0.08715, over 4280126.34 frames. ], batch size: 113, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:02:28,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1105602.0, ans=0.2 2023-06-22 01:02:49,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1105602.0, ans=0.2 2023-06-22 01:02:57,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1105662.0, ans=0.0 2023-06-22 01:03:17,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1105722.0, ans=0.125 2023-06-22 01:03:20,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1105722.0, ans=0.0 2023-06-22 01:03:26,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1105722.0, ans=0.125 2023-06-22 01:03:36,715 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.032e+02 3.671e+02 4.552e+02 8.321e+02, threshold=7.341e+02, percent-clipped=3.0 2023-06-22 01:04:13,372 INFO [train.py:996] (3/4) Epoch 7, batch 1350, loss[loss=0.2118, simple_loss=0.2855, pruned_loss=0.069, over 21817.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3198, pruned_loss=0.08837, over 4282603.61 frames. 
], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:04:59,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 01:05:04,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-22 01:05:05,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1106022.0, ans=0.2 2023-06-22 01:05:22,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1106082.0, ans=0.0 2023-06-22 01:05:22,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1106082.0, ans=0.2 2023-06-22 01:05:51,560 INFO [train.py:996] (3/4) Epoch 7, batch 1400, loss[loss=0.2224, simple_loss=0.2951, pruned_loss=0.07486, over 21651.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3181, pruned_loss=0.08796, over 4281729.73 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:06:04,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1106202.0, ans=0.0 2023-06-22 01:06:25,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.67 vs. limit=22.5 2023-06-22 01:06:54,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.431e+02 3.168e+02 3.476e+02 4.011e+02 7.450e+02, threshold=6.951e+02, percent-clipped=1.0 2023-06-22 01:07:18,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-22 01:07:26,208 INFO [train.py:996] (3/4) Epoch 7, batch 1450, loss[loss=0.2907, simple_loss=0.3608, pruned_loss=0.1103, over 21372.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3196, pruned_loss=0.08906, over 4275154.01 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:07:42,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1106502.0, ans=0.2 2023-06-22 01:07:49,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1106502.0, ans=0.0 2023-06-22 01:07:58,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1106562.0, ans=0.125 2023-06-22 01:08:14,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1106622.0, ans=0.125 2023-06-22 01:08:19,554 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:08:47,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1106682.0, ans=0.0 2023-06-22 01:08:52,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1106742.0, ans=0.125 2023-06-22 01:09:11,597 INFO [train.py:996] (3/4) Epoch 7, batch 1500, loss[loss=0.23, simple_loss=0.3038, pruned_loss=0.07814, over 17582.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3219, pruned_loss=0.08991, over 4275478.08 frames. 
], batch size: 60, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:10:15,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.952e+02 3.357e+02 3.782e+02 8.287e+02, threshold=6.713e+02, percent-clipped=2.0 2023-06-22 01:11:03,387 INFO [train.py:996] (3/4) Epoch 7, batch 1550, loss[loss=0.2681, simple_loss=0.3343, pruned_loss=0.101, over 21501.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3187, pruned_loss=0.08772, over 4276480.70 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:11:42,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1107222.0, ans=0.0 2023-06-22 01:11:42,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107222.0, ans=0.1 2023-06-22 01:12:00,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1107282.0, ans=0.07 2023-06-22 01:12:04,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-22 01:12:43,970 INFO [train.py:996] (3/4) Epoch 7, batch 1600, loss[loss=0.3272, simple_loss=0.3724, pruned_loss=0.141, over 21340.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3169, pruned_loss=0.08682, over 4277013.43 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 01:13:36,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107582.0, ans=0.1 2023-06-22 01:13:39,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.079e+02 3.590e+02 4.621e+02 8.115e+02, threshold=7.180e+02, percent-clipped=4.0 2023-06-22 01:13:43,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1107582.0, ans=0.125 2023-06-22 01:13:50,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1107582.0, ans=0.125 2023-06-22 01:14:01,979 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:14:18,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-22 01:14:25,768 INFO [train.py:996] (3/4) Epoch 7, batch 1650, loss[loss=0.2286, simple_loss=0.2991, pruned_loss=0.07906, over 21159.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3181, pruned_loss=0.08656, over 4268972.02 frames. ], batch size: 608, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:14:30,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1107702.0, ans=0.2 2023-06-22 01:15:19,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1107882.0, ans=0.125 2023-06-22 01:16:07,479 INFO [train.py:996] (3/4) Epoch 7, batch 1700, loss[loss=0.2517, simple_loss=0.3438, pruned_loss=0.07982, over 21856.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3197, pruned_loss=0.0874, over 4270578.61 frames. 
2023-06-22 01:16:39,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1108062.0, ans=0.125
2023-06-22 01:17:13,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.011e+02 3.603e+02 4.344e+02 6.909e+02, threshold=7.205e+02, percent-clipped=0.0
2023-06-22 01:17:33,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0
2023-06-22 01:17:54,141 INFO [train.py:996] (3/4) Epoch 7, batch 1750, loss[loss=0.1813, simple_loss=0.2564, pruned_loss=0.05311, over 21264.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3182, pruned_loss=0.08594, over 4262940.71 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:18:03,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108302.0, ans=0.1
2023-06-22 01:18:06,736 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:18:58,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108482.0, ans=0.1
2023-06-22 01:19:00,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1108482.0, ans=0.125
2023-06-22 01:19:04,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1108482.0, ans=0.125
2023-06-22 01:19:12,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1108482.0, ans=0.0
2023-06-22 01:19:36,773 INFO [train.py:996] (3/4) Epoch 7, batch 1800, loss[loss=0.2827, simple_loss=0.3615, pruned_loss=0.102, over 21748.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3174, pruned_loss=0.08339, over 4270292.63 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:20:36,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.999e+02 3.807e+02 4.640e+02 8.092e+02, threshold=7.614e+02, percent-clipped=1.0
2023-06-22 01:20:40,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1108782.0, ans=0.125
2023-06-22 01:21:12,513 INFO [train.py:996] (3/4) Epoch 7, batch 1850, loss[loss=0.2574, simple_loss=0.3328, pruned_loss=0.091, over 21841.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3162, pruned_loss=0.08158, over 4276110.86 frames. ], batch size: 371, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:21:42,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1108962.0, ans=0.2
2023-06-22 01:22:49,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1109142.0, ans=0.0
2023-06-22 01:22:49,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1109142.0, ans=0.125
2023-06-22 01:22:51,992 INFO [train.py:996] (3/4) Epoch 7, batch 1900, loss[loss=0.2878, simple_loss=0.351, pruned_loss=0.1123, over 21637.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3164, pruned_loss=0.08163, over 4284325.57 frames. ], batch size: 507, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:22:56,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1109202.0, ans=0.125
2023-06-22 01:23:06,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1109262.0, ans=0.125
2023-06-22 01:23:22,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1109262.0, ans=0.125
2023-06-22 01:23:55,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.011e+02 3.315e+02 4.225e+02 7.544e+02, threshold=6.631e+02, percent-clipped=0.0
2023-06-22 01:24:01,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1109382.0, ans=0.125
2023-06-22 01:24:31,559 INFO [train.py:996] (3/4) Epoch 7, batch 1950, loss[loss=0.2818, simple_loss=0.3388, pruned_loss=0.1124, over 21532.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3129, pruned_loss=0.08135, over 4289498.07 frames. ], batch size: 212, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:25:14,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1109622.0, ans=0.0
2023-06-22 01:25:58,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1109742.0, ans=0.0
2023-06-22 01:26:08,021 INFO [train.py:996] (3/4) Epoch 7, batch 2000, loss[loss=0.3036, simple_loss=0.4077, pruned_loss=0.09969, over 21183.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3092, pruned_loss=0.07993, over 4281151.56 frames. ], batch size: 548, lr: 4.46e-03, grad_scale: 32.0
2023-06-22 01:26:34,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1109862.0, ans=0.2
2023-06-22 01:27:08,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.011e+02 3.534e+02 4.204e+02 7.079e+02, threshold=7.069e+02, percent-clipped=1.0
2023-06-22 01:27:32,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1110042.0, ans=0.125
2023-06-22 01:27:37,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1110042.0, ans=0.125
2023-06-22 01:27:37,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0
2023-06-22 01:27:43,210 INFO [train.py:996] (3/4) Epoch 7, batch 2050, loss[loss=0.2563, simple_loss=0.3239, pruned_loss=0.09434, over 21656.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3122, pruned_loss=0.08166, over 4286191.98 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 32.0
2023-06-22 01:28:28,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5
2023-06-22 01:29:28,337 INFO [train.py:996] (3/4) Epoch 7, batch 2100, loss[loss=0.2328, simple_loss=0.3034, pruned_loss=0.08106, over 21600.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.315, pruned_loss=0.08346, over 4282519.19 frames. ], batch size: 414, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:29:55,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1110462.0, ans=0.125
2023-06-22 01:30:28,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1110582.0, ans=0.2
2023-06-22 01:30:34,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.462e+02 4.025e+02 4.907e+02 9.309e+02, threshold=8.051e+02, percent-clipped=5.0
2023-06-22 01:31:08,535 INFO [train.py:996] (3/4) Epoch 7, batch 2150, loss[loss=0.2778, simple_loss=0.3166, pruned_loss=0.1195, over 21474.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3168, pruned_loss=0.08605, over 4288430.26 frames. ], batch size: 511, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:31:10,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110702.0, ans=0.1
2023-06-22 01:31:17,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1110702.0, ans=0.125
2023-06-22 01:31:44,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1110762.0, ans=0.125
2023-06-22 01:32:09,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1110882.0, ans=0.0
2023-06-22 01:32:10,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110882.0, ans=0.1
2023-06-22 01:32:24,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0
2023-06-22 01:32:45,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110942.0, ans=0.1
2023-06-22 01:32:48,364 INFO [train.py:996] (3/4) Epoch 7, batch 2200, loss[loss=0.214, simple_loss=0.2938, pruned_loss=0.06708, over 21723.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3216, pruned_loss=0.08574, over 4285688.67 frames. ], batch size: 247, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:32:59,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1111002.0, ans=0.2
2023-06-22 01:33:30,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0
2023-06-22 01:33:48,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.551e+02 3.115e+02 3.820e+02 5.117e+02 8.192e+02, threshold=7.640e+02, percent-clipped=1.0
2023-06-22 01:33:57,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1111182.0, ans=0.125
2023-06-22 01:34:27,178 INFO [train.py:996] (3/4) Epoch 7, batch 2250, loss[loss=0.2239, simple_loss=0.3063, pruned_loss=0.07081, over 21432.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3179, pruned_loss=0.08446, over 4278947.91 frames. ], batch size: 211, lr: 4.46e-03, grad_scale: 16.0
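Each scaling.py:182 line above prints a named ScheduledFloat whose value ("ans") is a function of batch_count: dropout rates, skip rates, and balancer probabilities are annealed as training progresses. A plausible sketch of such a schedule as piecewise-linear interpolation over batch count (the breakpoints below are invented for illustration; the actual schedules live in scaling.py and may differ):

    import bisect

    class ScheduledFloatSketch:
        # A scalar defined by (batch_count, value) breakpoints; linearly
        # interpolated between them and held constant outside them.
        def __init__(self, *points):
            self.xs = [x for x, _ in points]
            self.ys = [y for _, y in points]

        def __call__(self, batch_count):
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches
    # would print ans=0.0 at the batch counts seen above:
    conv_skip_rate = ScheduledFloatSketch((0.0, 0.5), (20000.0, 0.0))
    print(conv_skip_rate(1111302.0))  # 0.0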
2023-06-22 01:34:38,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1111302.0, ans=0.125
2023-06-22 01:34:38,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1111302.0, ans=0.0
2023-06-22 01:35:03,003 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:35:05,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0
2023-06-22 01:35:43,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1111542.0, ans=0.125
2023-06-22 01:35:45,072 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.569e-03
2023-06-22 01:36:02,572 INFO [train.py:996] (3/4) Epoch 7, batch 2300, loss[loss=0.211, simple_loss=0.2713, pruned_loss=0.07532, over 21764.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3132, pruned_loss=0.08372, over 4279382.02 frames. ], batch size: 112, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:37:09,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.004e+02 3.529e+02 4.212e+02 9.324e+02, threshold=7.058e+02, percent-clipped=1.0
2023-06-22 01:37:22,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5
2023-06-22 01:37:42,524 INFO [train.py:996] (3/4) Epoch 7, batch 2350, loss[loss=0.2473, simple_loss=0.3132, pruned_loss=0.09067, over 21283.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3119, pruned_loss=0.0851, over 4276807.35 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0
2023-06-22 01:37:42,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1111902.0, ans=0.2
2023-06-22 01:38:10,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1111962.0, ans=0.125
2023-06-22 01:39:16,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5
2023-06-22 01:39:17,639 INFO [train.py:996] (3/4) Epoch 7, batch 2400, loss[loss=0.2104, simple_loss=0.3016, pruned_loss=0.05961, over 21683.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3145, pruned_loss=0.08682, over 4283148.47 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 32.0
2023-06-22 01:39:21,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112202.0, ans=0.1
2023-06-22 01:39:47,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1112262.0, ans=0.125
2023-06-22 01:39:55,806 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 01:40:25,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.130e+02 3.615e+02 4.219e+02 6.751e+02, threshold=7.231e+02, percent-clipped=0.0
2023-06-22 01:40:30,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1112382.0, ans=0.0
2023-06-22 01:40:54,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1112442.0, ans=0.125
2023-06-22 01:40:57,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1112502.0, ans=0.2
2023-06-22 01:40:59,120 INFO [train.py:996] (3/4) Epoch 7, batch 2450, loss[loss=0.2608, simple_loss=0.3201, pruned_loss=0.1007, over 15757.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3181, pruned_loss=0.0878, over 4272604.24 frames. ], batch size: 62, lr: 4.46e-03, grad_scale: 32.0
2023-06-22 01:41:01,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-06-22 01:41:02,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1112502.0, ans=0.0
2023-06-22 01:41:06,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1112502.0, ans=0.0
2023-06-22 01:41:27,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1112562.0, ans=0.04949747468305833
2023-06-22 01:41:41,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112622.0, ans=0.1
2023-06-22 01:42:40,146 INFO [train.py:996] (3/4) Epoch 7, batch 2500, loss[loss=0.2694, simple_loss=0.3511, pruned_loss=0.09385, over 21579.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3153, pruned_loss=0.08659, over 4275399.08 frames. ], batch size: 414, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:42:53,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112802.0, ans=0.1
2023-06-22 01:43:25,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1112922.0, ans=0.0
2023-06-22 01:43:48,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.102e+02 3.610e+02 4.513e+02 8.483e+02, threshold=7.220e+02, percent-clipped=3.0
2023-06-22 01:44:12,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1113042.0, ans=0.2
2023-06-22 01:44:18,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1113042.0, ans=0.125
2023-06-22 01:44:21,322 INFO [train.py:996] (3/4) Epoch 7, batch 2550, loss[loss=0.226, simple_loss=0.302, pruned_loss=0.07504, over 21330.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.315, pruned_loss=0.08583, over 4259160.12 frames. ], batch size: 144, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:44:54,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5
2023-06-22 01:45:40,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1113342.0, ans=0.125
2023-06-22 01:45:41,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1113342.0, ans=0.125
2023-06-22 01:45:57,670 INFO [train.py:996] (3/4) Epoch 7, batch 2600, loss[loss=0.299, simple_loss=0.3604, pruned_loss=0.1188, over 21814.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3174, pruned_loss=0.0874, over 4260196.30 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:47:06,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.288e+02 3.616e+02 4.316e+02 7.089e+02, threshold=7.232e+02, percent-clipped=0.0
2023-06-22 01:47:39,072 INFO [train.py:996] (3/4) Epoch 7, batch 2650, loss[loss=0.2002, simple_loss=0.2573, pruned_loss=0.0715, over 21183.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3172, pruned_loss=0.08778, over 4266402.40 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:48:06,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1113762.0, ans=0.125
2023-06-22 01:48:15,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1113762.0, ans=0.2
2023-06-22 01:48:21,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.30 vs. limit=12.0
2023-06-22 01:48:53,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1113882.0, ans=0.125
2023-06-22 01:49:19,712 INFO [train.py:996] (3/4) Epoch 7, batch 2700, loss[loss=0.2436, simple_loss=0.3396, pruned_loss=0.07384, over 21338.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3145, pruned_loss=0.08666, over 4271855.74 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0
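The scaling.py:962 lines compare a per-module whitening "metric" against a limit, flagging activations whose channel covariance is far from white. One plausible way to compute such a metric is the normalized spread of covariance eigenvalues, which is 1.0 for perfectly white features; this is an assumption for illustration, and the exact formula in scaling.py may differ:

    import torch

    def whitening_metric(x, num_groups=1):
        # Sketch: C * sum(eig_i^2) / (sum(eig_i))^2 over the channel
        # covariance of each group of a (num_frames, num_channels) activation
        # matrix. Equals 1.0 when all eigenvalues are equal and grows as the
        # spectrum becomes lopsided, so a line like "metric=5.86 vs.
        # limit=15.0" reads as "unbalanced, but within tolerance".
        n, c = x.shape
        assert c % num_groups == 0
        metrics = []
        for g in x.reshape(n, num_groups, c // num_groups).unbind(dim=1):
            g = g - g.mean(dim=0, keepdim=True)
            eigs = torch.linalg.eigvalsh(g.t() @ g / n)
            metrics.append(g.shape[1] * (eigs ** 2).sum() / eigs.sum() ** 2)
        return torch.stack(metrics).mean()

    print(whitening_metric(torch.randn(1024, 256)))  # near 1.0 for white noise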
2023-06-22 01:49:20,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1114002.0, ans=0.09899494936611666
2023-06-22 01:49:49,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114062.0, ans=0.1
2023-06-22 01:49:49,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1114062.0, ans=0.125
2023-06-22 01:50:03,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0
2023-06-22 01:50:10,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1114122.0, ans=0.2
2023-06-22 01:50:28,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 2.952e+02 3.435e+02 4.194e+02 7.834e+02, threshold=6.870e+02, percent-clipped=2.0
2023-06-22 01:51:00,682 INFO [train.py:996] (3/4) Epoch 7, batch 2750, loss[loss=0.2541, simple_loss=0.3429, pruned_loss=0.08262, over 21835.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.315, pruned_loss=0.08703, over 4273301.13 frames. ], batch size: 371, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:51:39,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1114362.0, ans=0.2
2023-06-22 01:52:53,282 INFO [train.py:996] (3/4) Epoch 7, batch 2800, loss[loss=0.2735, simple_loss=0.3438, pruned_loss=0.1016, over 21609.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3202, pruned_loss=0.0882, over 4269563.54 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 32.0
2023-06-22 01:53:06,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1114602.0, ans=0.035
2023-06-22 01:53:11,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1114662.0, ans=0.125
2023-06-22 01:53:45,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1114722.0, ans=0.0
2023-06-22 01:53:57,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.206e+02 3.798e+02 4.545e+02 8.220e+02, threshold=7.596e+02, percent-clipped=2.0
2023-06-22 01:54:35,958 INFO [train.py:996] (3/4) Epoch 7, batch 2850, loss[loss=0.2357, simple_loss=0.3181, pruned_loss=0.07666, over 21738.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3207, pruned_loss=0.08871, over 4275326.71 frames. ], batch size: 391, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:54:36,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1114902.0, ans=0.125
2023-06-22 01:55:03,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0
2023-06-22 01:55:30,895 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0
2023-06-22 01:55:33,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115082.0, ans=0.1
2023-06-22 01:55:39,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1115082.0, ans=0.125
2023-06-22 01:55:39,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1115082.0, ans=0.125
2023-06-22 01:55:44,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1115082.0, ans=0.125
2023-06-22 01:55:59,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1115142.0, ans=0.0
2023-06-22 01:56:16,764 INFO [train.py:996] (3/4) Epoch 7, batch 2900, loss[loss=0.2376, simple_loss=0.3159, pruned_loss=0.07966, over 21723.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3188, pruned_loss=0.0889, over 4280691.53 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:56:25,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1115202.0, ans=0.025
2023-06-22 01:56:45,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1115262.0, ans=0.07
2023-06-22 01:57:00,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. limit=10.0
2023-06-22 01:57:10,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1115322.0, ans=0.0
2023-06-22 01:57:11,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1115322.0, ans=0.125
2023-06-22 01:57:11,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1115322.0, ans=0.125
2023-06-22 01:57:15,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1115382.0, ans=0.125
2023-06-22 01:57:22,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.184e+02 3.787e+02 4.850e+02 9.590e+02, threshold=7.574e+02, percent-clipped=4.0
2023-06-22 01:57:38,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1115442.0, ans=0.0
2023-06-22 01:57:58,390 INFO [train.py:996] (3/4) Epoch 7, batch 2950, loss[loss=0.2461, simple_loss=0.3236, pruned_loss=0.08431, over 19942.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3204, pruned_loss=0.089, over 4283313.73 frames. ], batch size: 702, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:58:15,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1115562.0, ans=0.0
2023-06-22 01:58:42,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115622.0, ans=0.1
2023-06-22 01:59:02,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1115682.0, ans=0.05
2023-06-22 01:59:39,987 INFO [train.py:996] (3/4) Epoch 7, batch 3000, loss[loss=0.2399, simple_loss=0.3508, pruned_loss=0.0645, over 20747.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.323, pruned_loss=0.08862, over 4289687.68 frames. ], batch size: 607, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 01:59:39,987 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-22 01:59:56,486 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3435, pruned_loss=0.07556, over 1796401.00 frames.
2023-06-22 01:59:56,486 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-22 02:00:00,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1115802.0, ans=0.0
2023-06-22 02:00:57,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1115922.0, ans=0.125
2023-06-22 02:01:11,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.189e+02 3.683e+02 4.814e+02 8.214e+02, threshold=7.366e+02, percent-clipped=1.0
2023-06-22 02:01:36,604 INFO [train.py:996] (3/4) Epoch 7, batch 3050, loss[loss=0.2595, simple_loss=0.3269, pruned_loss=0.09603, over 21399.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3242, pruned_loss=0.08751, over 4292293.40 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 02:01:51,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116102.0, ans=0.1
2023-06-22 02:02:11,590 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:03:04,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1116342.0, ans=0.2
2023-06-22 02:03:24,153 INFO [train.py:996] (3/4) Epoch 7, batch 3100, loss[loss=0.2761, simple_loss=0.3635, pruned_loss=0.09431, over 21475.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3216, pruned_loss=0.08635, over 4289798.58 frames. ], batch size: 471, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 02:03:54,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1116462.0, ans=0.0
2023-06-22 02:04:30,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1116582.0, ans=0.0
2023-06-22 02:04:34,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.895e+02 3.297e+02 4.092e+02 7.123e+02, threshold=6.595e+02, percent-clipped=0.0
2023-06-22 02:04:57,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1116642.0, ans=0.125
2023-06-22 02:05:11,843 INFO [train.py:996] (3/4) Epoch 7, batch 3150, loss[loss=0.3011, simple_loss=0.3666, pruned_loss=0.1178, over 21293.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3212, pruned_loss=0.08598, over 4288711.33 frames. ], batch size: 159, lr: 4.45e-03, grad_scale: 16.0
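The batch 3000 block above shows the periodic validation pass (this run validates every valid_interval=3000 batches): training pauses, a loss averaged over the fixed 1796401-frame dev set is reported, and peak GPU memory is printed. A hedged sketch of that bookkeeping; the model(batch) -> (loss, num_frames) interface is an assumption for illustration, not the train.py API:

    import torch

    def compute_validation_loss(model, valid_loader, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = model(batch)  # hypothetical interface
                tot_loss += float(loss) * num_frames
                tot_frames += num_frames
        model.train()
        # Peak memory since the start of the process, as in the
        # "Maximum memory allocated so far" line (requires CUDA).
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4g}, "
              f"over {tot_frames:.2f} frames; max memory {mem_mb}MB")
        return tot_loss / tot_frames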
2023-06-22 02:05:32,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116762.0, ans=0.1
2023-06-22 02:05:33,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116762.0, ans=0.1
2023-06-22 02:06:33,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1116942.0, ans=0.0
2023-06-22 02:06:52,841 INFO [train.py:996] (3/4) Epoch 7, batch 3200, loss[loss=0.2671, simple_loss=0.3436, pruned_loss=0.09534, over 20654.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3243, pruned_loss=0.08772, over 4286582.19 frames. ], batch size: 607, lr: 4.45e-03, grad_scale: 32.0
2023-06-22 02:08:05,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.937e+02 3.458e+02 4.160e+02 8.829e+02, threshold=6.916e+02, percent-clipped=6.0
2023-06-22 02:08:34,736 INFO [train.py:996] (3/4) Epoch 7, batch 3250, loss[loss=0.2727, simple_loss=0.3081, pruned_loss=0.1187, over 21405.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3245, pruned_loss=0.08935, over 4281576.37 frames. ], batch size: 510, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 02:08:38,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1117302.0, ans=0.2
2023-06-22 02:08:41,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1117302.0, ans=0.125
2023-06-22 02:08:42,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1117302.0, ans=0.125
2023-06-22 02:08:55,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1117302.0, ans=0.125
2023-06-22 02:08:56,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1117302.0, ans=0.125
2023-06-22 02:09:25,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1117422.0, ans=0.0
2023-06-22 02:09:38,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1117482.0, ans=0.125
2023-06-22 02:10:20,726 INFO [train.py:996] (3/4) Epoch 7, batch 3300, loss[loss=0.2225, simple_loss=0.3181, pruned_loss=0.06343, over 19987.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3203, pruned_loss=0.08883, over 4276840.71 frames. ], batch size: 703, lr: 4.45e-03, grad_scale: 16.0
2023-06-22 02:10:21,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1117602.0, ans=0.125
2023-06-22 02:10:21,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1117602.0, ans=0.2
2023-06-22 02:10:21,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0
2023-06-22 02:10:41,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1117662.0, ans=0.125
2023-06-22 02:10:49,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1117662.0, ans=0.125
2023-06-22 02:11:26,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.992e+02 3.641e+02 4.480e+02 7.487e+02, threshold=7.281e+02, percent-clipped=2.0
2023-06-22 02:12:00,459 INFO [train.py:996] (3/4) Epoch 7, batch 3350, loss[loss=0.2262, simple_loss=0.2998, pruned_loss=0.07631, over 21862.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3254, pruned_loss=0.09122, over 4279207.06 frames. ], batch size: 282, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:12:12,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5
2023-06-22 02:12:13,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1117902.0, ans=0.125
2023-06-22 02:12:14,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1117902.0, ans=0.2
2023-06-22 02:13:45,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1118202.0, ans=0.0
2023-06-22 02:13:46,396 INFO [train.py:996] (3/4) Epoch 7, batch 3400, loss[loss=0.2126, simple_loss=0.2921, pruned_loss=0.06654, over 21321.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3257, pruned_loss=0.09159, over 4279189.63 frames. ], batch size: 159, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:14:23,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118322.0, ans=0.1
2023-06-22 02:14:52,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.058e+02 3.533e+02 4.088e+02 6.686e+02, threshold=7.066e+02, percent-clipped=0.0
2023-06-22 02:15:01,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.90 vs. limit=22.5
2023-06-22 02:15:19,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0
2023-06-22 02:15:20,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118442.0, ans=0.1
2023-06-22 02:15:25,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118502.0, ans=0.1
2023-06-22 02:15:25,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1118502.0, ans=0.125
2023-06-22 02:15:25,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1118502.0, ans=0.125
2023-06-22 02:15:26,620 INFO [train.py:996] (3/4) Epoch 7, batch 3450, loss[loss=0.3393, simple_loss=0.4116, pruned_loss=0.1336, over 21734.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3203, pruned_loss=0.09023, over 4278921.09 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:15:30,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1118502.0, ans=0.125
2023-06-22 02:15:30,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1118502.0, ans=0.125
2023-06-22 02:15:48,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1118562.0, ans=0.0
2023-06-22 02:16:28,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1118682.0, ans=0.125
2023-06-22 02:17:03,866 INFO [train.py:996] (3/4) Epoch 7, batch 3500, loss[loss=0.2726, simple_loss=0.3426, pruned_loss=0.1013, over 21481.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3296, pruned_loss=0.09379, over 4275420.62 frames. ], batch size: 194, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:17:18,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1118802.0, ans=15.0
2023-06-22 02:17:35,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1118862.0, ans=0.125
2023-06-22 02:17:56,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0
2023-06-22 02:18:14,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1118982.0, ans=0.125
2023-06-22 02:18:20,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.453e+02 3.920e+02 4.708e+02 8.175e+02, threshold=7.839e+02, percent-clipped=4.0
2023-06-22 02:18:24,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1118982.0, ans=0.0
2023-06-22 02:18:24,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=12.0
2023-06-22 02:18:35,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1119042.0, ans=0.125
2023-06-22 02:18:44,610 INFO [train.py:996] (3/4) Epoch 7, batch 3550, loss[loss=0.236, simple_loss=0.2935, pruned_loss=0.08921, over 21210.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3339, pruned_loss=0.09528, over 4275732.24 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:18:49,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0
2023-06-22 02:19:20,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1119222.0, ans=0.125
2023-06-22 02:19:37,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1119222.0, ans=0.0
2023-06-22 02:20:20,916 INFO [train.py:996] (3/4) Epoch 7, batch 3600, loss[loss=0.2172, simple_loss=0.2723, pruned_loss=0.081, over 21496.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3263, pruned_loss=0.09351, over 4279945.34 frames. ], batch size: 230, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:21:02,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1119462.0, ans=0.07
2023-06-22 02:21:03,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1119522.0, ans=0.2
2023-06-22 02:21:18,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1119522.0, ans=0.1
2023-06-22 02:21:38,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.375e+02 4.106e+02 5.045e+02 9.366e+02, threshold=8.213e+02, percent-clipped=2.0
2023-06-22 02:22:03,418 INFO [train.py:996] (3/4) Epoch 7, batch 3650, loss[loss=0.257, simple_loss=0.3303, pruned_loss=0.09183, over 21860.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3274, pruned_loss=0.09362, over 4282833.57 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:22:31,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1119762.0, ans=0.1
2023-06-22 02:22:51,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0
2023-06-22 02:23:16,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1119882.0, ans=0.0
2023-06-22 02:23:29,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1119942.0, ans=0.0
2023-06-22 02:23:43,415 INFO [train.py:996] (3/4) Epoch 7, batch 3700, loss[loss=0.2123, simple_loss=0.2886, pruned_loss=0.06801, over 21495.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3267, pruned_loss=0.09239, over 4283659.80 frames. ], batch size: 212, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:24:54,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5
2023-06-22 02:25:01,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.250e+02 3.889e+02 4.859e+02 8.141e+02, threshold=7.777e+02, percent-clipped=0.0
2023-06-22 02:25:03,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1120182.0, ans=0.09899494936611666
2023-06-22 02:25:24,490 INFO [train.py:996] (3/4) Epoch 7, batch 3750, loss[loss=0.2352, simple_loss=0.3175, pruned_loss=0.07646, over 21649.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3234, pruned_loss=0.09114, over 4288989.39 frames. ], batch size: 441, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:25:24,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1120302.0, ans=0.125
2023-06-22 02:25:37,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1120302.0, ans=0.125
2023-06-22 02:26:41,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1120482.0, ans=0.125
2023-06-22 02:26:41,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1120482.0, ans=0.125
2023-06-22 02:27:10,037 INFO [train.py:996] (3/4) Epoch 7, batch 3800, loss[loss=0.2952, simple_loss=0.3744, pruned_loss=0.108, over 21800.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3206, pruned_loss=0.0895, over 4285917.05 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:27:12,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1120602.0, ans=0.125
2023-06-22 02:27:37,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1120662.0, ans=0.0
2023-06-22 02:28:03,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1120722.0, ans=0.125
2023-06-22 02:28:06,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0
2023-06-22 02:28:12,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1120782.0, ans=0.0
2023-06-22 02:28:18,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.223e+02 3.848e+02 4.886e+02 9.152e+02, threshold=7.696e+02, percent-clipped=1.0
2023-06-22 02:28:39,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0
2023-06-22 02:28:46,243 INFO [train.py:996] (3/4) Epoch 7, batch 3850, loss[loss=0.2201, simple_loss=0.2757, pruned_loss=0.08225, over 21265.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3189, pruned_loss=0.08963, over 4285550.40 frames. ], batch size: 159, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:29:05,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1120962.0, ans=0.1
2023-06-22 02:29:32,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1121022.0, ans=0.2
2023-06-22 02:30:02,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.86 vs. limit=15.0
2023-06-22 02:30:25,855 INFO [train.py:996] (3/4) Epoch 7, batch 3900, loss[loss=0.2254, simple_loss=0.2838, pruned_loss=0.08352, over 21588.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3144, pruned_loss=0.08938, over 4286643.27 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0
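Each train.py:996 line above pairs the current batch's loss (over that batch's frames) with tot_loss over roughly 4.28M frames, i.e. a frame-weighted running aggregate rather than a single-batch number. A simple sketch of that bookkeeping; the actual tracker in train.py may differ, for instance by decaying or resetting every reset_interval batches:

    class FrameWeightedLoss:
        # Running frame-weighted average: each batch contributes in
        # proportion to its number of frames.
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)

    tot = FrameWeightedLoss()
    tot.update(0.2254, 21588.0)  # numbers from the batch 3900 line above
    print(f"tot_loss[loss={tot.value:.4g}, over {tot.frames:.2f} frames.]")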
2023-06-22 02:31:03,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121262.0, ans=0.1
2023-06-22 02:31:38,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.017e+02 3.574e+02 4.086e+02 6.704e+02, threshold=7.148e+02, percent-clipped=0.0
2023-06-22 02:31:44,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1121442.0, ans=0.125
2023-06-22 02:32:12,130 INFO [train.py:996] (3/4) Epoch 7, batch 3950, loss[loss=0.1806, simple_loss=0.2494, pruned_loss=0.05586, over 21116.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3161, pruned_loss=0.08849, over 4283857.69 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 16.0
2023-06-22 02:32:20,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1121502.0, ans=0.1
2023-06-22 02:32:31,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1121562.0, ans=0.125
2023-06-22 02:32:46,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1121562.0, ans=0.2
2023-06-22 02:32:48,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1121562.0, ans=0.0
2023-06-22 02:32:52,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1121622.0, ans=0.0
2023-06-22 02:33:17,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0
2023-06-22 02:33:53,230 INFO [train.py:996] (3/4) Epoch 7, batch 4000, loss[loss=0.1754, simple_loss=0.2564, pruned_loss=0.04721, over 21758.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3105, pruned_loss=0.08539, over 4274650.82 frames. ], batch size: 316, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:34:24,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1121862.0, ans=0.05
2023-06-22 02:34:54,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1121982.0, ans=0.2
2023-06-22 02:34:56,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1121982.0, ans=0.0
2023-06-22 02:34:59,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121982.0, ans=0.1
2023-06-22 02:35:00,540 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.938e+02 3.344e+02 4.048e+02 7.852e+02, threshold=6.687e+02, percent-clipped=1.0
2023-06-22 02:35:34,241 INFO [train.py:996] (3/4) Epoch 7, batch 4050, loss[loss=0.2367, simple_loss=0.3296, pruned_loss=0.07188, over 21738.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.309, pruned_loss=0.08331, over 4279525.47 frames. ], batch size: 282, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:36:26,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1122222.0, ans=0.125
2023-06-22 02:36:32,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1122282.0, ans=0.04949747468305833
2023-06-22 02:37:13,261 INFO [train.py:996] (3/4) Epoch 7, batch 4100, loss[loss=0.2798, simple_loss=0.3377, pruned_loss=0.111, over 21553.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3101, pruned_loss=0.08349, over 4277146.47 frames. ], batch size: 548, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:37:27,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1122402.0, ans=0.0
2023-06-22 02:37:42,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1122462.0, ans=0.0
2023-06-22 02:38:06,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5
2023-06-22 02:38:26,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.663e+02 2.992e+02 3.569e+02 4.943e+02, threshold=5.983e+02, percent-clipped=0.0
2023-06-22 02:38:27,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=22.5
2023-06-22 02:38:49,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1122642.0, ans=0.2
2023-06-22 02:38:51,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5
2023-06-22 02:38:53,873 INFO [train.py:996] (3/4) Epoch 7, batch 4150, loss[loss=0.2711, simple_loss=0.3554, pruned_loss=0.09339, over 21549.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3102, pruned_loss=0.08063, over 4274747.83 frames. ], batch size: 508, lr: 4.44e-03, grad_scale: 32.0
2023-06-22 02:40:14,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1122882.0, ans=0.125
2023-06-22 02:40:39,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1123002.0, ans=0.125
2023-06-22 02:40:45,124 INFO [train.py:996] (3/4) Epoch 7, batch 4200, loss[loss=0.1939, simple_loss=0.2841, pruned_loss=0.05182, over 21499.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3113, pruned_loss=0.08123, over 4268401.38 frames. ], batch size: 212, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:40:45,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1123002.0, ans=0.0
2023-06-22 02:41:07,143 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:41:56,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.116e+02 3.775e+02 4.930e+02 8.993e+02, threshold=7.550e+02, percent-clipped=12.0
2023-06-22 02:41:58,489 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:42:20,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1123242.0, ans=0.125
2023-06-22 02:42:27,348 INFO [train.py:996] (3/4) Epoch 7, batch 4250, loss[loss=0.3036, simple_loss=0.375, pruned_loss=0.1161, over 20690.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3216, pruned_loss=0.08447, over 4269961.70 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:42:37,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1123302.0, ans=0.05
2023-06-22 02:43:22,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1123422.0, ans=0.125
2023-06-22 02:44:05,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0
2023-06-22 02:44:10,792 INFO [train.py:996] (3/4) Epoch 7, batch 4300, loss[loss=0.265, simple_loss=0.339, pruned_loss=0.09549, over 21751.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3254, pruned_loss=0.08598, over 4268823.25 frames. ], batch size: 118, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:44:14,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1123602.0, ans=0.0
2023-06-22 02:44:24,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1123602.0, ans=0.0
2023-06-22 02:45:10,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1123722.0, ans=0.125
2023-06-22 02:45:10,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1123722.0, ans=0.125
2023-06-22 02:45:22,239 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:45:30,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1123782.0, ans=0.125
2023-06-22 02:45:31,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.519e+02 4.292e+02 5.383e+02 8.752e+02, threshold=8.584e+02, percent-clipped=3.0
2023-06-22 02:45:52,091 INFO [train.py:996] (3/4) Epoch 7, batch 4350, loss[loss=0.2639, simple_loss=0.3277, pruned_loss=0.1, over 21602.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3262, pruned_loss=0.08599, over 4262111.52 frames. ], batch size: 414, lr: 4.43e-03, grad_scale: 8.0
2023-06-22 02:47:21,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1124142.0, ans=0.125
2023-06-22 02:47:32,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1124202.0, ans=0.125
2023-06-22 02:47:39,352 INFO [train.py:996] (3/4) Epoch 7, batch 4400, loss[loss=0.2824, simple_loss=0.3479, pruned_loss=0.1084, over 19903.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3218, pruned_loss=0.0857, over 4251642.05 frames. ], batch size: 702, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:47:40,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0
2023-06-22 02:48:25,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1124262.0, ans=0.125
2023-06-22 02:48:57,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.105e+02 3.566e+02 4.210e+02 6.733e+02, threshold=7.132e+02, percent-clipped=0.0
2023-06-22 02:49:21,668 INFO [train.py:996] (3/4) Epoch 7, batch 4450, loss[loss=0.235, simple_loss=0.3181, pruned_loss=0.0759, over 21441.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3269, pruned_loss=0.08687, over 4256866.01 frames. ], batch size: 211, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:49:32,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1124502.0, ans=0.2
2023-06-22 02:49:38,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1124502.0, ans=0.2
2023-06-22 02:51:06,300 INFO [train.py:996] (3/4) Epoch 7, batch 4500, loss[loss=0.2586, simple_loss=0.3467, pruned_loss=0.08521, over 20760.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3299, pruned_loss=0.08969, over 4264133.27 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:51:27,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1124862.0, ans=0.07
2023-06-22 02:51:59,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1124922.0, ans=0.0
2023-06-22 02:52:03,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1124982.0, ans=0.0
2023-06-22 02:52:12,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1124982.0, ans=0.125
2023-06-22 02:52:22,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.121e+02 3.546e+02 4.306e+02 8.092e+02, threshold=7.092e+02, percent-clipped=3.0
2023-06-22 02:52:47,535 INFO [train.py:996] (3/4) Epoch 7, batch 4550, loss[loss=0.3126, simple_loss=0.3806, pruned_loss=0.1223, over 21581.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3315, pruned_loss=0.08965, over 4265714.83 frames. ], batch size: 414, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:53:23,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1125162.0, ans=0.125
2023-06-22 02:53:34,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0
2023-06-22 02:53:46,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1125282.0, ans=0.07
2023-06-22 02:53:59,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1125282.0, ans=0.125
2023-06-22 02:54:04,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1125282.0, ans=0.2
2023-06-22 02:54:29,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1125342.0, ans=0.125
2023-06-22 02:54:33,341 INFO [train.py:996] (3/4) Epoch 7, batch 4600, loss[loss=0.2226, simple_loss=0.296, pruned_loss=0.07461, over 21426.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3321, pruned_loss=0.09048, over 4273964.51 frames. ], batch size: 211, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:54:54,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1125462.0, ans=0.2
2023-06-22 02:54:59,448 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 02:55:24,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125582.0, ans=0.1
2023-06-22 02:55:36,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1125582.0, ans=0.0
2023-06-22 02:55:41,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1125582.0, ans=0.05
2023-06-22 02:55:43,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.156e+02 3.559e+02 4.324e+02 6.713e+02, threshold=7.117e+02, percent-clipped=0.0
2023-06-22 02:55:55,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0
2023-06-22 02:56:13,808 INFO [train.py:996] (3/4) Epoch 7, batch 4650, loss[loss=0.1842, simple_loss=0.2598, pruned_loss=0.05426, over 21609.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3265, pruned_loss=0.08809, over 4278362.04 frames. ], batch size: 263, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:57:42,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1125942.0, ans=0.125
2023-06-22 02:57:42,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1125942.0, ans=0.125
2023-06-22 02:57:49,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1125942.0, ans=0.0
2023-06-22 02:57:58,271 INFO [train.py:996] (3/4) Epoch 7, batch 4700, loss[loss=0.2442, simple_loss=0.299, pruned_loss=0.09466, over 21279.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3164, pruned_loss=0.08573, over 4278490.34 frames. ], batch size: 159, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:59:03,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.887e+02 3.239e+02 4.275e+02 6.571e+02, threshold=6.478e+02, percent-clipped=0.0
2023-06-22 02:59:14,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1126242.0, ans=0.125
2023-06-22 02:59:31,460 INFO [train.py:996] (3/4) Epoch 7, batch 4750, loss[loss=0.2224, simple_loss=0.2905, pruned_loss=0.07713, over 21697.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3114, pruned_loss=0.08572, over 4285674.12 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 02:59:42,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.19 vs. limit=10.0
2023-06-22 03:00:11,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1126422.0, ans=0.0
2023-06-22 03:00:43,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0
2023-06-22 03:01:16,456 INFO [train.py:996] (3/4) Epoch 7, batch 4800, loss[loss=0.2674, simple_loss=0.3249, pruned_loss=0.105, over 21547.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3143, pruned_loss=0.08669, over 4289782.56 frames. ], batch size: 473, lr: 4.43e-03, grad_scale: 32.0
2023-06-22 03:02:01,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1126722.0, ans=0.0
2023-06-22 03:02:23,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.114e+02 3.560e+02 4.146e+02 5.866e+02, threshold=7.121e+02, percent-clipped=0.0
2023-06-22 03:02:29,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1126842.0, ans=0.125
2023-06-22 03:02:40,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1126842.0, ans=0.125
2023-06-22 03:02:56,144 INFO [train.py:996] (3/4) Epoch 7, batch 4850, loss[loss=0.2204, simple_loss=0.3364, pruned_loss=0.05215, over 20947.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3141, pruned_loss=0.08592, over 4288505.88 frames. ], batch size: 608, lr: 4.43e-03, grad_scale: 16.0
2023-06-22 03:02:58,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1126902.0, ans=0.2
2023-06-22 03:03:36,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1127022.0, ans=0.2
2023-06-22 03:04:36,844 INFO [train.py:996] (3/4) Epoch 7, batch 4900, loss[loss=0.2518, simple_loss=0.3493, pruned_loss=0.0772, over 21722.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.316, pruned_loss=0.08569, over 4271429.32 frames.
], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:04:52,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1127262.0, ans=0.1 2023-06-22 03:05:55,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.055e+02 3.401e+02 3.973e+02 6.495e+02, threshold=6.802e+02, percent-clipped=0.0 2023-06-22 03:06:13,322 INFO [train.py:996] (3/4) Epoch 7, batch 4950, loss[loss=0.2737, simple_loss=0.3681, pruned_loss=0.08964, over 21648.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3194, pruned_loss=0.08536, over 4268649.89 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:06:23,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1127502.0, ans=0.125 2023-06-22 03:07:02,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1127622.0, ans=0.125 2023-06-22 03:07:52,437 INFO [train.py:996] (3/4) Epoch 7, batch 5000, loss[loss=0.2479, simple_loss=0.3201, pruned_loss=0.08785, over 21769.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3183, pruned_loss=0.08248, over 4272639.59 frames. ], batch size: 112, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:08:05,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1127802.0, ans=0.125 2023-06-22 03:08:10,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-22 03:08:49,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1127982.0, ans=0.0 2023-06-22 03:08:51,286 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:09:08,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.797e+02 3.102e+02 3.690e+02 6.361e+02, threshold=6.203e+02, percent-clipped=0.0 2023-06-22 03:09:18,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1128042.0, ans=0.125 2023-06-22 03:09:30,774 INFO [train.py:996] (3/4) Epoch 7, batch 5050, loss[loss=0.2175, simple_loss=0.3077, pruned_loss=0.06364, over 19976.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3173, pruned_loss=0.08383, over 4276638.45 frames. ], batch size: 702, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:10:41,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2023-06-22 03:10:48,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1128342.0, ans=0.0 2023-06-22 03:11:05,873 INFO [train.py:996] (3/4) Epoch 7, batch 5100, loss[loss=0.2273, simple_loss=0.2911, pruned_loss=0.08174, over 21825.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3155, pruned_loss=0.08329, over 4282970.66 frames. 
], batch size: 124, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:11:09,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1128402.0, ans=0.0 2023-06-22 03:11:43,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1128522.0, ans=0.125 2023-06-22 03:11:55,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1128582.0, ans=0.0 2023-06-22 03:12:17,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.054e+02 3.634e+02 4.109e+02 8.042e+02, threshold=7.267e+02, percent-clipped=4.0 2023-06-22 03:12:33,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1128642.0, ans=0.125 2023-06-22 03:12:37,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-22 03:12:39,998 INFO [train.py:996] (3/4) Epoch 7, batch 5150, loss[loss=0.2575, simple_loss=0.3337, pruned_loss=0.09071, over 21819.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3145, pruned_loss=0.08437, over 4287965.69 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:12:42,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-22 03:12:57,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1128762.0, ans=0.5 2023-06-22 03:13:42,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1128882.0, ans=0.0 2023-06-22 03:14:15,495 INFO [train.py:996] (3/4) Epoch 7, batch 5200, loss[loss=0.2447, simple_loss=0.3313, pruned_loss=0.07906, over 21303.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.319, pruned_loss=0.08467, over 4279042.41 frames. ], batch size: 176, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:14:51,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-22 03:15:01,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1129122.0, ans=0.2 2023-06-22 03:15:10,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1129122.0, ans=0.0 2023-06-22 03:15:19,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1129182.0, ans=0.125 2023-06-22 03:15:38,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.246e+02 3.920e+02 4.845e+02 8.696e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 03:15:54,263 INFO [train.py:996] (3/4) Epoch 7, batch 5250, loss[loss=0.2527, simple_loss=0.3336, pruned_loss=0.0859, over 21649.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3225, pruned_loss=0.08351, over 4273795.30 frames. 
], batch size: 263, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:16:05,362 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:16:05,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-22 03:16:16,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1129362.0, ans=0.0 2023-06-22 03:17:32,606 INFO [train.py:996] (3/4) Epoch 7, batch 5300, loss[loss=0.2224, simple_loss=0.2917, pruned_loss=0.07662, over 21587.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3217, pruned_loss=0.0847, over 4284303.30 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:17:36,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129602.0, ans=0.1 2023-06-22 03:17:38,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-22 03:17:53,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1129662.0, ans=0.05 2023-06-22 03:18:00,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1129662.0, ans=0.125 2023-06-22 03:18:14,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1129722.0, ans=0.0 2023-06-22 03:18:54,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.014e+02 3.652e+02 4.152e+02 6.819e+02, threshold=7.305e+02, percent-clipped=0.0 2023-06-22 03:19:07,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1129842.0, ans=0.125 2023-06-22 03:19:09,863 INFO [train.py:996] (3/4) Epoch 7, batch 5350, loss[loss=0.2862, simple_loss=0.3346, pruned_loss=0.1189, over 21757.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3201, pruned_loss=0.08644, over 4287913.16 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:19:37,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1129962.0, ans=0.0 2023-06-22 03:19:42,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-06-22 03:19:50,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1130022.0, ans=0.125 2023-06-22 03:20:24,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1130082.0, ans=0.125 2023-06-22 03:20:50,426 INFO [train.py:996] (3/4) Epoch 7, batch 5400, loss[loss=0.262, simple_loss=0.3267, pruned_loss=0.09866, over 21689.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3182, pruned_loss=0.08673, over 4292965.73 frames. 
], batch size: 508, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:20:56,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1130202.0, ans=0.125 2023-06-22 03:21:23,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1130262.0, ans=0.125 2023-06-22 03:21:24,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.32 vs. limit=6.0 2023-06-22 03:22:14,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.826e+02 3.234e+02 3.815e+02 6.268e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-22 03:22:30,499 INFO [train.py:996] (3/4) Epoch 7, batch 5450, loss[loss=0.265, simple_loss=0.3729, pruned_loss=0.07854, over 21736.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3183, pruned_loss=0.08451, over 4291799.54 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:22:40,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1130502.0, ans=0.125 2023-06-22 03:22:45,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-22 03:22:46,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130562.0, ans=0.125 2023-06-22 03:23:42,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130682.0, ans=0.125 2023-06-22 03:23:57,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1130742.0, ans=0.0 2023-06-22 03:24:12,379 INFO [train.py:996] (3/4) Epoch 7, batch 5500, loss[loss=0.3249, simple_loss=0.403, pruned_loss=0.1234, over 21468.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3218, pruned_loss=0.08145, over 4288946.46 frames. ], batch size: 507, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:24:24,195 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:25:01,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1130922.0, ans=0.95 2023-06-22 03:25:16,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1130982.0, ans=0.125 2023-06-22 03:25:19,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1130982.0, ans=0.125 2023-06-22 03:25:31,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.055e+02 3.640e+02 4.334e+02 7.311e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-22 03:25:53,702 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:25:57,918 INFO [train.py:996] (3/4) Epoch 7, batch 5550, loss[loss=0.2244, simple_loss=0.3082, pruned_loss=0.07032, over 21701.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3221, pruned_loss=0.07911, over 4280923.22 frames. 
], batch size: 247, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:26:44,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1131222.0, ans=0.125 2023-06-22 03:26:53,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=22.5 2023-06-22 03:27:26,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1131342.0, ans=0.125 2023-06-22 03:27:43,801 INFO [train.py:996] (3/4) Epoch 7, batch 5600, loss[loss=0.2626, simple_loss=0.3583, pruned_loss=0.0834, over 21878.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3227, pruned_loss=0.07846, over 4280775.80 frames. ], batch size: 317, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:29:03,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.861e+02 3.641e+02 4.597e+02 1.091e+03, threshold=7.283e+02, percent-clipped=6.0 2023-06-22 03:29:05,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1131642.0, ans=0.125 2023-06-22 03:29:22,532 INFO [train.py:996] (3/4) Epoch 7, batch 5650, loss[loss=0.2428, simple_loss=0.3117, pruned_loss=0.08697, over 21893.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3251, pruned_loss=0.08074, over 4288492.96 frames. ], batch size: 316, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:29:37,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.14 vs. limit=15.0 2023-06-22 03:29:51,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1131762.0, ans=0.0 2023-06-22 03:29:53,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131762.0, ans=0.1 2023-06-22 03:31:06,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1132002.0, ans=0.0 2023-06-22 03:31:06,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1132002.0, ans=0.125 2023-06-22 03:31:07,561 INFO [train.py:996] (3/4) Epoch 7, batch 5700, loss[loss=0.2511, simple_loss=0.3385, pruned_loss=0.08182, over 21639.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3244, pruned_loss=0.08204, over 4292035.33 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:31:14,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1132002.0, ans=0.125 2023-06-22 03:31:39,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=17.89 vs. limit=15.0 2023-06-22 03:31:51,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. 
limit=15.0 2023-06-22 03:31:51,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1132122.0, ans=0.125 2023-06-22 03:32:19,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1132182.0, ans=0.2 2023-06-22 03:32:33,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.008e+02 3.484e+02 4.188e+02 7.295e+02, threshold=6.968e+02, percent-clipped=1.0 2023-06-22 03:32:48,508 INFO [train.py:996] (3/4) Epoch 7, batch 5750, loss[loss=0.1965, simple_loss=0.2947, pruned_loss=0.04919, over 21748.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3196, pruned_loss=0.07895, over 4281993.02 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:33:12,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1132362.0, ans=0.125 2023-06-22 03:34:01,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1132482.0, ans=0.125 2023-06-22 03:34:25,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1132542.0, ans=0.0 2023-06-22 03:34:28,144 INFO [train.py:996] (3/4) Epoch 7, batch 5800, loss[loss=0.2342, simple_loss=0.3307, pruned_loss=0.06889, over 21784.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3183, pruned_loss=0.07726, over 4273662.58 frames. ], batch size: 282, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:35:11,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1132722.0, ans=0.125 2023-06-22 03:35:35,780 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:35:55,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.844e+02 3.714e+02 4.784e+02 7.655e+02, threshold=7.428e+02, percent-clipped=1.0 2023-06-22 03:36:10,039 INFO [train.py:996] (3/4) Epoch 7, batch 5850, loss[loss=0.1833, simple_loss=0.2858, pruned_loss=0.04037, over 21757.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3154, pruned_loss=0.07282, over 4281716.67 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:37:07,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-22 03:37:16,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1133082.0, ans=0.0 2023-06-22 03:37:34,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133142.0, ans=0.1 2023-06-22 03:37:49,848 INFO [train.py:996] (3/4) Epoch 7, batch 5900, loss[loss=0.1946, simple_loss=0.2689, pruned_loss=0.06013, over 21584.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3073, pruned_loss=0.06745, over 4286255.99 frames. ], batch size: 211, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:37:50,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1133202.0, ans=0.2 2023-06-22 03:37:56,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=15.0 2023-06-22 03:38:17,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1133262.0, ans=0.0 2023-06-22 03:39:08,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.538e+02 2.951e+02 3.905e+02 7.879e+02, threshold=5.902e+02, percent-clipped=2.0 2023-06-22 03:39:23,027 INFO [train.py:996] (3/4) Epoch 7, batch 5950, loss[loss=0.2011, simple_loss=0.2779, pruned_loss=0.06216, over 21806.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3064, pruned_loss=0.0704, over 4290755.28 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:40:19,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.25 vs. limit=10.0 2023-06-22 03:41:00,574 INFO [train.py:996] (3/4) Epoch 7, batch 6000, loss[loss=0.2386, simple_loss=0.2962, pruned_loss=0.09055, over 21786.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3048, pruned_loss=0.07467, over 4262710.23 frames. ], batch size: 112, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:41:00,575 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 03:41:21,112 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2587, simple_loss=0.3532, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-22 03:41:21,112 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 03:41:27,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-22 03:42:11,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1133922.0, ans=0.0 2023-06-22 03:42:43,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.441e+02 4.226e+02 5.483e+02 1.064e+03, threshold=8.451e+02, percent-clipped=15.0 2023-06-22 03:43:01,815 INFO [train.py:996] (3/4) Epoch 7, batch 6050, loss[loss=0.1942, simple_loss=0.2709, pruned_loss=0.05881, over 21503.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3031, pruned_loss=0.07541, over 4262179.15 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:43:14,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-22 03:43:33,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1134162.0, ans=0.2 2023-06-22 03:44:02,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1134282.0, ans=0.04949747468305833 2023-06-22 03:44:37,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1134342.0, ans=0.0 2023-06-22 03:44:39,792 INFO [train.py:996] (3/4) Epoch 7, batch 6100, loss[loss=0.2806, simple_loss=0.3399, pruned_loss=0.1106, over 21800.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3014, pruned_loss=0.07392, over 4270101.12 frames. 
], batch size: 112, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:45:06,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1134462.0, ans=0.2 2023-06-22 03:45:12,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:45:16,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1134462.0, ans=0.125 2023-06-22 03:45:21,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1134522.0, ans=0.0 2023-06-22 03:45:23,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1134522.0, ans=0.125 2023-06-22 03:45:29,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1134522.0, ans=0.0 2023-06-22 03:45:32,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1134522.0, ans=0.0 2023-06-22 03:45:39,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1134582.0, ans=0.2 2023-06-22 03:45:44,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1134582.0, ans=0.125 2023-06-22 03:45:58,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1134642.0, ans=0.2 2023-06-22 03:46:01,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.844e+02 3.267e+02 3.769e+02 7.598e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-22 03:46:23,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1134702.0, ans=0.125 2023-06-22 03:46:24,526 INFO [train.py:996] (3/4) Epoch 7, batch 6150, loss[loss=0.2021, simple_loss=0.2795, pruned_loss=0.06241, over 21594.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3028, pruned_loss=0.07686, over 4271514.03 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:46:42,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1134702.0, ans=0.0 2023-06-22 03:47:00,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1134822.0, ans=0.125 2023-06-22 03:47:10,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134822.0, ans=0.1 2023-06-22 03:47:11,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1134822.0, ans=0.025 2023-06-22 03:48:01,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1135002.0, ans=0.125 2023-06-22 03:48:02,647 INFO [train.py:996] (3/4) Epoch 7, batch 6200, loss[loss=0.2487, simple_loss=0.3222, pruned_loss=0.08755, over 21858.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3047, pruned_loss=0.07748, over 4270200.86 frames. 
], batch size: 118, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:48:09,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135002.0, ans=0.1 2023-06-22 03:48:30,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1135062.0, ans=0.2 2023-06-22 03:48:35,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1135062.0, ans=0.125 2023-06-22 03:48:57,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1135182.0, ans=0.2 2023-06-22 03:49:28,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.933e+02 3.461e+02 4.493e+02 7.617e+02, threshold=6.923e+02, percent-clipped=2.0 2023-06-22 03:49:33,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1135242.0, ans=0.2 2023-06-22 03:49:36,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1135242.0, ans=0.0 2023-06-22 03:49:40,998 INFO [train.py:996] (3/4) Epoch 7, batch 6250, loss[loss=0.1938, simple_loss=0.2864, pruned_loss=0.05061, over 21792.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.308, pruned_loss=0.07684, over 4271699.71 frames. ], batch size: 282, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:50:00,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1135362.0, ans=10.0 2023-06-22 03:50:04,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1135362.0, ans=0.125 2023-06-22 03:50:18,226 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:50:26,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1135422.0, ans=0.0 2023-06-22 03:50:27,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1135422.0, ans=0.125 2023-06-22 03:51:25,835 INFO [train.py:996] (3/4) Epoch 7, batch 6300, loss[loss=0.2576, simple_loss=0.3191, pruned_loss=0.09802, over 21330.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.311, pruned_loss=0.07632, over 4272572.74 frames. ], batch size: 159, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:51:39,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1135602.0, ans=0.125 2023-06-22 03:51:45,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135662.0, ans=0.1 2023-06-22 03:52:12,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-22 03:52:28,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1135782.0, ans=0.125 2023-06-22 03:52:42,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. 
limit=15.0 2023-06-22 03:52:52,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.056e+02 3.560e+02 4.261e+02 7.497e+02, threshold=7.120e+02, percent-clipped=1.0 2023-06-22 03:53:05,181 INFO [train.py:996] (3/4) Epoch 7, batch 6350, loss[loss=0.2527, simple_loss=0.3124, pruned_loss=0.09652, over 21440.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3156, pruned_loss=0.08197, over 4279842.78 frames. ], batch size: 211, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:53:24,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1135962.0, ans=0.125 2023-06-22 03:54:09,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136082.0, ans=0.1 2023-06-22 03:54:09,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1136082.0, ans=0.95 2023-06-22 03:54:35,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136142.0, ans=0.1 2023-06-22 03:54:45,886 INFO [train.py:996] (3/4) Epoch 7, batch 6400, loss[loss=0.2599, simple_loss=0.3323, pruned_loss=0.09374, over 21872.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3205, pruned_loss=0.08568, over 4281719.55 frames. ], batch size: 124, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:54:49,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136202.0, ans=0.1 2023-06-22 03:55:05,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1136262.0, ans=0.125 2023-06-22 03:55:24,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136322.0, ans=0.1 2023-06-22 03:56:10,186 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.095e+02 3.612e+02 4.103e+02 7.644e+02, threshold=7.224e+02, percent-clipped=1.0 2023-06-22 03:56:21,477 INFO [train.py:996] (3/4) Epoch 7, batch 6450, loss[loss=0.2052, simple_loss=0.2853, pruned_loss=0.06255, over 16290.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3238, pruned_loss=0.0859, over 4279576.70 frames. ], batch size: 63, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:56:22,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1136502.0, ans=0.2 2023-06-22 03:56:27,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=22.5 2023-06-22 03:56:29,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136502.0, ans=0.1 2023-06-22 03:56:40,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1136562.0, ans=0.125 2023-06-22 03:57:08,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136622.0, ans=0.1 2023-06-22 03:57:27,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136682.0, ans=0.1 2023-06-22 03:57:56,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136742.0, ans=0.1 2023-06-22 03:57:56,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-22 03:58:00,817 INFO [train.py:996] (3/4) Epoch 7, batch 6500, loss[loss=0.1907, simple_loss=0.272, pruned_loss=0.05468, over 21587.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.317, pruned_loss=0.0843, over 4279630.59 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:58:41,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.16 vs. limit=10.0 2023-06-22 03:59:29,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.034e+02 3.497e+02 4.367e+02 8.159e+02, threshold=6.993e+02, percent-clipped=3.0 2023-06-22 03:59:40,145 INFO [train.py:996] (3/4) Epoch 7, batch 6550, loss[loss=0.2911, simple_loss=0.3538, pruned_loss=0.1142, over 21865.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3172, pruned_loss=0.08315, over 4279118.92 frames. ], batch size: 107, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:00:50,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137282.0, ans=0.1 2023-06-22 04:01:19,382 INFO [train.py:996] (3/4) Epoch 7, batch 6600, loss[loss=0.2216, simple_loss=0.2759, pruned_loss=0.08362, over 21594.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3102, pruned_loss=0.08246, over 4276424.76 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:02:26,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137582.0, ans=0.1 2023-06-22 04:02:49,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.768e+02 3.180e+02 3.748e+02 5.312e+02, threshold=6.360e+02, percent-clipped=0.0 2023-06-22 04:02:59,291 INFO [train.py:996] (3/4) Epoch 7, batch 6650, loss[loss=0.2291, simple_loss=0.2918, pruned_loss=0.0832, over 21969.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3026, pruned_loss=0.08009, over 4274496.35 frames. 
], batch size: 103, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:03:52,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1137822.0, ans=0.125 2023-06-22 04:04:18,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1137882.0, ans=0.125 2023-06-22 04:04:39,335 INFO [train.py:996] (3/4) Epoch 7, batch 6700, loss[loss=0.2065, simple_loss=0.2562, pruned_loss=0.07836, over 20414.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2975, pruned_loss=0.07866, over 4266057.92 frames. ], batch size: 703, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:04:55,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1138002.0, ans=0.125 2023-06-22 04:05:24,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-22 04:05:42,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-06-22 04:05:47,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 04:06:04,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-22 04:06:08,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.940e+02 3.406e+02 4.084e+02 6.605e+02, threshold=6.813e+02, percent-clipped=1.0 2023-06-22 04:06:17,940 INFO [train.py:996] (3/4) Epoch 7, batch 6750, loss[loss=0.227, simple_loss=0.2909, pruned_loss=0.08154, over 21812.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2956, pruned_loss=0.07894, over 4270127.54 frames. ], batch size: 282, lr: 4.40e-03, grad_scale: 8.0 2023-06-22 04:06:54,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1138362.0, ans=0.125 2023-06-22 04:07:40,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1138542.0, ans=0.0 2023-06-22 04:07:48,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1138542.0, ans=0.125 2023-06-22 04:07:55,306 INFO [train.py:996] (3/4) Epoch 7, batch 6800, loss[loss=0.1886, simple_loss=0.2634, pruned_loss=0.05694, over 21624.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2976, pruned_loss=0.08091, over 4265941.73 frames. ], batch size: 263, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:08:32,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1138662.0, ans=0.125 2023-06-22 04:08:45,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-22 04:09:24,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.025e+02 3.541e+02 4.379e+02 6.653e+02, threshold=7.081e+02, percent-clipped=0.0 2023-06-22 04:09:33,718 INFO [train.py:996] (3/4) Epoch 7, batch 6850, loss[loss=0.2209, simple_loss=0.2853, pruned_loss=0.07826, over 21669.00 frames. 
], tot_loss[loss=0.2306, simple_loss=0.2967, pruned_loss=0.08229, over 4271283.28 frames. ], batch size: 264, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:09:52,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1138902.0, ans=0.125 2023-06-22 04:10:08,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1138962.0, ans=0.2 2023-06-22 04:10:29,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-22 04:10:40,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1139082.0, ans=0.125 2023-06-22 04:11:02,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=15.0 2023-06-22 04:11:08,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-22 04:11:14,095 INFO [train.py:996] (3/4) Epoch 7, batch 6900, loss[loss=0.3039, simple_loss=0.3488, pruned_loss=0.1295, over 21765.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2987, pruned_loss=0.08259, over 4281200.87 frames. ], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:11:21,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1139202.0, ans=0.125 2023-06-22 04:11:33,376 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:12:06,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139322.0, ans=0.1 2023-06-22 04:12:14,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1139322.0, ans=0.125 2023-06-22 04:12:22,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139382.0, ans=0.125 2023-06-22 04:12:23,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=22.5 2023-06-22 04:12:24,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-22 04:12:30,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1139382.0, ans=0.2 2023-06-22 04:12:41,059 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.939e+02 3.532e+02 4.220e+02 8.926e+02, threshold=7.064e+02, percent-clipped=5.0 2023-06-22 04:12:55,490 INFO [train.py:996] (3/4) Epoch 7, batch 6950, loss[loss=0.2846, simple_loss=0.3487, pruned_loss=0.1103, over 21553.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3013, pruned_loss=0.07955, over 4284377.62 frames. ], batch size: 211, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:13:15,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.12 vs. 
limit=22.5 2023-06-22 04:13:17,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1139502.0, ans=0.125 2023-06-22 04:13:27,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1139562.0, ans=0.0 2023-06-22 04:13:53,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1139622.0, ans=0.2 2023-06-22 04:14:35,491 INFO [train.py:996] (3/4) Epoch 7, batch 7000, loss[loss=0.2158, simple_loss=0.2788, pruned_loss=0.07638, over 21181.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3056, pruned_loss=0.08292, over 4290324.58 frames. ], batch size: 143, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:15:00,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1139862.0, ans=0.125 2023-06-22 04:15:15,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139862.0, ans=0.125 2023-06-22 04:15:34,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1139982.0, ans=0.125 2023-06-22 04:15:35,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1139982.0, ans=0.0 2023-06-22 04:15:48,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1140042.0, ans=0.125 2023-06-22 04:16:01,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 3.073e+02 3.537e+02 4.508e+02 8.250e+02, threshold=7.073e+02, percent-clipped=4.0 2023-06-22 04:16:10,891 INFO [train.py:996] (3/4) Epoch 7, batch 7050, loss[loss=0.1885, simple_loss=0.2773, pruned_loss=0.04985, over 21797.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3039, pruned_loss=0.0823, over 4282076.36 frames. ], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:16:27,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1140102.0, ans=0.125 2023-06-22 04:16:31,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1140162.0, ans=0.125 2023-06-22 04:17:52,893 INFO [train.py:996] (3/4) Epoch 7, batch 7100, loss[loss=0.261, simple_loss=0.3371, pruned_loss=0.09243, over 21786.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3091, pruned_loss=0.08388, over 4282319.68 frames. ], batch size: 124, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:18:19,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140462.0, ans=0.1 2023-06-22 04:18:22,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1140462.0, ans=0.0 2023-06-22 04:18:49,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1140582.0, ans=0.0 2023-06-22 04:18:58,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.86 vs. 
limit=15.0 2023-06-22 04:19:05,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1140582.0, ans=0.125 2023-06-22 04:19:25,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.884e+02 3.183e+02 3.903e+02 6.649e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-22 04:19:34,806 INFO [train.py:996] (3/4) Epoch 7, batch 7150, loss[loss=0.1735, simple_loss=0.2465, pruned_loss=0.0503, over 21239.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3062, pruned_loss=0.08158, over 4274385.67 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:19:38,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1140702.0, ans=0.125 2023-06-22 04:19:38,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1140702.0, ans=0.125 2023-06-22 04:19:52,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140762.0, ans=0.125 2023-06-22 04:20:07,846 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:20:14,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1140822.0, ans=0.125 2023-06-22 04:21:14,895 INFO [train.py:996] (3/4) Epoch 7, batch 7200, loss[loss=0.2604, simple_loss=0.319, pruned_loss=0.1009, over 21336.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3078, pruned_loss=0.08262, over 4272702.18 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:22:38,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141242.0, ans=0.125 2023-06-22 04:22:44,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.920e+02 3.463e+02 4.119e+02 7.524e+02, threshold=6.925e+02, percent-clipped=3.0 2023-06-22 04:22:53,903 INFO [train.py:996] (3/4) Epoch 7, batch 7250, loss[loss=0.2338, simple_loss=0.2853, pruned_loss=0.09112, over 21595.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3041, pruned_loss=0.08294, over 4272105.25 frames. 
], batch size: 247, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:22:54,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1141302.0, ans=0.0 2023-06-22 04:23:01,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1141302.0, ans=0.035 2023-06-22 04:23:03,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1141302.0, ans=0.125 2023-06-22 04:23:22,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1141362.0, ans=0.0 2023-06-22 04:23:24,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1141422.0, ans=0.125 2023-06-22 04:23:28,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1141422.0, ans=0.0 2023-06-22 04:23:28,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1141422.0, ans=0.125 2023-06-22 04:24:33,090 INFO [train.py:996] (3/4) Epoch 7, batch 7300, loss[loss=0.2197, simple_loss=0.2717, pruned_loss=0.08385, over 21429.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2977, pruned_loss=0.08205, over 4272908.44 frames. ], batch size: 212, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:24:53,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1141662.0, ans=0.0 2023-06-22 04:25:08,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1141722.0, ans=0.2 2023-06-22 04:25:37,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1141782.0, ans=0.1 2023-06-22 04:26:04,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.850e+02 3.371e+02 4.063e+02 7.936e+02, threshold=6.743e+02, percent-clipped=2.0 2023-06-22 04:26:13,046 INFO [train.py:996] (3/4) Epoch 7, batch 7350, loss[loss=0.2343, simple_loss=0.2961, pruned_loss=0.08626, over 21249.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2957, pruned_loss=0.08184, over 4270897.93 frames. ], batch size: 159, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:27:04,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-22 04:27:07,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1142022.0, ans=0.125 2023-06-22 04:27:25,642 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:27:37,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1142142.0, ans=0.2 2023-06-22 04:27:43,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1142142.0, ans=0.0 2023-06-22 04:27:49,466 INFO [train.py:996] (3/4) Epoch 7, batch 7400, loss[loss=0.2651, simple_loss=0.3393, pruned_loss=0.09543, over 21620.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3032, pruned_loss=0.08461, over 4273170.20 frames. 
], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:28:06,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1142262.0, ans=0.2 2023-06-22 04:28:32,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1142322.0, ans=0.0 2023-06-22 04:28:42,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1142322.0, ans=0.125 2023-06-22 04:28:56,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142382.0, ans=0.1 2023-06-22 04:29:06,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1142382.0, ans=0.125 2023-06-22 04:29:21,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.129e+02 3.591e+02 4.476e+02 8.193e+02, threshold=7.182e+02, percent-clipped=3.0 2023-06-22 04:29:22,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1142442.0, ans=0.0 2023-06-22 04:29:29,601 INFO [train.py:996] (3/4) Epoch 7, batch 7450, loss[loss=0.2256, simple_loss=0.3058, pruned_loss=0.0727, over 21532.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3032, pruned_loss=0.08379, over 4278257.07 frames. ], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:29:45,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1142562.0, ans=0.125 2023-06-22 04:30:33,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1142622.0, ans=0.125 2023-06-22 04:30:48,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1142682.0, ans=0.125 2023-06-22 04:30:52,903 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:31:10,680 INFO [train.py:996] (3/4) Epoch 7, batch 7500, loss[loss=0.3531, simple_loss=0.4317, pruned_loss=0.1373, over 21475.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.309, pruned_loss=0.08629, over 4282855.22 frames. 
], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:31:16,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1142802.0, ans=0.125 2023-06-22 04:31:36,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1142862.0, ans=0.2 2023-06-22 04:32:21,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1142982.0, ans=0.125 2023-06-22 04:32:28,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1142982.0, ans=0.125 2023-06-22 04:32:40,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1143042.0, ans=0.125 2023-06-22 04:32:40,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1143042.0, ans=0.09899494936611666 2023-06-22 04:32:42,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1143042.0, ans=0.125 2023-06-22 04:32:43,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 3.393e+02 4.362e+02 5.679e+02 1.317e+03, threshold=8.723e+02, percent-clipped=9.0 2023-06-22 04:32:51,274 INFO [train.py:996] (3/4) Epoch 7, batch 7550, loss[loss=0.2403, simple_loss=0.3323, pruned_loss=0.07411, over 21765.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3164, pruned_loss=0.08448, over 4283250.64 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:32:52,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-22 04:33:15,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1143162.0, ans=0.125 2023-06-22 04:33:17,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1143162.0, ans=0.0 2023-06-22 04:33:44,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1143222.0, ans=0.0 2023-06-22 04:33:44,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1143222.0, ans=0.125 2023-06-22 04:34:30,100 INFO [train.py:996] (3/4) Epoch 7, batch 7600, loss[loss=0.2573, simple_loss=0.3251, pruned_loss=0.09476, over 21778.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3142, pruned_loss=0.08267, over 4288790.75 frames. ], batch size: 441, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:34:50,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-22 04:35:27,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-22 04:35:37,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. 
limit=15.0 2023-06-22 04:35:40,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1143582.0, ans=0.0 2023-06-22 04:35:56,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.783e+02 3.344e+02 4.108e+02 6.348e+02, threshold=6.687e+02, percent-clipped=0.0 2023-06-22 04:36:04,728 INFO [train.py:996] (3/4) Epoch 7, batch 7650, loss[loss=0.2846, simple_loss=0.346, pruned_loss=0.1116, over 21866.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3132, pruned_loss=0.08438, over 4294943.13 frames. ], batch size: 107, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:36:13,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1143702.0, ans=0.0 2023-06-22 04:36:25,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.58 vs. limit=10.0 2023-06-22 04:36:47,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1143762.0, ans=0.125 2023-06-22 04:37:31,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-22 04:37:44,832 INFO [train.py:996] (3/4) Epoch 7, batch 7700, loss[loss=0.3061, simple_loss=0.3727, pruned_loss=0.1197, over 21493.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3185, pruned_loss=0.08807, over 4289931.07 frames. ], batch size: 131, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:38:35,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-22 04:39:18,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144242.0, ans=0.1 2023-06-22 04:39:20,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 3.101e+02 3.626e+02 4.244e+02 7.117e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-22 04:39:33,823 INFO [train.py:996] (3/4) Epoch 7, batch 7750, loss[loss=0.2512, simple_loss=0.3291, pruned_loss=0.08663, over 21262.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3246, pruned_loss=0.08853, over 4285162.34 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:39:50,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1144302.0, ans=0.1 2023-06-22 04:40:14,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-22 04:41:20,053 INFO [train.py:996] (3/4) Epoch 7, batch 7800, loss[loss=0.2279, simple_loss=0.2901, pruned_loss=0.08283, over 21466.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3265, pruned_loss=0.08959, over 4284580.94 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:41:52,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1144722.0, ans=0.125 2023-06-22 04:41:56,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. 
limit=15.0 2023-06-22 04:42:02,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1144722.0, ans=0.05 2023-06-22 04:42:37,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.647e+02 3.499e+02 4.174e+02 5.699e+02 9.171e+02, threshold=8.349e+02, percent-clipped=6.0 2023-06-22 04:42:47,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1144902.0, ans=0.1 2023-06-22 04:42:49,092 INFO [train.py:996] (3/4) Epoch 7, batch 7850, loss[loss=0.2089, simple_loss=0.2714, pruned_loss=0.07321, over 21540.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3184, pruned_loss=0.08878, over 4278367.24 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:42:55,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144902.0, ans=0.1 2023-06-22 04:43:12,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1144902.0, ans=0.2 2023-06-22 04:44:20,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.45 vs. limit=10.0 2023-06-22 04:44:24,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.14 vs. limit=22.5 2023-06-22 04:44:41,140 INFO [train.py:996] (3/4) Epoch 7, batch 7900, loss[loss=0.2393, simple_loss=0.3135, pruned_loss=0.08252, over 21255.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3139, pruned_loss=0.08784, over 4282799.56 frames. ], batch size: 548, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:44:41,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1145202.0, ans=0.2 2023-06-22 04:44:50,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1145202.0, ans=0.125 2023-06-22 04:45:31,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1145382.0, ans=0.125 2023-06-22 04:45:43,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1145382.0, ans=0.125 2023-06-22 04:45:43,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1145382.0, ans=0.125 2023-06-22 04:46:16,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1145442.0, ans=0.0 2023-06-22 04:46:17,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.136e+02 3.570e+02 4.500e+02 9.857e+02, threshold=7.139e+02, percent-clipped=1.0 2023-06-22 04:46:23,931 INFO [train.py:996] (3/4) Epoch 7, batch 7950, loss[loss=0.2221, simple_loss=0.3526, pruned_loss=0.04576, over 19771.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3179, pruned_loss=0.08601, over 4268247.13 frames. ], batch size: 702, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:48:05,912 INFO [train.py:996] (3/4) Epoch 7, batch 8000, loss[loss=0.2926, simple_loss=0.3552, pruned_loss=0.115, over 21273.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3222, pruned_loss=0.08827, over 4266356.46 frames. 
], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:48:13,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1145802.0, ans=0.125 2023-06-22 04:48:16,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1145802.0, ans=0.5 2023-06-22 04:48:44,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-22 04:49:43,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.352e+02 3.873e+02 5.175e+02 9.395e+02, threshold=7.746e+02, percent-clipped=4.0 2023-06-22 04:49:50,528 INFO [train.py:996] (3/4) Epoch 7, batch 8050, loss[loss=0.1891, simple_loss=0.2425, pruned_loss=0.06785, over 21722.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3235, pruned_loss=0.08825, over 4259297.90 frames. ], batch size: 124, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:51:10,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1146282.0, ans=0.0 2023-06-22 04:51:13,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1146342.0, ans=0.125 2023-06-22 04:51:15,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146342.0, ans=0.1 2023-06-22 04:51:31,141 INFO [train.py:996] (3/4) Epoch 7, batch 8100, loss[loss=0.2512, simple_loss=0.3167, pruned_loss=0.09286, over 21530.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3236, pruned_loss=0.08912, over 4270434.33 frames. ], batch size: 131, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:51:35,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-22 04:52:05,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-22 04:52:20,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1146462.0, ans=0.125 2023-06-22 04:52:21,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-22 04:52:38,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. 
limit=15.0 2023-06-22 04:52:50,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1146582.0, ans=0.2 2023-06-22 04:52:55,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1146582.0, ans=0.125 2023-06-22 04:53:01,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146642.0, ans=0.1 2023-06-22 04:53:20,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.325e+02 3.912e+02 5.287e+02 8.623e+02, threshold=7.823e+02, percent-clipped=4.0 2023-06-22 04:53:29,623 INFO [train.py:996] (3/4) Epoch 7, batch 8150, loss[loss=0.2858, simple_loss=0.3829, pruned_loss=0.09434, over 21633.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3264, pruned_loss=0.08911, over 4268749.64 frames. ], batch size: 414, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:53:31,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146702.0, ans=0.1 2023-06-22 04:53:42,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1146702.0, ans=0.125 2023-06-22 04:53:44,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-22 04:53:46,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146702.0, ans=0.125 2023-06-22 04:53:50,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-06-22 04:53:52,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1146762.0, ans=0.125 2023-06-22 04:53:55,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1146762.0, ans=0.5 2023-06-22 04:55:08,824 INFO [train.py:996] (3/4) Epoch 7, batch 8200, loss[loss=0.2475, simple_loss=0.3071, pruned_loss=0.09396, over 21887.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3194, pruned_loss=0.08653, over 4259657.11 frames. ], batch size: 373, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:55:22,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1147002.0, ans=0.1 2023-06-22 04:55:47,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1147122.0, ans=0.0 2023-06-22 04:56:06,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-22 04:56:38,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.942e+02 3.673e+02 4.818e+02 8.671e+02, threshold=7.346e+02, percent-clipped=2.0 2023-06-22 04:56:48,472 INFO [train.py:996] (3/4) Epoch 7, batch 8250, loss[loss=0.2533, simple_loss=0.3232, pruned_loss=0.09165, over 21238.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3202, pruned_loss=0.08773, over 4259811.28 frames. 
], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:56:50,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1147302.0, ans=0.2 2023-06-22 04:57:44,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1147482.0, ans=0.05 2023-06-22 04:57:52,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147482.0, ans=0.1 2023-06-22 04:58:28,866 INFO [train.py:996] (3/4) Epoch 7, batch 8300, loss[loss=0.2116, simple_loss=0.2801, pruned_loss=0.0716, over 21756.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3191, pruned_loss=0.08478, over 4265484.07 frames. ], batch size: 124, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:59:03,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1147662.0, ans=0.0 2023-06-22 04:59:07,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-22 04:59:13,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1147722.0, ans=0.0 2023-06-22 04:59:16,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1147722.0, ans=0.1 2023-06-22 04:59:28,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-22 05:00:04,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.891e+02 3.462e+02 4.321e+02 7.253e+02, threshold=6.923e+02, percent-clipped=0.0 2023-06-22 05:00:14,262 INFO [train.py:996] (3/4) Epoch 7, batch 8350, loss[loss=0.2681, simple_loss=0.3476, pruned_loss=0.09429, over 21713.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3191, pruned_loss=0.08331, over 4274570.44 frames. ], batch size: 351, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 05:00:43,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1147962.0, ans=0.125 2023-06-22 05:01:00,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. limit=10.0 2023-06-22 05:01:49,554 INFO [train.py:996] (3/4) Epoch 7, batch 8400, loss[loss=0.2183, simple_loss=0.3147, pruned_loss=0.06092, over 21442.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.316, pruned_loss=0.07984, over 4273314.51 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:01:51,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1148202.0, ans=0.0 2023-06-22 05:02:29,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1148322.0, ans=0.0 2023-06-22 05:03:23,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.841e+02 3.519e+02 4.158e+02 9.923e+02, threshold=7.039e+02, percent-clipped=4.0 2023-06-22 05:03:28,528 INFO [train.py:996] (3/4) Epoch 7, batch 8450, loss[loss=0.2184, simple_loss=0.2891, pruned_loss=0.07382, over 21808.00 frames. 
], tot_loss[loss=0.2359, simple_loss=0.314, pruned_loss=0.07886, over 4275935.09 frames. ], batch size: 371, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:03:43,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-22 05:03:46,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-22 05:03:52,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1148562.0, ans=0.125 2023-06-22 05:03:55,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.49 vs. limit=6.0 2023-06-22 05:04:37,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-22 05:05:07,090 INFO [train.py:996] (3/4) Epoch 7, batch 8500, loss[loss=0.2359, simple_loss=0.2892, pruned_loss=0.09135, over 21402.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3103, pruned_loss=0.08058, over 4280212.77 frames. ], batch size: 211, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:05:25,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-22 05:05:48,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-22 05:05:50,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1148922.0, ans=0.0 2023-06-22 05:05:58,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.13 vs. limit=10.0 2023-06-22 05:06:16,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1148982.0, ans=0.07 2023-06-22 05:06:43,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.101e+02 3.750e+02 4.685e+02 7.391e+02, threshold=7.500e+02, percent-clipped=2.0 2023-06-22 05:06:46,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149102.0, ans=0.1 2023-06-22 05:06:47,859 INFO [train.py:996] (3/4) Epoch 7, batch 8550, loss[loss=0.2572, simple_loss=0.3253, pruned_loss=0.09456, over 21625.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.315, pruned_loss=0.08395, over 4277532.89 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:06:59,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2023-06-22 05:07:37,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1149222.0, ans=0.125 2023-06-22 05:08:13,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149342.0, ans=0.125 2023-06-22 05:08:19,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. 
limit=15.0 2023-06-22 05:08:29,780 INFO [train.py:996] (3/4) Epoch 7, batch 8600, loss[loss=0.2868, simple_loss=0.3509, pruned_loss=0.1113, over 21790.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3226, pruned_loss=0.08642, over 4283752.38 frames. ], batch size: 118, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:08:37,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1149402.0, ans=0.125 2023-06-22 05:09:38,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149582.0, ans=0.1 2023-06-22 05:10:06,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.158e+02 3.954e+02 4.821e+02 7.985e+02, threshold=7.909e+02, percent-clipped=1.0 2023-06-22 05:10:11,784 INFO [train.py:996] (3/4) Epoch 7, batch 8650, loss[loss=0.1899, simple_loss=0.2973, pruned_loss=0.0412, over 21793.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3272, pruned_loss=0.08633, over 4274826.02 frames. ], batch size: 332, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:10:29,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1149762.0, ans=0.05 2023-06-22 05:10:48,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1149822.0, ans=0.125 2023-06-22 05:11:31,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 05:11:32,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1149942.0, ans=0.125 2023-06-22 05:11:38,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149942.0, ans=0.1 2023-06-22 05:11:46,153 INFO [train.py:996] (3/4) Epoch 7, batch 8700, loss[loss=0.199, simple_loss=0.2671, pruned_loss=0.06544, over 21616.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3202, pruned_loss=0.08275, over 4278522.89 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:12:03,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1150062.0, ans=0.0 2023-06-22 05:12:59,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1150182.0, ans=0.125 2023-06-22 05:12:59,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1150182.0, ans=0.2 2023-06-22 05:13:21,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.826e+02 3.648e+02 4.622e+02 7.671e+02, threshold=7.296e+02, percent-clipped=0.0 2023-06-22 05:13:24,953 INFO [train.py:996] (3/4) Epoch 7, batch 8750, loss[loss=0.2362, simple_loss=0.3668, pruned_loss=0.05277, over 19903.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3173, pruned_loss=0.0826, over 4269634.24 frames. 
], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:13:26,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1150302.0, ans=0.125 2023-06-22 05:13:33,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1150302.0, ans=0.125 2023-06-22 05:13:37,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1150302.0, ans=0.125 2023-06-22 05:13:47,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-22 05:13:50,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150362.0, ans=0.1 2023-06-22 05:13:50,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150362.0, ans=0.1 2023-06-22 05:13:50,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1150362.0, ans=0.125 2023-06-22 05:14:42,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150482.0, ans=0.1 2023-06-22 05:14:49,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1150542.0, ans=0.07 2023-06-22 05:15:06,508 INFO [train.py:996] (3/4) Epoch 7, batch 8800, loss[loss=0.3028, simple_loss=0.4149, pruned_loss=0.09534, over 21244.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3259, pruned_loss=0.08601, over 4274180.84 frames. ], batch size: 548, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:15:15,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150602.0, ans=0.1 2023-06-22 05:15:41,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-22 05:15:50,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1150662.0, ans=0.125 2023-06-22 05:16:09,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1150722.0, ans=0.2 2023-06-22 05:16:09,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1150722.0, ans=0.2 2023-06-22 05:16:12,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150722.0, ans=0.1 2023-06-22 05:16:33,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1150842.0, ans=0.0 2023-06-22 05:16:34,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1150842.0, ans=0.125 2023-06-22 05:16:45,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.569e+02 4.667e+02 6.070e+02 1.023e+03, threshold=9.335e+02, percent-clipped=11.0 2023-06-22 05:16:47,122 INFO [train.py:996] (3/4) Epoch 7, batch 8850, loss[loss=0.2422, simple_loss=0.3354, pruned_loss=0.07452, over 21629.00 frames. 
], tot_loss[loss=0.2553, simple_loss=0.3318, pruned_loss=0.08937, over 4272650.19 frames. ], batch size: 414, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:17:07,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1150962.0, ans=10.0 2023-06-22 05:17:26,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1150962.0, ans=0.5 2023-06-22 05:17:37,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1151022.0, ans=0.125 2023-06-22 05:17:49,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1151022.0, ans=0.125 2023-06-22 05:17:53,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1151082.0, ans=0.125 2023-06-22 05:18:00,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1151082.0, ans=0.125 2023-06-22 05:18:33,240 INFO [train.py:996] (3/4) Epoch 7, batch 8900, loss[loss=0.283, simple_loss=0.3514, pruned_loss=0.1073, over 21402.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.326, pruned_loss=0.0881, over 4266809.31 frames. ], batch size: 507, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:19:00,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1151262.0, ans=0.0 2023-06-22 05:19:24,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1151322.0, ans=0.2 2023-06-22 05:20:05,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1151442.0, ans=0.5 2023-06-22 05:20:19,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.272e+02 4.158e+02 4.869e+02 9.581e+02, threshold=8.315e+02, percent-clipped=1.0 2023-06-22 05:20:19,307 INFO [train.py:996] (3/4) Epoch 7, batch 8950, loss[loss=0.3069, simple_loss=0.3805, pruned_loss=0.1166, over 21654.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3251, pruned_loss=0.08685, over 4265004.89 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:20:31,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1151502.0, ans=0.0 2023-06-22 05:20:38,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1151502.0, ans=0.0 2023-06-22 05:20:44,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1151562.0, ans=0.125 2023-06-22 05:21:58,735 INFO [train.py:996] (3/4) Epoch 7, batch 9000, loss[loss=0.2124, simple_loss=0.2776, pruned_loss=0.07359, over 21645.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3205, pruned_loss=0.08623, over 4261050.94 frames. ], batch size: 332, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:21:58,736 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 05:22:20,472 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2667, simple_loss=0.3612, pruned_loss=0.08614, over 1796401.00 frames. 
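The train.py:1019/1028/1029 entries just above report a full pass over the dev set: each validation batch contributes summed losses plus a frame count, and the printed figures are frame-weighted averages, with "over 1796401.00 frames" the shared denominator. A minimal sketch of that bookkeeping, written from the log format rather than from the icefall sources:

```python
# Frame-weighted averaging of per-batch loss sums, as suggested by the
# "validation: loss=..., over N frames" lines. A sketch, not icefall's code.
from typing import Dict, Iterable, Tuple

def average_validation_loss(
    batches: Iterable[Tuple[Dict[str, float], float]]
) -> Dict[str, float]:
    """batches yields ({loss_name: loss summed over the batch}, num_frames)."""
    totals: Dict[str, float] = {}
    frames = 0.0
    for loss_sums, num_frames in batches:
        frames += num_frames
        for name, value in loss_sums.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: value / frames for name, value in totals.items()}

# Two hypothetical dev batches of 1000 frames each:
dev = [({"loss": 260.0, "simple_loss": 355.0}, 1000.0),
       ({"loss": 270.0, "simple_loss": 370.0}, 1000.0)]
print(average_validation_loss(dev))  # {'loss': 0.265, 'simple_loss': 0.3625}
```

The tot_loss[...] figures in the batch summaries appear to be the same kind of frame-weighted average taken over a recent window of training batches, which is why their frame counts hover near 4.27e6 rather than growing through the epoch.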
2023-06-22 05:22:20,473 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 05:22:36,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1151862.0, ans=0.2 2023-06-22 05:22:54,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1151922.0, ans=0.125 2023-06-22 05:23:12,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1151982.0, ans=0.0 2023-06-22 05:23:57,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.838e+02 3.460e+02 4.383e+02 1.064e+03, threshold=6.920e+02, percent-clipped=1.0 2023-06-22 05:23:57,258 INFO [train.py:996] (3/4) Epoch 7, batch 9050, loss[loss=0.2568, simple_loss=0.3232, pruned_loss=0.09518, over 21391.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3147, pruned_loss=0.08347, over 4269387.49 frames. ], batch size: 211, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:24:25,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1152162.0, ans=0.0 2023-06-22 05:25:29,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1152342.0, ans=0.125 2023-06-22 05:25:29,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1152342.0, ans=0.0 2023-06-22 05:25:38,178 INFO [train.py:996] (3/4) Epoch 7, batch 9100, loss[loss=0.2031, simple_loss=0.2615, pruned_loss=0.07234, over 19946.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3181, pruned_loss=0.08449, over 4270085.38 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:25:43,612 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:25:51,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152402.0, ans=0.1 2023-06-22 05:26:23,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1152522.0, ans=0.125 2023-06-22 05:27:18,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.226e+02 3.929e+02 4.790e+02 9.193e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-22 05:27:18,325 INFO [train.py:996] (3/4) Epoch 7, batch 9150, loss[loss=0.1984, simple_loss=0.3224, pruned_loss=0.03718, over 20688.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.321, pruned_loss=0.08195, over 4268962.26 frames. ], batch size: 607, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:27:29,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1152702.0, ans=0.125 2023-06-22 05:27:32,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1152762.0, ans=0.125 2023-06-22 05:28:09,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5 2023-06-22 05:28:38,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. 
limit=15.0 2023-06-22 05:28:48,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1152942.0, ans=0.125 2023-06-22 05:28:57,906 INFO [train.py:996] (3/4) Epoch 7, batch 9200, loss[loss=0.2281, simple_loss=0.3183, pruned_loss=0.06895, over 21606.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3236, pruned_loss=0.08059, over 4264572.25 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:29:47,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1153122.0, ans=0.125 2023-06-22 05:30:38,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.168e+02 3.950e+02 4.666e+02 8.453e+02, threshold=7.900e+02, percent-clipped=2.0 2023-06-22 05:30:38,549 INFO [train.py:996] (3/4) Epoch 7, batch 9250, loss[loss=0.2367, simple_loss=0.297, pruned_loss=0.08826, over 21456.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3264, pruned_loss=0.08451, over 4271230.93 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:30:58,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1153362.0, ans=0.125 2023-06-22 05:31:22,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1153422.0, ans=0.0 2023-06-22 05:32:01,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1153542.0, ans=0.125 2023-06-22 05:32:09,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1153542.0, ans=0.0 2023-06-22 05:32:23,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1153602.0, ans=0.2 2023-06-22 05:32:24,608 INFO [train.py:996] (3/4) Epoch 7, batch 9300, loss[loss=0.2271, simple_loss=0.287, pruned_loss=0.08366, over 21802.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3202, pruned_loss=0.08499, over 4261737.87 frames. ], batch size: 118, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:32:53,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153662.0, ans=0.1 2023-06-22 05:33:13,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153722.0, ans=0.1 2023-06-22 05:33:39,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-22 05:34:11,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.246e+02 3.753e+02 4.890e+02 7.964e+02, threshold=7.506e+02, percent-clipped=1.0 2023-06-22 05:34:11,609 INFO [train.py:996] (3/4) Epoch 7, batch 9350, loss[loss=0.2465, simple_loss=0.3274, pruned_loss=0.0828, over 21823.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3257, pruned_loss=0.08578, over 4256994.69 frames. 
], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:34:26,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1153962.0, ans=0.125 2023-06-22 05:35:10,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-22 05:35:10,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1154082.0, ans=0.0 2023-06-22 05:35:28,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1154082.0, ans=0.0 2023-06-22 05:35:52,119 INFO [train.py:996] (3/4) Epoch 7, batch 9400, loss[loss=0.1881, simple_loss=0.2602, pruned_loss=0.05806, over 21405.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3276, pruned_loss=0.08697, over 4263223.09 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:36:42,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1154322.0, ans=0.0 2023-06-22 05:37:19,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-22 05:37:32,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.206e+02 3.638e+02 4.538e+02 8.694e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 05:37:32,785 INFO [train.py:996] (3/4) Epoch 7, batch 9450, loss[loss=0.2113, simple_loss=0.2785, pruned_loss=0.07205, over 21639.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3208, pruned_loss=0.08639, over 4266119.16 frames. ], batch size: 333, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:38:10,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1154562.0, ans=0.0 2023-06-22 05:38:18,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1154622.0, ans=0.07 2023-06-22 05:38:50,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154742.0, ans=0.1 2023-06-22 05:39:11,629 INFO [train.py:996] (3/4) Epoch 7, batch 9500, loss[loss=0.2267, simple_loss=0.3016, pruned_loss=0.07594, over 21396.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3132, pruned_loss=0.08488, over 4257153.20 frames. ], batch size: 194, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:39:12,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1154802.0, ans=0.2 2023-06-22 05:40:10,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1154982.0, ans=0.035 2023-06-22 05:40:25,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1154982.0, ans=0.2 2023-06-22 05:40:52,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.257e+02 3.921e+02 4.943e+02 1.018e+03, threshold=7.842e+02, percent-clipped=7.0 2023-06-22 05:40:52,028 INFO [train.py:996] (3/4) Epoch 7, batch 9550, loss[loss=0.2617, simple_loss=0.3409, pruned_loss=0.09126, over 21759.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3185, pruned_loss=0.0869, over 4258655.99 frames. 
], batch size: 124, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:42:09,600 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:42:26,563 INFO [train.py:996] (3/4) Epoch 7, batch 9600, loss[loss=0.2058, simple_loss=0.2871, pruned_loss=0.06229, over 21830.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3217, pruned_loss=0.08802, over 4265840.20 frames. ], batch size: 298, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:43:00,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1155462.0, ans=0.125 2023-06-22 05:43:01,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1155462.0, ans=0.125 2023-06-22 05:43:36,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1155582.0, ans=0.04949747468305833 2023-06-22 05:44:06,658 INFO [train.py:996] (3/4) Epoch 7, batch 9650, loss[loss=0.2642, simple_loss=0.3307, pruned_loss=0.09887, over 21925.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3205, pruned_loss=0.08628, over 4271547.01 frames. ], batch size: 316, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:44:08,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.176e+02 3.740e+02 4.596e+02 7.915e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-22 05:45:07,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1155822.0, ans=0.125 2023-06-22 05:45:12,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1155882.0, ans=0.04949747468305833 2023-06-22 05:45:13,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1155882.0, ans=0.125 2023-06-22 05:45:51,571 INFO [train.py:996] (3/4) Epoch 7, batch 9700, loss[loss=0.2298, simple_loss=0.3057, pruned_loss=0.07694, over 21508.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3251, pruned_loss=0.08705, over 4271683.30 frames. ], batch size: 548, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:46:17,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1156062.0, ans=0.5 2023-06-22 05:46:21,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1156062.0, ans=0.125 2023-06-22 05:46:45,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1156122.0, ans=0.5 2023-06-22 05:47:01,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156182.0, ans=0.1 2023-06-22 05:47:05,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1156182.0, ans=0.125 2023-06-22 05:47:35,168 INFO [train.py:996] (3/4) Epoch 7, batch 9750, loss[loss=0.275, simple_loss=0.3139, pruned_loss=0.1181, over 21430.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3189, pruned_loss=0.08609, over 4275373.48 frames. 
], batch size: 509, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:47:36,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.485e+02 3.111e+02 3.618e+02 4.143e+02 7.836e+02, threshold=7.236e+02, percent-clipped=1.0 2023-06-22 05:47:37,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0 2023-06-22 05:47:44,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156302.0, ans=0.1 2023-06-22 05:47:47,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1156302.0, ans=0.125 2023-06-22 05:47:58,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1156362.0, ans=0.125 2023-06-22 05:48:08,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1156422.0, ans=0.04949747468305833 2023-06-22 05:48:27,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1156482.0, ans=0.0 2023-06-22 05:48:51,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1156542.0, ans=0.125 2023-06-22 05:48:58,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-22 05:49:08,228 INFO [train.py:996] (3/4) Epoch 7, batch 9800, loss[loss=0.2337, simple_loss=0.3006, pruned_loss=0.08345, over 21807.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3177, pruned_loss=0.08604, over 4277036.66 frames. ], batch size: 414, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:49:23,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1156602.0, ans=0.05 2023-06-22 05:49:45,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1156722.0, ans=0.0 2023-06-22 05:50:41,991 INFO [train.py:996] (3/4) Epoch 7, batch 9850, loss[loss=0.2324, simple_loss=0.3069, pruned_loss=0.07893, over 16874.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3146, pruned_loss=0.086, over 4278437.73 frames. ], batch size: 62, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:43,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.145e+02 3.713e+02 4.993e+02 9.640e+02, threshold=7.425e+02, percent-clipped=7.0 2023-06-22 05:50:50,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. 
limit=6.0 2023-06-22 05:50:59,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1156902.0, ans=0.125 2023-06-22 05:51:07,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1156962.0, ans=0.0 2023-06-22 05:51:13,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1156962.0, ans=0.125 2023-06-22 05:51:22,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1157022.0, ans=0.0 2023-06-22 05:51:34,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1157022.0, ans=0.125 2023-06-22 05:51:55,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1157082.0, ans=0.2 2023-06-22 05:52:19,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 05:52:21,248 INFO [train.py:996] (3/4) Epoch 7, batch 9900, loss[loss=0.2572, simple_loss=0.3298, pruned_loss=0.09231, over 21468.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3123, pruned_loss=0.08608, over 4267692.07 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:52:45,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2023-06-22 05:53:02,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1157322.0, ans=0.125 2023-06-22 05:54:06,747 INFO [train.py:996] (3/4) Epoch 7, batch 9950, loss[loss=0.2347, simple_loss=0.2892, pruned_loss=0.09013, over 21256.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3143, pruned_loss=0.08856, over 4263834.16 frames. ], batch size: 471, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:54:08,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.677e+02 3.173e+02 3.693e+02 4.396e+02 6.940e+02, threshold=7.386e+02, percent-clipped=0.0 2023-06-22 05:54:16,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1157502.0, ans=0.125 2023-06-22 05:54:27,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1157562.0, ans=0.0 2023-06-22 05:54:30,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1157562.0, ans=0.0 2023-06-22 05:54:57,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1157622.0, ans=0.125 2023-06-22 05:55:54,204 INFO [train.py:996] (3/4) Epoch 7, batch 10000, loss[loss=0.1841, simple_loss=0.2544, pruned_loss=0.05688, over 21550.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.311, pruned_loss=0.08767, over 4264586.88 frames. ], batch size: 230, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:56:08,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-06-22 05:56:43,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1157922.0, ans=0.0 2023-06-22 05:56:51,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1157982.0, ans=0.125 2023-06-22 05:56:51,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1157982.0, ans=0.2 2023-06-22 05:57:35,224 INFO [train.py:996] (3/4) Epoch 7, batch 10050, loss[loss=0.2556, simple_loss=0.3191, pruned_loss=0.09603, over 20742.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3121, pruned_loss=0.08713, over 4266122.99 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:57:35,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1158102.0, ans=0.0 2023-06-22 05:57:36,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.023e+02 3.455e+02 4.258e+02 6.801e+02, threshold=6.910e+02, percent-clipped=0.0 2023-06-22 05:57:56,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1158162.0, ans=0.1 2023-06-22 05:59:05,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1158342.0, ans=0.0 2023-06-22 05:59:15,532 INFO [train.py:996] (3/4) Epoch 7, batch 10100, loss[loss=0.2786, simple_loss=0.3324, pruned_loss=0.1124, over 21350.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3082, pruned_loss=0.08449, over 4274395.42 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:59:39,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1158462.0, ans=0.1 2023-06-22 06:00:24,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 06:00:47,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1158642.0, ans=0.125 2023-06-22 06:00:49,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1158642.0, ans=0.125 2023-06-22 06:00:49,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1158642.0, ans=0.125 2023-06-22 06:00:55,639 INFO [train.py:996] (3/4) Epoch 7, batch 10150, loss[loss=0.2211, simple_loss=0.2869, pruned_loss=0.07767, over 21779.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3156, pruned_loss=0.08755, over 4279607.72 frames. ], batch size: 102, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:00:58,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.244e+02 3.861e+02 4.882e+02 7.298e+02, threshold=7.722e+02, percent-clipped=2.0 2023-06-22 06:01:35,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1158762.0, ans=0.0 2023-06-22 06:01:36,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. 
limit=15.0 2023-06-22 06:02:02,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1158882.0, ans=0.125 2023-06-22 06:02:28,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1158942.0, ans=0.125 2023-06-22 06:02:29,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1158942.0, ans=0.125 2023-06-22 06:02:35,504 INFO [train.py:996] (3/4) Epoch 7, batch 10200, loss[loss=0.2291, simple_loss=0.3193, pruned_loss=0.06943, over 21704.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3138, pruned_loss=0.08508, over 4277740.13 frames. ], batch size: 332, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:02:52,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1159002.0, ans=0.125 2023-06-22 06:03:04,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1159062.0, ans=0.125 2023-06-22 06:03:16,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1159122.0, ans=0.0 2023-06-22 06:03:20,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1159122.0, ans=0.2 2023-06-22 06:04:14,094 INFO [train.py:996] (3/4) Epoch 7, batch 10250, loss[loss=0.15, simple_loss=0.2219, pruned_loss=0.03909, over 21145.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3073, pruned_loss=0.07881, over 4273285.03 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:04:17,127 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.655e+02 3.086e+02 4.104e+02 7.872e+02, threshold=6.172e+02, percent-clipped=2.0 2023-06-22 06:04:37,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1159362.0, ans=0.05 2023-06-22 06:05:11,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1159422.0, ans=0.0 2023-06-22 06:06:01,463 INFO [train.py:996] (3/4) Epoch 7, batch 10300, loss[loss=0.197, simple_loss=0.249, pruned_loss=0.07252, over 20917.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3102, pruned_loss=0.0799, over 4267550.44 frames. ], batch size: 608, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:06:12,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1159602.0, ans=0.0 2023-06-22 06:07:17,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1159782.0, ans=0.125 2023-06-22 06:07:22,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1159842.0, ans=0.2 2023-06-22 06:07:34,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1159842.0, ans=0.125 2023-06-22 06:07:44,238 INFO [train.py:996] (3/4) Epoch 7, batch 10350, loss[loss=0.2222, simple_loss=0.3026, pruned_loss=0.07094, over 21698.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3128, pruned_loss=0.08051, over 4269280.15 frames. 
], batch size: 351, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:07:44,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159902.0, ans=0.1 2023-06-22 06:07:47,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.368e+02 3.957e+02 4.921e+02 8.307e+02, threshold=7.914e+02, percent-clipped=7.0 2023-06-22 06:08:02,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1159902.0, ans=0.0 2023-06-22 06:08:51,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1160082.0, ans=0.125 2023-06-22 06:09:31,075 INFO [train.py:996] (3/4) Epoch 7, batch 10400, loss[loss=0.1933, simple_loss=0.2709, pruned_loss=0.05786, over 21765.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3047, pruned_loss=0.07845, over 4267995.64 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:09:33,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1160202.0, ans=0.1 2023-06-22 06:09:34,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1160202.0, ans=0.0 2023-06-22 06:10:01,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1160262.0, ans=0.0 2023-06-22 06:10:11,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1160262.0, ans=0.07 2023-06-22 06:10:35,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1160382.0, ans=0.2 2023-06-22 06:10:44,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1160382.0, ans=0.125 2023-06-22 06:10:51,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1160442.0, ans=0.125 2023-06-22 06:11:13,824 INFO [train.py:996] (3/4) Epoch 7, batch 10450, loss[loss=0.2355, simple_loss=0.3087, pruned_loss=0.08114, over 21422.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.31, pruned_loss=0.08069, over 4261510.75 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:11:16,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.616e+02 3.263e+02 3.725e+02 4.769e+02 8.321e+02, threshold=7.450e+02, percent-clipped=2.0 2023-06-22 06:12:17,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1160682.0, ans=0.125 2023-06-22 06:12:36,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1160742.0, ans=0.125 2023-06-22 06:12:58,239 INFO [train.py:996] (3/4) Epoch 7, batch 10500, loss[loss=0.239, simple_loss=0.2989, pruned_loss=0.08958, over 21632.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.312, pruned_loss=0.08075, over 4260012.43 frames. 
], batch size: 247, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:13:20,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1160862.0, ans=0.1 2023-06-22 06:13:24,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-22 06:13:36,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. limit=6.0 2023-06-22 06:13:48,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-22 06:14:27,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1161042.0, ans=0.125 2023-06-22 06:14:37,814 INFO [train.py:996] (3/4) Epoch 7, batch 10550, loss[loss=0.1952, simple_loss=0.2549, pruned_loss=0.06773, over 21321.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.307, pruned_loss=0.0808, over 4260349.65 frames. ], batch size: 551, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:14:40,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.300e+02 2.934e+02 3.554e+02 4.294e+02 7.411e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-22 06:14:44,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 06:14:46,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1161102.0, ans=0.025 2023-06-22 06:14:49,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1161102.0, ans=0.125 2023-06-22 06:14:55,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1161102.0, ans=0.125 2023-06-22 06:15:02,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-22 06:15:03,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1161162.0, ans=0.125 2023-06-22 06:15:09,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-22 06:15:10,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1161162.0, ans=0.0 2023-06-22 06:15:56,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1161342.0, ans=10.0 2023-06-22 06:16:16,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1161342.0, ans=0.2 2023-06-22 06:16:18,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1161402.0, ans=0.125 2023-06-22 06:16:19,505 INFO [train.py:996] (3/4) Epoch 7, batch 10600, loss[loss=0.2141, simple_loss=0.2705, pruned_loss=0.07887, over 15214.00 frames. 
], tot_loss[loss=0.2309, simple_loss=0.3026, pruned_loss=0.07964, over 4248102.24 frames. ], batch size: 62, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:16:21,637 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:17:01,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1161522.0, ans=0.125 2023-06-22 06:17:26,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1161582.0, ans=0.0 2023-06-22 06:17:34,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1161582.0, ans=0.5 2023-06-22 06:17:48,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1161642.0, ans=0.0 2023-06-22 06:18:06,263 INFO [train.py:996] (3/4) Epoch 7, batch 10650, loss[loss=0.1787, simple_loss=0.2552, pruned_loss=0.05113, over 21308.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3016, pruned_loss=0.07701, over 4248417.94 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:18:11,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.046e+02 3.763e+02 4.720e+02 8.386e+02, threshold=7.526e+02, percent-clipped=4.0 2023-06-22 06:18:13,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161702.0, ans=0.125 2023-06-22 06:18:17,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1161702.0, ans=0.125 2023-06-22 06:18:59,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-22 06:19:07,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-22 06:19:47,762 INFO [train.py:996] (3/4) Epoch 7, batch 10700, loss[loss=0.1793, simple_loss=0.2487, pruned_loss=0.05496, over 21552.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3005, pruned_loss=0.07684, over 4258700.26 frames. ], batch size: 230, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:19:54,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1162002.0, ans=0.125 2023-06-22 06:20:00,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-22 06:20:13,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1162062.0, ans=0.125 2023-06-22 06:20:53,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1162182.0, ans=0.125 2023-06-22 06:20:56,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162182.0, ans=0.1 2023-06-22 06:20:56,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-22 06:21:16,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1162242.0, ans=0.2 2023-06-22 06:21:30,721 INFO [train.py:996] (3/4) Epoch 7, batch 10750, loss[loss=0.2857, simple_loss=0.3843, pruned_loss=0.09351, over 21322.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3119, pruned_loss=0.08162, over 4262500.91 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 8.0 2023-06-22 06:21:31,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1162302.0, ans=0.125 2023-06-22 06:21:34,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1162302.0, ans=0.125 2023-06-22 06:21:42,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 3.648e+02 4.416e+02 6.142e+02 1.061e+03, threshold=8.833e+02, percent-clipped=11.0 2023-06-22 06:21:44,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1162302.0, ans=0.0 2023-06-22 06:21:44,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1162302.0, ans=0.125 2023-06-22 06:22:19,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1162422.0, ans=0.125 2023-06-22 06:22:25,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-06-22 06:22:46,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1162482.0, ans=0.125 2023-06-22 06:23:17,162 INFO [train.py:996] (3/4) Epoch 7, batch 10800, loss[loss=0.209, simple_loss=0.2957, pruned_loss=0.0612, over 20720.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3181, pruned_loss=0.08363, over 4269291.96 frames. ], batch size: 607, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:23:20,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1162602.0, ans=0.2 2023-06-22 06:23:31,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-22 06:23:53,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1162722.0, ans=0.1 2023-06-22 06:24:03,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=15.0 2023-06-22 06:24:19,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-22 06:24:31,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1162782.0, ans=22.5 2023-06-22 06:24:56,602 INFO [train.py:996] (3/4) Epoch 7, batch 10850, loss[loss=0.2036, simple_loss=0.2718, pruned_loss=0.0677, over 21333.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3211, pruned_loss=0.08439, over 4261575.62 frames. 
], batch size: 211, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:25:07,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-22 06:25:07,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.182e+02 4.104e+02 5.003e+02 8.249e+02, threshold=8.208e+02, percent-clipped=0.0 2023-06-22 06:26:01,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1163082.0, ans=0.125 2023-06-22 06:26:41,640 INFO [train.py:996] (3/4) Epoch 7, batch 10900, loss[loss=0.2361, simple_loss=0.285, pruned_loss=0.09354, over 21414.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3135, pruned_loss=0.08273, over 4269798.65 frames. ], batch size: 475, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:27:03,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1163262.0, ans=0.125 2023-06-22 06:27:04,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163262.0, ans=0.1 2023-06-22 06:27:25,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1163322.0, ans=0.125 2023-06-22 06:27:28,530 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:27:44,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-22 06:28:16,262 INFO [train.py:996] (3/4) Epoch 7, batch 10950, loss[loss=0.2275, simple_loss=0.2901, pruned_loss=0.08246, over 21205.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3095, pruned_loss=0.0807, over 4268568.18 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:28:27,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 3.282e+02 3.918e+02 4.735e+02 6.803e+02, threshold=7.835e+02, percent-clipped=0.0 2023-06-22 06:28:37,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1163562.0, ans=0.0 2023-06-22 06:28:39,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1163562.0, ans=0.2 2023-06-22 06:28:56,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1163622.0, ans=0.0 2023-06-22 06:29:01,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1163622.0, ans=0.2 2023-06-22 06:29:43,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=15.0 2023-06-22 06:29:50,120 INFO [train.py:996] (3/4) Epoch 7, batch 11000, loss[loss=0.2443, simple_loss=0.3103, pruned_loss=0.08915, over 21897.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3088, pruned_loss=0.08228, over 4280427.89 frames. 
], batch size: 351, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:29:53,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1163802.0, ans=0.125 2023-06-22 06:30:08,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1163802.0, ans=0.125 2023-06-22 06:30:16,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1163862.0, ans=0.035 2023-06-22 06:30:36,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1163922.0, ans=0.125 2023-06-22 06:30:54,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-22 06:31:20,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1164042.0, ans=0.125 2023-06-22 06:31:30,316 INFO [train.py:996] (3/4) Epoch 7, batch 11050, loss[loss=0.1968, simple_loss=0.2522, pruned_loss=0.07076, over 21248.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3068, pruned_loss=0.08326, over 4273140.78 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:31:40,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.144e+02 3.658e+02 4.347e+02 7.948e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-22 06:31:41,388 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:31:50,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1164162.0, ans=0.125 2023-06-22 06:32:21,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-22 06:32:25,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1164282.0, ans=0.2 2023-06-22 06:33:02,551 INFO [train.py:996] (3/4) Epoch 7, batch 11100, loss[loss=0.2508, simple_loss=0.3059, pruned_loss=0.09783, over 21256.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3061, pruned_loss=0.08374, over 4279813.01 frames. ], batch size: 471, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:33:07,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1164402.0, ans=0.125 2023-06-22 06:34:05,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-22 06:34:10,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1164582.0, ans=15.0 2023-06-22 06:34:42,428 INFO [train.py:996] (3/4) Epoch 7, batch 11150, loss[loss=0.2152, simple_loss=0.2827, pruned_loss=0.07388, over 21609.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.304, pruned_loss=0.08372, over 4285695.17 frames. 
], batch size: 332, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:34:48,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 2.924e+02 3.298e+02 3.958e+02 6.309e+02, threshold=6.596e+02, percent-clipped=0.0 2023-06-22 06:36:13,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-22 06:36:16,921 INFO [train.py:996] (3/4) Epoch 7, batch 11200, loss[loss=0.2195, simple_loss=0.2825, pruned_loss=0.07826, over 21698.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3021, pruned_loss=0.08278, over 4290238.54 frames. ], batch size: 333, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:36:17,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1165002.0, ans=0.025 2023-06-22 06:36:22,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-22 06:36:33,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1165002.0, ans=0.1 2023-06-22 06:36:33,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1165002.0, ans=12.0 2023-06-22 06:36:47,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1165062.0, ans=0.2 2023-06-22 06:37:01,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-22 06:37:10,675 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:37:52,318 INFO [train.py:996] (3/4) Epoch 7, batch 11250, loss[loss=0.2282, simple_loss=0.3082, pruned_loss=0.07411, over 21182.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3026, pruned_loss=0.08276, over 4287346.46 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:37:58,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.901e+02 3.332e+02 3.824e+02 5.999e+02, threshold=6.664e+02, percent-clipped=0.0 2023-06-22 06:38:18,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1165362.0, ans=0.0 2023-06-22 06:38:18,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1165362.0, ans=0.125 2023-06-22 06:39:31,696 INFO [train.py:996] (3/4) Epoch 7, batch 11300, loss[loss=0.1942, simple_loss=0.2668, pruned_loss=0.06083, over 15931.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3028, pruned_loss=0.08247, over 4286824.95 frames. ], batch size: 60, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:40:33,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-22 06:41:01,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1165842.0, ans=0.2 2023-06-22 06:41:12,116 INFO [train.py:996] (3/4) Epoch 7, batch 11350, loss[loss=0.2658, simple_loss=0.3338, pruned_loss=0.09889, over 21615.00 frames. 
], tot_loss[loss=0.2339, simple_loss=0.3044, pruned_loss=0.08169, over 4285047.44 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:41:23,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.937e+02 3.595e+02 4.319e+02 9.423e+02, threshold=7.190e+02, percent-clipped=3.0 2023-06-22 06:41:44,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165962.0, ans=0.1 2023-06-22 06:42:06,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-22 06:42:14,727 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-22 06:42:59,005 INFO [train.py:996] (3/4) Epoch 7, batch 11400, loss[loss=0.2433, simple_loss=0.3062, pruned_loss=0.09019, over 21385.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3109, pruned_loss=0.0847, over 4283681.66 frames. ], batch size: 159, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:43:05,717 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:43:06,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-22 06:43:31,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1166262.0, ans=0.125 2023-06-22 06:44:40,297 INFO [train.py:996] (3/4) Epoch 7, batch 11450, loss[loss=0.2348, simple_loss=0.3077, pruned_loss=0.08093, over 21463.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08325, over 4279980.16 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:44:52,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.083e+02 3.885e+02 5.108e+02 7.985e+02, threshold=7.771e+02, percent-clipped=2.0 2023-06-22 06:45:00,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1166502.0, ans=0.125 2023-06-22 06:45:03,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:45:03,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-22 06:45:14,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1166562.0, ans=0.0 2023-06-22 06:45:19,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1166562.0, ans=0.125 2023-06-22 06:45:53,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1166682.0, ans=0.07 2023-06-22 06:46:14,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1166742.0, ans=0.125 2023-06-22 06:46:21,783 INFO [train.py:996] (3/4) Epoch 7, batch 11500, loss[loss=0.2326, simple_loss=0.3149, pruned_loss=0.07513, over 21482.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3142, pruned_loss=0.08543, over 4275729.57 frames. 
], batch size: 194, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:46:51,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1166862.0, ans=0.0 2023-06-22 06:47:51,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167042.0, ans=0.1 2023-06-22 06:47:52,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1167042.0, ans=0.125 2023-06-22 06:47:56,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.62 vs. limit=10.0 2023-06-22 06:48:14,284 INFO [train.py:996] (3/4) Epoch 7, batch 11550, loss[loss=0.2817, simple_loss=0.3877, pruned_loss=0.08786, over 21671.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3202, pruned_loss=0.08496, over 4279506.13 frames. ], batch size: 414, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:48:21,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.086e+02 3.744e+02 4.289e+02 8.491e+02, threshold=7.488e+02, percent-clipped=1.0 2023-06-22 06:49:07,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-22 06:49:14,954 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:49:47,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1167342.0, ans=0.125 2023-06-22 06:49:56,234 INFO [train.py:996] (3/4) Epoch 7, batch 11600, loss[loss=0.2426, simple_loss=0.3396, pruned_loss=0.07282, over 21571.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3339, pruned_loss=0.08642, over 4279748.71 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:50:20,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-22 06:50:54,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1167582.0, ans=0.05 2023-06-22 06:51:12,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1167582.0, ans=0.125 2023-06-22 06:51:26,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1167642.0, ans=0.125 2023-06-22 06:51:37,740 INFO [train.py:996] (3/4) Epoch 7, batch 11650, loss[loss=0.2616, simple_loss=0.3434, pruned_loss=0.08996, over 20707.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3403, pruned_loss=0.08665, over 4274885.82 frames. 
], batch size: 607, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:51:52,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.531e+02 4.483e+02 5.705e+02 9.764e+02, threshold=8.966e+02, percent-clipped=9.0 2023-06-22 06:51:54,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1167702.0, ans=0.0 2023-06-22 06:52:23,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1167822.0, ans=0.2 2023-06-22 06:52:45,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1167882.0, ans=0.125 2023-06-22 06:53:18,242 INFO [train.py:996] (3/4) Epoch 7, batch 11700, loss[loss=0.2171, simple_loss=0.285, pruned_loss=0.07458, over 15780.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3313, pruned_loss=0.08637, over 4264759.68 frames. ], batch size: 65, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:53:20,498 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:53:39,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1168062.0, ans=0.2 2023-06-22 06:54:03,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1168122.0, ans=0.0 2023-06-22 06:54:07,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-22 06:54:14,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1168182.0, ans=0.125 2023-06-22 06:54:44,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1168242.0, ans=0.125 2023-06-22 06:54:44,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1168242.0, ans=0.125 2023-06-22 06:54:56,759 INFO [train.py:996] (3/4) Epoch 7, batch 11750, loss[loss=0.2551, simple_loss=0.3242, pruned_loss=0.09298, over 21792.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3235, pruned_loss=0.08574, over 4253561.83 frames. ], batch size: 118, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:54:58,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1168302.0, ans=0.1 2023-06-22 06:55:11,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.104e+02 3.664e+02 4.523e+02 8.929e+02, threshold=7.328e+02, percent-clipped=0.0 2023-06-22 06:55:24,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1168362.0, ans=0.2 2023-06-22 06:55:24,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1168362.0, ans=0.125 2023-06-22 06:56:44,766 INFO [train.py:996] (3/4) Epoch 7, batch 11800, loss[loss=0.3036, simple_loss=0.3698, pruned_loss=0.1186, over 21404.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3249, pruned_loss=0.0883, over 4263780.76 frames. 
], batch size: 507, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:56:50,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1168602.0, ans=0.2 2023-06-22 06:56:56,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168602.0, ans=0.1 2023-06-22 06:58:06,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-22 06:58:25,531 INFO [train.py:996] (3/4) Epoch 7, batch 11850, loss[loss=0.2758, simple_loss=0.4062, pruned_loss=0.07269, over 20748.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.328, pruned_loss=0.08743, over 4265747.80 frames. ], batch size: 607, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:58:37,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-22 06:58:39,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.224e+02 3.745e+02 4.482e+02 9.714e+02, threshold=7.491e+02, percent-clipped=2.0 2023-06-22 06:58:41,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1168902.0, ans=0.125 2023-06-22 06:59:34,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-22 07:00:12,046 INFO [train.py:996] (3/4) Epoch 7, batch 11900, loss[loss=0.2332, simple_loss=0.2969, pruned_loss=0.08475, over 21423.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.327, pruned_loss=0.08464, over 4266827.77 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:00:55,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1169322.0, ans=0.2 2023-06-22 07:01:02,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1169322.0, ans=0.125 2023-06-22 07:01:48,746 INFO [train.py:996] (3/4) Epoch 7, batch 11950, loss[loss=0.192, simple_loss=0.2886, pruned_loss=0.04772, over 21724.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3287, pruned_loss=0.08173, over 4260037.75 frames. 
], batch size: 351, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:01:50,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1169502.0, ans=0.125 2023-06-22 07:01:58,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.043e+02 3.599e+02 4.818e+02 9.282e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-22 07:02:14,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1169562.0, ans=0.0 2023-06-22 07:02:26,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1169562.0, ans=0.125 2023-06-22 07:02:32,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1169622.0, ans=15.0 2023-06-22 07:02:35,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1169622.0, ans=0.125 2023-06-22 07:02:35,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2023-06-22 07:03:27,369 INFO [train.py:996] (3/4) Epoch 7, batch 12000, loss[loss=0.204, simple_loss=0.2647, pruned_loss=0.07166, over 21229.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3233, pruned_loss=0.08045, over 4259486.82 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:03:27,370 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 07:03:36,379 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.7335, 3.0576, 2.9713, 1.6600], device='cuda:3') 2023-06-22 07:03:39,558 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.1708, 3.6099, 1.8156, 1.8269], device='cuda:3') 2023-06-22 07:03:43,841 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2652, simple_loss=0.3601, pruned_loss=0.08515, over 1796401.00 frames. 2023-06-22 07:03:43,842 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 07:04:22,196 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:04:25,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1169922.0, ans=0.0 2023-06-22 07:04:41,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1169982.0, ans=0.125 2023-06-22 07:04:54,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169982.0, ans=0.1 2023-06-22 07:05:08,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1170042.0, ans=0.0 2023-06-22 07:05:20,569 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:05:23,288 INFO [train.py:996] (3/4) Epoch 7, batch 12050, loss[loss=0.2715, simple_loss=0.3313, pruned_loss=0.1058, over 21853.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3194, pruned_loss=0.08329, over 4263634.66 frames. 
], batch size: 391, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:05:37,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.086e+02 3.580e+02 4.845e+02 1.189e+03, threshold=7.160e+02, percent-clipped=3.0 2023-06-22 07:05:41,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1170102.0, ans=0.125 2023-06-22 07:05:42,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1170162.0, ans=0.05 2023-06-22 07:05:56,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1170162.0, ans=0.0 2023-06-22 07:06:45,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1170282.0, ans=0.0 2023-06-22 07:06:45,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170282.0, ans=0.1 2023-06-22 07:07:08,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1170402.0, ans=0.125 2023-06-22 07:07:09,344 INFO [train.py:996] (3/4) Epoch 7, batch 12100, loss[loss=0.3357, simple_loss=0.3822, pruned_loss=0.1446, over 21206.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3239, pruned_loss=0.08816, over 4273090.64 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:07:13,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-22 07:08:10,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-22 07:08:18,672 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:08:57,244 INFO [train.py:996] (3/4) Epoch 7, batch 12150, loss[loss=0.225, simple_loss=0.3001, pruned_loss=0.07498, over 21295.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3254, pruned_loss=0.08666, over 4264806.54 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:09:07,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.401e+02 4.092e+02 5.164e+02 8.690e+02, threshold=8.185e+02, percent-clipped=4.0 2023-06-22 07:10:08,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1170882.0, ans=0.0 2023-06-22 07:10:35,912 INFO [train.py:996] (3/4) Epoch 7, batch 12200, loss[loss=0.2331, simple_loss=0.2825, pruned_loss=0.09187, over 21207.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.323, pruned_loss=0.08528, over 4262459.59 frames. ], batch size: 159, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:11:21,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1171122.0, ans=0.035 2023-06-22 07:11:43,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.14 vs. 
limit=15.0 2023-06-22 07:11:45,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1171182.0, ans=0.125 2023-06-22 07:11:45,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1171182.0, ans=0.035 2023-06-22 07:12:03,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1171242.0, ans=0.0 2023-06-22 07:12:07,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1171242.0, ans=0.125 2023-06-22 07:12:13,481 INFO [train.py:996] (3/4) Epoch 7, batch 12250, loss[loss=0.1834, simple_loss=0.2722, pruned_loss=0.04733, over 21686.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3138, pruned_loss=0.08189, over 4262499.80 frames. ], batch size: 391, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:12:24,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.345e+02 4.478e+02 6.132e+02 1.246e+03, threshold=8.957e+02, percent-clipped=10.0 2023-06-22 07:12:44,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1171362.0, ans=0.0 2023-06-22 07:13:06,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1171422.0, ans=0.125 2023-06-22 07:13:27,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1171482.0, ans=0.0 2023-06-22 07:13:52,883 INFO [train.py:996] (3/4) Epoch 7, batch 12300, loss[loss=0.1558, simple_loss=0.2329, pruned_loss=0.0394, over 21268.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3037, pruned_loss=0.07609, over 4253474.54 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:14:29,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171722.0, ans=0.1 2023-06-22 07:15:26,371 INFO [train.py:996] (3/4) Epoch 7, batch 12350, loss[loss=0.2639, simple_loss=0.3313, pruned_loss=0.09828, over 21498.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3071, pruned_loss=0.07632, over 4256003.77 frames. ], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:15:34,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1171902.0, ans=0.0 2023-06-22 07:15:37,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.594e+02 3.277e+02 4.549e+02 8.356e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-22 07:15:40,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1171962.0, ans=0.04949747468305833 2023-06-22 07:16:03,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-22 07:16:21,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=15.0 2023-06-22 07:16:25,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1172082.0, ans=0.1 2023-06-22 07:16:35,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1172082.0, ans=0.2 2023-06-22 07:17:05,103 INFO [train.py:996] (3/4) Epoch 7, batch 12400, loss[loss=0.2757, simple_loss=0.3427, pruned_loss=0.1044, over 21851.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3115, pruned_loss=0.08043, over 4268005.48 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:17:05,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1172202.0, ans=0.0 2023-06-22 07:17:13,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1172202.0, ans=0.125 2023-06-22 07:17:15,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-22 07:18:44,306 INFO [train.py:996] (3/4) Epoch 7, batch 12450, loss[loss=0.3106, simple_loss=0.3884, pruned_loss=0.1164, over 21811.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3157, pruned_loss=0.0838, over 4271493.24 frames. ], batch size: 124, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:19:01,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.229e+02 3.781e+02 4.439e+02 8.175e+02, threshold=7.562e+02, percent-clipped=5.0 2023-06-22 07:19:11,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1172562.0, ans=0.125 2023-06-22 07:19:39,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-22 07:20:32,513 INFO [train.py:996] (3/4) Epoch 7, batch 12500, loss[loss=0.2246, simple_loss=0.3613, pruned_loss=0.044, over 19783.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3271, pruned_loss=0.08654, over 4273591.79 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:21:06,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1172862.0, ans=0.125 2023-06-22 07:22:14,662 INFO [train.py:996] (3/4) Epoch 7, batch 12550, loss[loss=0.2477, simple_loss=0.332, pruned_loss=0.0817, over 21729.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.331, pruned_loss=0.08909, over 4276136.39 frames. 
], batch size: 332, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:22:32,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.175e+02 3.622e+02 4.685e+02 7.876e+02, threshold=7.244e+02, percent-clipped=1.0 2023-06-22 07:23:00,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1173222.0, ans=0.0 2023-06-22 07:23:18,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1173282.0, ans=0.125 2023-06-22 07:23:36,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1173282.0, ans=0.2 2023-06-22 07:23:53,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1173342.0, ans=0.125 2023-06-22 07:23:53,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1173342.0, ans=0.2 2023-06-22 07:24:00,199 INFO [train.py:996] (3/4) Epoch 7, batch 12600, loss[loss=0.2333, simple_loss=0.3188, pruned_loss=0.07385, over 21697.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3296, pruned_loss=0.08672, over 4275600.93 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:24:14,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1173402.0, ans=0.125 2023-06-22 07:24:49,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1173522.0, ans=0.125 2023-06-22 07:25:06,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1173582.0, ans=0.125 2023-06-22 07:25:15,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1173582.0, ans=0.125 2023-06-22 07:25:23,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1173642.0, ans=0.125 2023-06-22 07:25:29,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1173642.0, ans=0.125 2023-06-22 07:25:38,763 INFO [train.py:996] (3/4) Epoch 7, batch 12650, loss[loss=0.2376, simple_loss=0.3033, pruned_loss=0.08592, over 21282.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3235, pruned_loss=0.0835, over 4277391.04 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:25:51,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.154e+02 3.639e+02 4.446e+02 1.064e+03, threshold=7.278e+02, percent-clipped=5.0 2023-06-22 07:26:07,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-22 07:26:47,824 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:26:54,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-06-22 07:27:19,803 INFO [train.py:996] (3/4) Epoch 7, batch 12700, loss[loss=0.2678, simple_loss=0.3332, pruned_loss=0.1012, over 21744.00 frames. 
], tot_loss[loss=0.247, simple_loss=0.3222, pruned_loss=0.08595, over 4286678.68 frames. ], batch size: 332, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:27:32,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1174002.0, ans=0.07 2023-06-22 07:27:32,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174002.0, ans=0.1 2023-06-22 07:27:37,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1174002.0, ans=0.0 2023-06-22 07:28:10,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-22 07:28:26,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-22 07:28:32,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1174182.0, ans=0.015 2023-06-22 07:29:00,187 INFO [train.py:996] (3/4) Epoch 7, batch 12750, loss[loss=0.2372, simple_loss=0.3244, pruned_loss=0.07499, over 21644.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3236, pruned_loss=0.08639, over 4289461.16 frames. ], batch size: 263, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:29:07,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1174302.0, ans=0.0 2023-06-22 07:29:16,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1174302.0, ans=0.05 2023-06-22 07:29:17,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-22 07:29:17,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.566e+02 3.268e+02 3.639e+02 4.556e+02 7.416e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 07:29:18,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1174302.0, ans=0.0 2023-06-22 07:29:18,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1174302.0, ans=0.125 2023-06-22 07:29:28,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-22 07:29:40,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1174422.0, ans=0.0 2023-06-22 07:29:48,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1174422.0, ans=0.025 2023-06-22 07:30:44,402 INFO [train.py:996] (3/4) Epoch 7, batch 12800, loss[loss=0.2335, simple_loss=0.312, pruned_loss=0.07751, over 20073.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3245, pruned_loss=0.08706, over 4284392.17 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:31:35,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. 
limit=15.0 2023-06-22 07:31:44,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1174782.0, ans=0.2 2023-06-22 07:32:03,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-22 07:32:25,151 INFO [train.py:996] (3/4) Epoch 7, batch 12850, loss[loss=0.2048, simple_loss=0.2989, pruned_loss=0.05534, over 21749.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3257, pruned_loss=0.08757, over 4282227.49 frames. ], batch size: 282, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:32:39,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.112e+02 3.554e+02 4.407e+02 7.373e+02, threshold=7.108e+02, percent-clipped=1.0 2023-06-22 07:33:25,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1175022.0, ans=0.125 2023-06-22 07:33:50,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1175142.0, ans=0.0 2023-06-22 07:33:50,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1175142.0, ans=0.04949747468305833 2023-06-22 07:34:01,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1175142.0, ans=0.125 2023-06-22 07:34:06,254 INFO [train.py:996] (3/4) Epoch 7, batch 12900, loss[loss=0.2509, simple_loss=0.3403, pruned_loss=0.08079, over 21652.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3234, pruned_loss=0.08385, over 4276131.92 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:34:28,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1175262.0, ans=0.125 2023-06-22 07:35:00,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-22 07:35:27,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175382.0, ans=0.1 2023-06-22 07:35:53,252 INFO [train.py:996] (3/4) Epoch 7, batch 12950, loss[loss=0.2347, simple_loss=0.3045, pruned_loss=0.08245, over 21414.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3221, pruned_loss=0.08295, over 4279598.91 frames. ], batch size: 194, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:36:12,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.893e+02 3.599e+02 4.715e+02 8.391e+02, threshold=7.198e+02, percent-clipped=5.0 2023-06-22 07:36:53,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1175622.0, ans=0.0 2023-06-22 07:37:33,414 INFO [train.py:996] (3/4) Epoch 7, batch 13000, loss[loss=0.2048, simple_loss=0.2849, pruned_loss=0.06233, over 21732.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3216, pruned_loss=0.08269, over 4276173.41 frames. 
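], batch size: 298, lr: 4.33e-03, grad_scale: 16.0

The three losses printed by train.py:996 satisfy loss = 0.5 * simple_loss + pruned_loss everywhere in this section (batch 13000 just above: 0.5 * 0.2849 + 0.06233 ≈ 0.2048): the simple linear-combiner transducer loss is down-weighted by half relative to the pruned loss. A sketch of that combination, with the 0.5 inferred from the logged numbers rather than read out of the recipe; the real train.py also ramps these scales during warm-up, which ended long before epoch 7.

```python
def combine_pruned_rnnt_losses(simple_loss: float,
                               pruned_loss: float,
                               simple_loss_scale: float = 0.5) -> float:
    # The 0.5 weighting is inferred from the logged numbers in this
    # section, not copied from icefall's source; treat it as a sketch.
    return simple_loss_scale * simple_loss + pruned_loss

# Batch 13000 from the log: 0.5 * 0.2849 + 0.06233 = 0.204785 ~= 0.2048
assert abs(combine_pruned_rnnt_losses(0.2849, 0.06233) - 0.2048) < 1e-3
```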
2023-06-22 07:37:36,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1175802.0, ans=0.0 2023-06-22 07:37:53,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175862.0, ans=0.1 2023-06-22 07:38:37,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1175982.0, ans=0.125 2023-06-22 07:38:46,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1175982.0, ans=0.125 2023-06-22 07:39:07,117 INFO [train.py:996] (3/4) Epoch 7, batch 13050, loss[loss=0.2457, simple_loss=0.3064, pruned_loss=0.09247, over 21343.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3167, pruned_loss=0.08066, over 4284213.23 frames. ], batch size: 144, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:39:30,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.860e+02 3.531e+02 4.680e+02 1.133e+03, threshold=7.061e+02, percent-clipped=2.0 2023-06-22 07:39:45,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176162.0, ans=0.1 2023-06-22 07:40:32,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1176342.0, ans=0.0 2023-06-22 07:40:56,509 INFO [train.py:996] (3/4) Epoch 7, batch 13100, loss[loss=0.2369, simple_loss=0.3152, pruned_loss=0.07928, over 21746.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3175, pruned_loss=0.08037, over 4291801.48 frames. ], batch size: 247, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:41:01,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1176402.0, ans=0.95 2023-06-22 07:41:18,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-22 07:41:37,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1176522.0, ans=0.125 2023-06-22 07:41:45,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-22 07:42:02,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-22 07:42:22,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-22 07:42:23,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1176642.0, ans=0.125 2023-06-22 07:42:33,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1176642.0, ans=0.125 2023-06-22 07:42:42,748 INFO [train.py:996] (3/4) Epoch 7, batch 13150, loss[loss=0.2775, simple_loss=0.347, pruned_loss=0.104, over 21454.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.322, pruned_loss=0.0841, over 4292835.09 frames.
], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:42:46,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1176702.0, ans=0.0 2023-06-22 07:43:01,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.619e+02 4.512e+02 5.792e+02 9.632e+02, threshold=9.025e+02, percent-clipped=11.0 2023-06-22 07:43:12,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1176762.0, ans=0.0 2023-06-22 07:44:11,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1176942.0, ans=0.125 2023-06-22 07:44:12,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=12.0 2023-06-22 07:44:14,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1176942.0, ans=0.125 2023-06-22 07:44:16,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176942.0, ans=0.1 2023-06-22 07:44:23,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1177002.0, ans=0.125 2023-06-22 07:44:24,033 INFO [train.py:996] (3/4) Epoch 7, batch 13200, loss[loss=0.2629, simple_loss=0.3331, pruned_loss=0.09639, over 21825.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3216, pruned_loss=0.08478, over 4292645.87 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:44:43,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1177062.0, ans=0.125 2023-06-22 07:45:09,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-22 07:45:32,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1177182.0, ans=0.125 2023-06-22 07:45:34,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-06-22 07:46:09,406 INFO [train.py:996] (3/4) Epoch 7, batch 13250, loss[loss=0.218, simple_loss=0.3008, pruned_loss=0.06763, over 21472.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.322, pruned_loss=0.08741, over 4294762.62 frames. ], batch size: 211, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:46:24,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.563e+02 3.281e+02 4.048e+02 5.234e+02 8.486e+02, threshold=8.096e+02, percent-clipped=0.0 2023-06-22 07:46:34,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1177362.0, ans=0.07 2023-06-22 07:47:08,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1177482.0, ans=0.0 2023-06-22 07:47:50,901 INFO [train.py:996] (3/4) Epoch 7, batch 13300, loss[loss=0.2937, simple_loss=0.3628, pruned_loss=0.1122, over 21865.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3254, pruned_loss=0.08703, over 4279965.11 frames. 
], batch size: 118, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:48:12,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-22 07:48:32,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1177722.0, ans=0.2 2023-06-22 07:48:46,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1177722.0, ans=0.0 2023-06-22 07:49:10,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-22 07:49:24,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1177842.0, ans=0.0 2023-06-22 07:49:28,818 INFO [train.py:996] (3/4) Epoch 7, batch 13350, loss[loss=0.222, simple_loss=0.3146, pruned_loss=0.0647, over 20808.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3295, pruned_loss=0.08964, over 4278853.66 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:49:43,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.139e+02 3.531e+02 4.158e+02 7.079e+02, threshold=7.062e+02, percent-clipped=0.0 2023-06-22 07:51:01,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-22 07:51:08,304 INFO [train.py:996] (3/4) Epoch 7, batch 13400, loss[loss=0.2676, simple_loss=0.3354, pruned_loss=0.09991, over 21794.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3316, pruned_loss=0.0922, over 4280523.33 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:51:30,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1178262.0, ans=0.125 2023-06-22 07:51:59,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1178322.0, ans=0.125 2023-06-22 07:52:25,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178382.0, ans=0.1 2023-06-22 07:52:37,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1178442.0, ans=0.5 2023-06-22 07:52:39,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1178442.0, ans=0.125 2023-06-22 07:52:48,418 INFO [train.py:996] (3/4) Epoch 7, batch 13450, loss[loss=0.239, simple_loss=0.3107, pruned_loss=0.08367, over 21459.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3327, pruned_loss=0.09454, over 4283491.69 frames. ], batch size: 131, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:53:02,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.55 vs. 
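limit=22.5

Each [scaling.py:962] Whitening line, like the one just above, compares a whiteness statistic of some module's activations against a scheduled limit; the Whiten module only applies a corrective gradient once the metric exceeds that limit, nudging the channel covariance back toward a multiple of the identity. Below is a rough reconstruction of such a metric, assuming it is the mean squared eigenvalue of the per-group channel covariance divided by the squared mean eigenvalue, so that perfectly white features score about 1.0; the exact normalization in scaling.py may differ.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Plausible reconstruction of the 'metric' in the Whitening lines.

    x: (num_frames, num_channels) activations. Returns ~1.0 when the
    per-group channel covariance is a multiple of the identity
    ("white") and grows as the energy concentrates in few directions.
    """
    n, c = x.shape
    cpg = c // num_groups                                # channels per group
    g = x.reshape(n, num_groups, cpg).transpose(0, 1)    # (groups, n, cpg)
    g = g - g.mean(dim=1, keepdim=True)
    cov = torch.matmul(g.transpose(1, 2), g) / n         # (groups, cpg, cpg)
    mean_eig = cov.diagonal(dim1=1, dim2=2).mean()       # trace / dim
    mean_sq_eig = (cov ** 2).sum(dim=(1, 2)).mean() / cpg
    return (mean_sq_eig / mean_eig.clamp(min=1e-20) ** 2).item()

# White noise scores near 1.0, far below limits like 22.5 in the log;
# a module is only penalized once its metric exceeds its limit.
print(whitening_metric(torch.randn(10000, 256)))  # ~1.0
```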
2023-06-22 07:53:12,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.492e+02 3.365e+02 3.946e+02 4.575e+02 8.284e+02, threshold=7.892e+02, percent-clipped=1.0 2023-06-22 07:53:13,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1178562.0, ans=0.125 2023-06-22 07:54:28,478 INFO [train.py:996] (3/4) Epoch 7, batch 13500, loss[loss=0.247, simple_loss=0.3117, pruned_loss=0.09118, over 21305.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.323, pruned_loss=0.09087, over 4281942.82 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:55:29,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178922.0, ans=0.1 2023-06-22 07:55:30,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1178922.0, ans=0.125 2023-06-22 07:55:50,162 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:56:06,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1179042.0, ans=0.2 2023-06-22 07:56:08,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=22.5 2023-06-22 07:56:15,633 INFO [train.py:996] (3/4) Epoch 7, batch 13550, loss[loss=0.2822, simple_loss=0.3849, pruned_loss=0.08974, over 21680.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3283, pruned_loss=0.09058, over 4281084.24 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:56:33,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1179102.0, ans=0.125 2023-06-22 07:56:36,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 3.442e+02 4.149e+02 5.236e+02 8.278e+02, threshold=8.298e+02, percent-clipped=4.0 2023-06-22 07:56:37,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179162.0, ans=0.1 2023-06-22 07:56:38,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1179162.0, ans=0.125 2023-06-22 07:56:43,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-22 07:57:19,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1179282.0, ans=0.125 2023-06-22 07:57:38,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1179342.0, ans=0.125 2023-06-22 07:57:54,928 INFO [train.py:996] (3/4) Epoch 7, batch 13600, loss[loss=0.1962, simple_loss=0.2743, pruned_loss=0.05906, over 21483.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3282, pruned_loss=0.09007, over 4280659.22 frames. ], batch size: 194, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:58:19,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs.
limit=15.0 2023-06-22 07:58:20,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1179462.0, ans=0.0 2023-06-22 07:58:32,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1179462.0, ans=0.0 2023-06-22 07:58:46,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1179522.0, ans=0.125 2023-06-22 07:58:50,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-22 07:59:34,111 INFO [train.py:996] (3/4) Epoch 7, batch 13650, loss[loss=0.2555, simple_loss=0.3137, pruned_loss=0.09868, over 21568.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3222, pruned_loss=0.08601, over 4281109.88 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:59:39,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1179702.0, ans=0.0 2023-06-22 07:59:54,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.926e+02 3.440e+02 4.459e+02 9.365e+02, threshold=6.879e+02, percent-clipped=1.0 2023-06-22 07:59:59,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1179762.0, ans=0.125 2023-06-22 08:00:15,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1179822.0, ans=0.05 2023-06-22 08:00:29,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1179822.0, ans=0.2 2023-06-22 08:00:42,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 08:01:13,429 INFO [train.py:996] (3/4) Epoch 7, batch 13700, loss[loss=0.1968, simple_loss=0.2494, pruned_loss=0.07211, over 21756.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3159, pruned_loss=0.08526, over 4266876.30 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:01:33,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1180062.0, ans=0.0 2023-06-22 08:01:40,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-22 08:01:54,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180122.0, ans=0.1 2023-06-22 08:02:41,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-22 08:02:51,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1180242.0, ans=0.0 2023-06-22 08:02:53,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1180242.0, ans=0.125 2023-06-22 08:02:59,601 INFO [train.py:996] (3/4) Epoch 7, batch 13750, loss[loss=0.2062, simple_loss=0.2804, pruned_loss=0.06601, over 21595.00 frames. 
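], tot_loss[loss=0.2401, simple_loss=0.313, pruned_loss=0.08362, over 4271913.75 frames. ], batch size: 230, lr: 4.33e-03, grad_scale: 16.0

The many [scaling.py:182] lines each print a ScheduledFloat, a scalar regularization knob (skip rate, dropout probability, balancer bound, and so on) whose current value, ans, is a piecewise-linear function of batch_count. At batch_count ≈ 1.18e6 nearly every schedule has reached its terminal value, which is why the same ans keeps repeating for a given name. A minimal sketch of the idea, not the actual icefall class; the schedule points below are invented for illustration.

```python
class ScheduledFloat:
    """Sketch of a scalar hyper-parameter that is piecewise-linear in
    batch_count, in the spirit of the [scaling.py:182] log lines."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs defining the schedule.
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:  # linear interpolation
                return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Invented schedule: a conv skip rate annealed to zero early in training.
conv_skip_rate = ScheduledFloat((0.0, 0.2), (20000.0, 0.05), (50000.0, 0.0))
print(conv_skip_rate.value_at(1180302.0))  # 0.0, schedule long exhausted
```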
2023-06-22 08:03:08,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1180302.0, ans=0.0 2023-06-22 08:03:20,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1180362.0, ans=0.0 2023-06-22 08:03:23,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.446e+02 3.304e+02 4.106e+02 4.985e+02 1.123e+03, threshold=8.212e+02, percent-clipped=9.0 2023-06-22 08:03:36,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1180422.0, ans=0.125 2023-06-22 08:04:11,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 08:04:47,319 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:04:48,518 INFO [train.py:996] (3/4) Epoch 7, batch 13800, loss[loss=0.264, simple_loss=0.3658, pruned_loss=0.08112, over 21784.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3178, pruned_loss=0.08247, over 4273795.27 frames. ], batch size: 332, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:04:55,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1180602.0, ans=0.0 2023-06-22 08:05:35,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180722.0, ans=0.1 2023-06-22 08:05:56,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1180782.0, ans=0.125 2023-06-22 08:06:20,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1180842.0, ans=0.2 2023-06-22 08:06:29,555 INFO [train.py:996] (3/4) Epoch 7, batch 13850, loss[loss=0.3094, simple_loss=0.3864, pruned_loss=0.1162, over 21324.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3249, pruned_loss=0.08459, over 4271125.93 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:06:51,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 3.586e+02 4.613e+02 6.020e+02 1.189e+03, threshold=9.227e+02, percent-clipped=5.0 2023-06-22 08:07:29,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1181022.0, ans=0.2 2023-06-22 08:08:08,722 INFO [train.py:996] (3/4) Epoch 7, batch 13900, loss[loss=0.2812, simple_loss=0.3455, pruned_loss=0.1084, over 21804.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3277, pruned_loss=0.08776, over 4272920.70 frames.
], batch size: 351, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:08:44,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1181262.0, ans=0.125 2023-06-22 08:08:55,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181322.0, ans=0.1 2023-06-22 08:08:56,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1181322.0, ans=0.125 2023-06-22 08:09:36,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1181442.0, ans=0.0 2023-06-22 08:09:49,796 INFO [train.py:996] (3/4) Epoch 7, batch 13950, loss[loss=0.2701, simple_loss=0.3413, pruned_loss=0.09942, over 21850.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3274, pruned_loss=0.08989, over 4285384.46 frames. ], batch size: 414, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:09:59,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1181502.0, ans=0.2 2023-06-22 08:10:02,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1181502.0, ans=0.2 2023-06-22 08:10:06,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-22 08:10:18,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.399e+02 3.923e+02 4.848e+02 6.986e+02, threshold=7.845e+02, percent-clipped=0.0 2023-06-22 08:10:39,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1181622.0, ans=0.125 2023-06-22 08:10:39,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1181622.0, ans=0.125 2023-06-22 08:11:27,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1181802.0, ans=0.0 2023-06-22 08:11:28,462 INFO [train.py:996] (3/4) Epoch 7, batch 14000, loss[loss=0.2471, simple_loss=0.3114, pruned_loss=0.09147, over 20065.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3255, pruned_loss=0.08786, over 4280750.67 frames. ], batch size: 702, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:11:30,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1181802.0, ans=0.125 2023-06-22 08:11:56,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-22 08:12:27,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1181922.0, ans=0.05 2023-06-22 08:12:49,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-22 08:13:10,908 INFO [train.py:996] (3/4) Epoch 7, batch 14050, loss[loss=0.2054, simple_loss=0.3015, pruned_loss=0.05467, over 21748.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3196, pruned_loss=0.08324, over 4282574.72 frames. 
], batch size: 298, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:13:34,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.004e+02 3.495e+02 4.384e+02 1.047e+03, threshold=6.990e+02, percent-clipped=3.0 2023-06-22 08:13:57,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-22 08:14:27,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0 2023-06-22 08:14:49,751 INFO [train.py:996] (3/4) Epoch 7, batch 14100, loss[loss=0.2527, simple_loss=0.3164, pruned_loss=0.09444, over 21332.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3131, pruned_loss=0.08226, over 4283306.72 frames. ], batch size: 131, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:15:21,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1182462.0, ans=0.2 2023-06-22 08:15:25,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1182462.0, ans=0.125 2023-06-22 08:15:31,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1182522.0, ans=0.0 2023-06-22 08:15:46,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1182582.0, ans=0.0 2023-06-22 08:15:54,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1182582.0, ans=0.125 2023-06-22 08:16:04,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1182582.0, ans=0.125 2023-06-22 08:16:07,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1182642.0, ans=0.0 2023-06-22 08:16:21,796 INFO [train.py:996] (3/4) Epoch 7, batch 14150, loss[loss=0.2248, simple_loss=0.3153, pruned_loss=0.06712, over 21630.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3167, pruned_loss=0.08377, over 4283985.57 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:16:44,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.878e+02 3.254e+02 3.924e+02 9.436e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-22 08:16:51,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-22 08:17:57,549 INFO [train.py:996] (3/4) Epoch 7, batch 14200, loss[loss=0.229, simple_loss=0.3001, pruned_loss=0.07898, over 21374.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3152, pruned_loss=0.08161, over 4277910.59 frames. ], batch size: 194, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:18:05,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-22 08:18:13,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.02 vs. 
limit=15.0 2023-06-22 08:18:15,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1183062.0, ans=0.0 2023-06-22 08:18:18,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1183062.0, ans=0.125 2023-06-22 08:18:28,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1183062.0, ans=0.035 2023-06-22 08:18:46,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1183122.0, ans=0.125 2023-06-22 08:18:46,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1183122.0, ans=0.2 2023-06-22 08:18:57,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.44 vs. limit=6.0 2023-06-22 08:19:30,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1183242.0, ans=0.0 2023-06-22 08:19:36,399 INFO [train.py:996] (3/4) Epoch 7, batch 14250, loss[loss=0.2301, simple_loss=0.2917, pruned_loss=0.08428, over 20600.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3109, pruned_loss=0.08222, over 4267074.29 frames. ], batch size: 607, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:19:41,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1183302.0, ans=0.125 2023-06-22 08:19:55,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.870e+02 3.314e+02 3.996e+02 6.865e+02, threshold=6.627e+02, percent-clipped=2.0 2023-06-22 08:20:32,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-22 08:20:43,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-22 08:21:07,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1183542.0, ans=0.125 2023-06-22 08:21:16,214 INFO [train.py:996] (3/4) Epoch 7, batch 14300, loss[loss=0.2725, simple_loss=0.3574, pruned_loss=0.09385, over 21676.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3136, pruned_loss=0.08284, over 4266481.28 frames. ], batch size: 247, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:22:11,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. 
limit=15.0 2023-06-22 08:22:18,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1183782.0, ans=0.125 2023-06-22 08:22:20,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183782.0, ans=0.1 2023-06-22 08:22:24,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1183782.0, ans=0.05 2023-06-22 08:22:29,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1183782.0, ans=0.125 2023-06-22 08:22:56,605 INFO [train.py:996] (3/4) Epoch 7, batch 14350, loss[loss=0.2251, simple_loss=0.3039, pruned_loss=0.07313, over 21846.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3184, pruned_loss=0.08274, over 4272562.89 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:23:05,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1183902.0, ans=0.125 2023-06-22 08:23:15,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.408e+02 4.555e+02 6.047e+02 1.523e+03, threshold=9.110e+02, percent-clipped=21.0 2023-06-22 08:23:39,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1184022.0, ans=0.025 2023-06-22 08:23:41,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1184022.0, ans=0.0 2023-06-22 08:24:34,922 INFO [train.py:996] (3/4) Epoch 7, batch 14400, loss[loss=0.2398, simple_loss=0.2981, pruned_loss=0.09073, over 21429.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3154, pruned_loss=0.08331, over 4269335.48 frames. ], batch size: 212, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:25:06,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-22 08:25:57,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1184442.0, ans=0.125 2023-06-22 08:26:11,481 INFO [train.py:996] (3/4) Epoch 7, batch 14450, loss[loss=0.2554, simple_loss=0.3022, pruned_loss=0.1043, over 21593.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.31, pruned_loss=0.08365, over 4265754.72 frames. ], batch size: 508, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:26:30,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.987e+02 3.327e+02 4.057e+02 7.605e+02, threshold=6.653e+02, percent-clipped=0.0 2023-06-22 08:26:30,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1184562.0, ans=0.2 2023-06-22 08:27:36,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-22 08:27:52,114 INFO [train.py:996] (3/4) Epoch 7, batch 14500, loss[loss=0.258, simple_loss=0.3854, pruned_loss=0.06526, over 20813.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3069, pruned_loss=0.08289, over 4268427.98 frames. 
], batch size: 607, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:28:47,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1184922.0, ans=0.125 2023-06-22 08:28:50,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1184982.0, ans=0.0 2023-06-22 08:29:28,322 INFO [train.py:996] (3/4) Epoch 7, batch 14550, loss[loss=0.2714, simple_loss=0.3445, pruned_loss=0.09917, over 21689.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3125, pruned_loss=0.08455, over 4268853.29 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:29:57,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.217e+02 4.103e+02 5.336e+02 9.308e+02, threshold=8.206e+02, percent-clipped=6.0 2023-06-22 08:30:07,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1185162.0, ans=0.125 2023-06-22 08:30:11,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1185162.0, ans=0.2 2023-06-22 08:30:11,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-22 08:30:15,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1185222.0, ans=0.07 2023-06-22 08:30:27,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1185222.0, ans=0.2 2023-06-22 08:30:52,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-22 08:30:57,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1185342.0, ans=0.125 2023-06-22 08:31:09,739 INFO [train.py:996] (3/4) Epoch 7, batch 14600, loss[loss=0.2292, simple_loss=0.3159, pruned_loss=0.07127, over 21321.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3204, pruned_loss=0.08882, over 4263558.06 frames. ], batch size: 176, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:32:05,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1185522.0, ans=0.125 2023-06-22 08:32:48,041 INFO [train.py:996] (3/4) Epoch 7, batch 14650, loss[loss=0.2796, simple_loss=0.3676, pruned_loss=0.09579, over 21281.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.323, pruned_loss=0.08803, over 4271286.51 frames. 
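], batch size: 548, lr: 4.32e-03, grad_scale: 32.0

The batch size field counts cuts per batch, and it swings from 60 to over 700 across these entries because batches are packed to a roughly constant total duration rather than a fixed count: a bucket of short utterances yields a huge batch, a bucket of long ones a tiny batch, while each batch covers a comparable number of frames. A toy version of that duration-budget packing follows; lhotse's DynamicBucketingSampler additionally groups cuts of similar duration before packing, and the budget value here is illustrative.

```python
def pack_by_duration(durations, max_duration=900.0):
    """Yield batches of cut indices whose total duration stays under a
    budget; a toy stand-in for duration-constrained batch sampling."""
    batch, total = [], 0.0
    for i, dur in enumerate(durations):
        if batch and total + dur > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(i)
        total += dur
    if batch:  # final partial batch
        yield batch

short = [2.0] * 1000   # 2 s cuts  -> ~450 cuts per batch
long_ = [15.0] * 100   # 15 s cuts -> ~60 cuts per batch
print(len(next(pack_by_duration(short))), len(next(pack_by_duration(long_))))
```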
2023-06-22 08:33:04,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1185702.0, ans=0.2 2023-06-22 08:33:22,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.921e+02 3.378e+02 4.532e+02 7.463e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-22 08:33:24,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1185762.0, ans=0.0 2023-06-22 08:33:24,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1185762.0, ans=0.2 2023-06-22 08:33:26,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.02 vs. limit=22.5 2023-06-22 08:33:31,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1185762.0, ans=0.125 2023-06-22 08:33:56,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1185882.0, ans=0.0 2023-06-22 08:34:15,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185942.0, ans=0.1 2023-06-22 08:34:28,129 INFO [train.py:996] (3/4) Epoch 7, batch 14700, loss[loss=0.2444, simple_loss=0.3466, pruned_loss=0.07107, over 21673.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.315, pruned_loss=0.08137, over 4271391.66 frames. ], batch size: 389, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:35:31,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1186122.0, ans=0.125 2023-06-22 08:36:19,332 INFO [train.py:996] (3/4) Epoch 7, batch 14750, loss[loss=0.1474, simple_loss=0.208, pruned_loss=0.04344, over 16382.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3206, pruned_loss=0.08462, over 4262140.76 frames. ], batch size: 60, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:36:29,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1186302.0, ans=0.125 2023-06-22 08:36:45,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.126e+02 3.786e+02 4.508e+02 7.747e+02, threshold=7.572e+02, percent-clipped=1.0 2023-06-22 08:37:27,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1186482.0, ans=0.0 2023-06-22 08:38:03,831 INFO [train.py:996] (3/4) Epoch 7, batch 14800, loss[loss=0.3038, simple_loss=0.3534, pruned_loss=0.1271, over 21302.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3327, pruned_loss=0.09074, over 4260790.31 frames.
], batch size: 471, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:38:05,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1186602.0, ans=0.0 2023-06-22 08:38:09,172 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:38:20,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186662.0, ans=0.1 2023-06-22 08:38:50,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1186722.0, ans=0.125 2023-06-22 08:39:01,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186782.0, ans=0.1 2023-06-22 08:39:42,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1186842.0, ans=0.125 2023-06-22 08:39:45,578 INFO [train.py:996] (3/4) Epoch 7, batch 14850, loss[loss=0.2607, simple_loss=0.3291, pruned_loss=0.0962, over 21752.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3264, pruned_loss=0.09041, over 4265186.41 frames. ], batch size: 282, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:39:45,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1186902.0, ans=0.125 2023-06-22 08:39:47,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1186902.0, ans=0.0 2023-06-22 08:39:52,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1186902.0, ans=10.0 2023-06-22 08:40:12,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.436e+02 3.807e+02 4.957e+02 1.167e+03, threshold=7.615e+02, percent-clipped=4.0 2023-06-22 08:40:47,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187022.0, ans=0.1 2023-06-22 08:41:03,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1187082.0, ans=0.125 2023-06-22 08:41:11,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1187142.0, ans=0.125 2023-06-22 08:41:16,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1187142.0, ans=0.125 2023-06-22 08:41:32,003 INFO [train.py:996] (3/4) Epoch 7, batch 14900, loss[loss=0.3132, simple_loss=0.3711, pruned_loss=0.1276, over 21398.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3304, pruned_loss=0.09235, over 4265953.84 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:41:46,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1187262.0, ans=0.2 2023-06-22 08:43:08,559 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:43:12,799 INFO [train.py:996] (3/4) Epoch 7, batch 14950, loss[loss=0.2217, simple_loss=0.3127, pruned_loss=0.06533, over 21741.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3298, pruned_loss=0.09118, over 4264204.23 frames. 
], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:43:13,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1187502.0, ans=0.0 2023-06-22 08:43:22,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1187502.0, ans=0.125 2023-06-22 08:43:39,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.264e+02 3.667e+02 4.078e+02 7.613e+02, threshold=7.333e+02, percent-clipped=0.0 2023-06-22 08:44:06,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1187622.0, ans=0.125 2023-06-22 08:44:17,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1187682.0, ans=0.0 2023-06-22 08:44:52,918 INFO [train.py:996] (3/4) Epoch 7, batch 15000, loss[loss=0.2459, simple_loss=0.3075, pruned_loss=0.09212, over 21783.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3329, pruned_loss=0.09295, over 4261748.52 frames. ], batch size: 247, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:44:52,918 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 08:45:09,858 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2588, simple_loss=0.3554, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-22 08:45:09,859 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 08:46:02,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1187922.0, ans=0.2 2023-06-22 08:46:36,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188042.0, ans=0.1 2023-06-22 08:46:56,310 INFO [train.py:996] (3/4) Epoch 7, batch 15050, loss[loss=0.2359, simple_loss=0.3207, pruned_loss=0.0755, over 21760.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3343, pruned_loss=0.09402, over 4265441.24 frames. ], batch size: 282, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:47:04,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1188102.0, ans=0.125 2023-06-22 08:47:27,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.352e+02 4.069e+02 4.839e+02 9.529e+02, threshold=8.138e+02, percent-clipped=2.0 2023-06-22 08:47:52,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188222.0, ans=0.0 2023-06-22 08:48:00,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188282.0, ans=0.1 2023-06-22 08:48:39,529 INFO [train.py:996] (3/4) Epoch 7, batch 15100, loss[loss=0.2098, simple_loss=0.2748, pruned_loss=0.0724, over 20194.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3365, pruned_loss=0.09385, over 4264483.04 frames. 
], batch size: 703, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:49:30,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1188522.0, ans=0.07 2023-06-22 08:49:31,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1188522.0, ans=0.125 2023-06-22 08:49:36,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1188582.0, ans=0.125 2023-06-22 08:50:19,326 INFO [train.py:996] (3/4) Epoch 7, batch 15150, loss[loss=0.2362, simple_loss=0.2971, pruned_loss=0.08768, over 21815.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3321, pruned_loss=0.0938, over 4272441.86 frames. ], batch size: 98, lr: 4.31e-03, grad_scale: 8.0 2023-06-22 08:50:33,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1188702.0, ans=0.0 2023-06-22 08:50:45,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-22 08:50:49,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.254e+02 3.801e+02 4.686e+02 8.027e+02, threshold=7.602e+02, percent-clipped=0.0 2023-06-22 08:50:54,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1188762.0, ans=0.125 2023-06-22 08:52:04,632 INFO [train.py:996] (3/4) Epoch 7, batch 15200, loss[loss=0.1993, simple_loss=0.2902, pruned_loss=0.05417, over 21700.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3224, pruned_loss=0.08841, over 4272013.84 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:53:11,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1189182.0, ans=0.0 2023-06-22 08:53:44,165 INFO [train.py:996] (3/4) Epoch 7, batch 15250, loss[loss=0.2064, simple_loss=0.2746, pruned_loss=0.0691, over 21668.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3165, pruned_loss=0.08727, over 4264985.98 frames. ], batch size: 282, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:53:47,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189302.0, ans=0.1 2023-06-22 08:54:13,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.041e+02 3.715e+02 4.659e+02 9.808e+02, threshold=7.430e+02, percent-clipped=2.0 2023-06-22 08:54:23,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1189422.0, ans=0.2 2023-06-22 08:55:25,347 INFO [train.py:996] (3/4) Epoch 7, batch 15300, loss[loss=0.3318, simple_loss=0.3811, pruned_loss=0.1413, over 21591.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3183, pruned_loss=0.08997, over 4271626.84 frames. 
], batch size: 415, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:55:25,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189602.0, ans=0.1 2023-06-22 08:56:04,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1189722.0, ans=0.05 2023-06-22 08:56:38,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=22.5 2023-06-22 08:56:45,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1189782.0, ans=0.125 2023-06-22 08:56:57,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1189842.0, ans=0.09899494936611666 2023-06-22 08:57:04,784 INFO [train.py:996] (3/4) Epoch 7, batch 15350, loss[loss=0.2536, simple_loss=0.3299, pruned_loss=0.08861, over 21276.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3238, pruned_loss=0.09243, over 4267054.25 frames. ], batch size: 143, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:57:11,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1189902.0, ans=0.0 2023-06-22 08:57:16,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1189902.0, ans=0.125 2023-06-22 08:57:28,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1189962.0, ans=0.2 2023-06-22 08:57:28,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189962.0, ans=0.1 2023-06-22 08:57:33,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.368e+02 3.940e+02 5.271e+02 1.051e+03, threshold=7.879e+02, percent-clipped=5.0 2023-06-22 08:57:34,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1189962.0, ans=0.07 2023-06-22 08:58:43,311 INFO [train.py:996] (3/4) Epoch 7, batch 15400, loss[loss=0.2159, simple_loss=0.3132, pruned_loss=0.0593, over 21773.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3233, pruned_loss=0.08986, over 4262179.72 frames. ], batch size: 298, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:58:56,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-22 08:59:40,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1190382.0, ans=0.125 2023-06-22 08:59:57,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1190442.0, ans=22.5 2023-06-22 09:00:08,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190442.0, ans=0.1 2023-06-22 09:00:22,613 INFO [train.py:996] (3/4) Epoch 7, batch 15450, loss[loss=0.2316, simple_loss=0.3048, pruned_loss=0.07925, over 21520.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.321, pruned_loss=0.08856, over 4267706.49 frames. 
], batch size: 131, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:00:25,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1190502.0, ans=0.125 2023-06-22 09:00:51,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.924e+02 3.383e+02 4.121e+02 7.553e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-22 09:01:00,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 09:01:03,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1190622.0, ans=0.0 2023-06-22 09:01:09,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1190622.0, ans=0.125 2023-06-22 09:01:16,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.27 vs. limit=6.0 2023-06-22 09:01:17,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-22 09:01:22,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1190682.0, ans=10.0 2023-06-22 09:01:55,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190742.0, ans=0.1 2023-06-22 09:02:02,958 INFO [train.py:996] (3/4) Epoch 7, batch 15500, loss[loss=0.2441, simple_loss=0.3306, pruned_loss=0.07883, over 21873.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3242, pruned_loss=0.08887, over 4272067.07 frames. ], batch size: 371, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:02:36,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1190862.0, ans=0.125 2023-06-22 09:02:52,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1190922.0, ans=0.125 2023-06-22 09:03:22,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.95 vs. limit=5.0 2023-06-22 09:03:48,149 INFO [train.py:996] (3/4) Epoch 7, batch 15550, loss[loss=0.2499, simple_loss=0.3314, pruned_loss=0.08417, over 21612.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3234, pruned_loss=0.0864, over 4263937.72 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:03:54,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1191102.0, ans=0.125 2023-06-22 09:04:12,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.104e+02 3.542e+02 4.427e+02 7.965e+02, threshold=7.084e+02, percent-clipped=2.0 2023-06-22 09:05:21,968 INFO [train.py:996] (3/4) Epoch 7, batch 15600, loss[loss=0.291, simple_loss=0.3254, pruned_loss=0.1283, over 21352.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3176, pruned_loss=0.08551, over 4258340.58 frames. 
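], batch size: 507, lr: 4.31e-03, grad_scale: 32.0

The logged lr decays only in the fourth digit across thousands of batches (4.34e-03 at the top of this stretch, 4.30e-03 below). That is consistent with icefall's Eden schedule, which, setting aside its warm-up factor, multiplies a base learning rate by slowly-shrinking batch and epoch factors. A sketch under that assumption; the exact form should be checked against optim.py, and the constants plugged in below are examples rather than values read out of this run.

```python
def eden_lr(base_lr: float, batch: float, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    # Assumed form of the Eden schedule (warm-up factor omitted):
    # lr = base_lr * ((batch/lr_batches)^2 + 1)^-0.25
    #              * ((epoch/lr_epochs)^2 + 1)^-0.25
    batch_factor = ((batch / lr_batches) ** 2 + 1) ** -0.25
    epoch_factor = ((epoch / lr_epochs) ** 2 + 1) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With these example constants, a mid-epoch-7 batch index gives ~4.5e-03,
# the same ballpark as the lr fields logged in this section.
print(eden_lr(0.045, batch=160000, epoch=7))
```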
2023-06-22 09:06:53,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1191642.0, ans=0.125 2023-06-22 09:07:08,478 INFO [train.py:996] (3/4) Epoch 7, batch 15650, loss[loss=0.236, simple_loss=0.2993, pruned_loss=0.08638, over 21502.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3151, pruned_loss=0.0843, over 4261635.46 frames. ], batch size: 195, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:07:23,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1191762.0, ans=0.125 2023-06-22 09:07:38,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.201e+02 3.774e+02 4.746e+02 8.455e+02, threshold=7.547e+02, percent-clipped=5.0 2023-06-22 09:07:44,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-22 09:08:09,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 09:08:17,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1191882.0, ans=0.0 2023-06-22 09:08:47,620 INFO [train.py:996] (3/4) Epoch 7, batch 15700, loss[loss=0.2201, simple_loss=0.2772, pruned_loss=0.08157, over 15449.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.312, pruned_loss=0.0837, over 4263604.50 frames. ], batch size: 60, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:10:02,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1192182.0, ans=0.0 2023-06-22 09:10:07,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-22 09:10:27,344 INFO [train.py:996] (3/4) Epoch 7, batch 15750, loss[loss=0.3156, simple_loss=0.3566, pruned_loss=0.1373, over 21411.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.307, pruned_loss=0.08343, over 4260351.24 frames. ], batch size: 508, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:10:27,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1192302.0, ans=0.125 2023-06-22 09:10:32,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1192302.0, ans=0.125 2023-06-22 09:10:35,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192302.0, ans=0.1 2023-06-22 09:10:40,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-22 09:10:44,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1192362.0, ans=0.0 2023-06-22 09:10:56,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.176e+02 3.735e+02 4.754e+02 7.774e+02, threshold=7.471e+02, percent-clipped=1.0 2023-06-22 09:12:06,991 INFO [train.py:996] (3/4) Epoch 7, batch 15800, loss[loss=0.2001, simple_loss=0.2559, pruned_loss=0.07213, over 21478.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.302, pruned_loss=0.0835, over 4261644.11 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0
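
The [scaling.py:962] Whitening lines are a related diagnostic: for a named activation, split into num_groups groups of num_channels channels, a whitening metric is compared against a limit (the limit itself can be a ScheduledFloat, as the whiten.whitening_limit entry earlier shows). One plausible definition of such a metric, assumed here rather than taken from the source, is the ratio of the mean squared eigenvalue of the per-group feature covariance to the squared mean eigenvalue: it is 1.0 for perfectly white features and grows as energy concentrates in a few directions.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels); channels split into num_groups groups
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        cov = x.transpose(1, 2) @ x / n              # (num_groups, d, d)
        eigs = torch.linalg.eigvalsh(cov)            # per-group spectrum
        ratio = eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)
        return ratio.mean().item()

    x = torch.randn(1000, 256)                       # near-white activations
    print(whitening_metric(x, num_groups=1))         # close to 1, far below 15.0
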
2023-06-22 09:12:40,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1192662.0, ans=0.0 2023-06-22 09:12:57,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1192722.0, ans=0.0 2023-06-22 09:13:27,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1192842.0, ans=0.0 2023-06-22 09:13:39,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1192842.0, ans=0.125 2023-06-22 09:13:44,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1192902.0, ans=0.0 2023-06-22 09:13:45,629 INFO [train.py:996] (3/4) Epoch 7, batch 15850, loss[loss=0.2231, simple_loss=0.2912, pruned_loss=0.07747, over 21875.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3052, pruned_loss=0.08596, over 4260054.33 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:13:52,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-22 09:14:15,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.067e+02 3.802e+02 4.626e+02 8.154e+02, threshold=7.604e+02, percent-clipped=3.0 2023-06-22 09:14:16,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1192962.0, ans=0.2 2023-06-22 09:14:54,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1193082.0, ans=0.125 2023-06-22 09:14:58,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1193082.0, ans=0.0 2023-06-22 09:15:21,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1193142.0, ans=0.125 2023-06-22 09:15:26,013 INFO [train.py:996] (3/4) Epoch 7, batch 15900, loss[loss=0.2207, simple_loss=0.2809, pruned_loss=0.0802, over 21779.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3032, pruned_loss=0.08562, over 4259604.74 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:15:47,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1193262.0, ans=0.95 2023-06-22 09:15:47,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1193262.0, ans=0.125 2023-06-22 09:15:58,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1193262.0, ans=0.125 2023-06-22 09:16:54,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193442.0, ans=0.1 2023-06-22 09:17:05,281 INFO [train.py:996] (3/4) Epoch 7, batch 15950, loss[loss=0.2329, simple_loss=0.3009, pruned_loss=0.08244, over 21663.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3056, pruned_loss=0.08374, over 4265758.02 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0
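
The grad_scale field in the [train.py:996] records alternates between 16.0 and 32.0, the signature of dynamic loss scaling in mixed-precision training: the scale doubles after a run of overflow-free steps and halves when gradients overflow. If this run uses fp16 autocast, as that field suggests, the mechanics are essentially those of torch.cuda.amp.GradScaler; model, optimizer and compute_loss below are illustrative stand-ins:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def train_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)   # forward in fp16 where safe
        scaler.scale(loss).backward()           # backprop scaled gradients
        scaler.step(optimizer)                  # unscale; skip step on inf/nan
        scaler.update()                         # double or halve the scale
        return scaler.get_scale()               # the "grad_scale" being logged
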
2023-06-22 09:17:12,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1193502.0, ans=0.0 2023-06-22 09:17:22,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1193562.0, ans=0.125 2023-06-22 09:17:31,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.037e+02 3.517e+02 4.251e+02 9.007e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 09:18:46,868 INFO [train.py:996] (3/4) Epoch 7, batch 16000, loss[loss=0.2382, simple_loss=0.3357, pruned_loss=0.07038, over 21646.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3073, pruned_loss=0.08194, over 4266585.55 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:19:11,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1193862.0, ans=0.125 2023-06-22 09:19:18,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1193862.0, ans=0.0 2023-06-22 09:20:16,521 INFO [train.py:996] (3/4) Epoch 7, batch 16050, loss[loss=0.2331, simple_loss=0.3369, pruned_loss=0.06469, over 20804.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3104, pruned_loss=0.07961, over 4262313.25 frames. ], batch size: 608, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:20:29,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1194102.0, ans=0.125 2023-06-22 09:20:47,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.171e+02 3.896e+02 5.247e+02 9.817e+02, threshold=7.791e+02, percent-clipped=4.0 2023-06-22 09:21:08,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-22 09:21:46,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1194342.0, ans=0.125 2023-06-22 09:21:53,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1194342.0, ans=0.07 2023-06-22 09:21:53,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1194342.0, ans=0.125 2023-06-22 09:21:54,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1194402.0, ans=0.125 2023-06-22 09:21:55,840 INFO [train.py:996] (3/4) Epoch 7, batch 16100, loss[loss=0.2206, simple_loss=0.2937, pruned_loss=0.07378, over 21201.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3144, pruned_loss=0.0805, over 4268597.74 frames.
], batch size: 143, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:22:23,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1194462.0, ans=0.2 2023-06-22 09:22:51,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1194582.0, ans=0.0 2023-06-22 09:23:07,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194582.0, ans=0.1 2023-06-22 09:23:34,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1194702.0, ans=0.125 2023-06-22 09:23:35,217 INFO [train.py:996] (3/4) Epoch 7, batch 16150, loss[loss=0.2375, simple_loss=0.2976, pruned_loss=0.08869, over 21925.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3155, pruned_loss=0.08299, over 4271426.95 frames. ], batch size: 316, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:23:52,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1194702.0, ans=0.125 2023-06-22 09:23:52,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-22 09:24:08,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.102e+02 3.921e+02 4.852e+02 9.563e+02, threshold=7.842e+02, percent-clipped=2.0 2023-06-22 09:24:16,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1194822.0, ans=0.0 2023-06-22 09:24:52,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1194882.0, ans=0.125 2023-06-22 09:24:57,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194942.0, ans=0.1 2023-06-22 09:25:18,418 INFO [train.py:996] (3/4) Epoch 7, batch 16200, loss[loss=0.2698, simple_loss=0.3463, pruned_loss=0.09664, over 21410.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3199, pruned_loss=0.08434, over 4275950.33 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:25:18,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1195002.0, ans=0.2 2023-06-22 09:25:19,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0 2023-06-22 09:26:03,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1195122.0, ans=0.125 2023-06-22 09:26:22,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195182.0, ans=0.125 2023-06-22 09:26:34,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-22 09:26:59,830 INFO [train.py:996] (3/4) Epoch 7, batch 16250, loss[loss=0.2001, simple_loss=0.2693, pruned_loss=0.06545, over 21727.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3203, pruned_loss=0.08555, over 4277688.53 frames. 
], batch size: 124, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:27:05,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1195302.0, ans=15.0 2023-06-22 09:27:31,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.018e+02 3.500e+02 4.433e+02 8.732e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-22 09:27:52,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1195422.0, ans=0.0 2023-06-22 09:27:52,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1195422.0, ans=0.125 2023-06-22 09:28:40,757 INFO [train.py:996] (3/4) Epoch 7, batch 16300, loss[loss=0.1922, simple_loss=0.2832, pruned_loss=0.0506, over 21655.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3146, pruned_loss=0.08226, over 4269173.52 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:28:45,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-22 09:29:34,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195722.0, ans=0.1 2023-06-22 09:30:24,327 INFO [train.py:996] (3/4) Epoch 7, batch 16350, loss[loss=0.1971, simple_loss=0.2724, pruned_loss=0.0609, over 21613.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3141, pruned_loss=0.08204, over 4270295.74 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:30:30,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195902.0, ans=0.125 2023-06-22 09:31:06,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.307e+02 4.109e+02 5.521e+02 1.139e+03, threshold=8.218e+02, percent-clipped=11.0 2023-06-22 09:31:18,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1196022.0, ans=0.0 2023-06-22 09:31:28,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1196082.0, ans=0.1 2023-06-22 09:31:40,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1196082.0, ans=0.0 2023-06-22 09:32:07,007 INFO [train.py:996] (3/4) Epoch 7, batch 16400, loss[loss=0.2701, simple_loss=0.3364, pruned_loss=0.102, over 21779.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.317, pruned_loss=0.0838, over 4278197.99 frames. ], batch size: 441, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:32:12,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1196202.0, ans=0.1 2023-06-22 09:33:18,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1196382.0, ans=0.125 2023-06-22 09:33:29,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=15.0 2023-06-22 09:33:47,412 INFO [train.py:996] (3/4) Epoch 7, batch 16450, loss[loss=0.2493, simple_loss=0.3135, pruned_loss=0.09259, over 21835.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3162, pruned_loss=0.08417, over 4282900.53 frames. 
], batch size: 124, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:33:54,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1196502.0, ans=0.125 2023-06-22 09:34:28,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-22 09:34:29,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.059e+02 3.522e+02 4.400e+02 7.364e+02, threshold=7.044e+02, percent-clipped=0.0 2023-06-22 09:34:38,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-22 09:35:16,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-22 09:35:18,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-22 09:35:28,322 INFO [train.py:996] (3/4) Epoch 7, batch 16500, loss[loss=0.2013, simple_loss=0.27, pruned_loss=0.06628, over 21640.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3134, pruned_loss=0.08379, over 4281332.00 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:37:08,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1197042.0, ans=0.125 2023-06-22 09:37:16,068 INFO [train.py:996] (3/4) Epoch 7, batch 16550, loss[loss=0.2574, simple_loss=0.3374, pruned_loss=0.08864, over 21853.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3122, pruned_loss=0.08145, over 4283501.64 frames. ], batch size: 371, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:37:22,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1197102.0, ans=0.125 2023-06-22 09:37:53,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.830e+02 4.900e+02 6.619e+02 1.240e+03, threshold=9.800e+02, percent-clipped=18.0 2023-06-22 09:38:04,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197222.0, ans=0.1 2023-06-22 09:39:05,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197342.0, ans=0.1 2023-06-22 09:39:08,761 INFO [train.py:996] (3/4) Epoch 7, batch 16600, loss[loss=0.2612, simple_loss=0.3446, pruned_loss=0.08889, over 21775.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3216, pruned_loss=0.08533, over 4284230.60 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:39:35,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1197462.0, ans=0.125 2023-06-22 09:40:24,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197582.0, ans=0.1 2023-06-22 09:40:50,872 INFO [train.py:996] (3/4) Epoch 7, batch 16650, loss[loss=0.3034, simple_loss=0.3748, pruned_loss=0.1159, over 21795.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3306, pruned_loss=0.08765, over 4278228.60 frames. 
], batch size: 441, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:41:26,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.520e+02 3.910e+02 4.811e+02 1.011e+03, threshold=7.820e+02, percent-clipped=1.0 2023-06-22 09:41:52,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1197822.0, ans=0.125 2023-06-22 09:42:39,975 INFO [train.py:996] (3/4) Epoch 7, batch 16700, loss[loss=0.2252, simple_loss=0.3013, pruned_loss=0.0746, over 21674.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3318, pruned_loss=0.08847, over 4272784.12 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:43:12,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1198062.0, ans=0.0 2023-06-22 09:44:27,967 INFO [train.py:996] (3/4) Epoch 7, batch 16750, loss[loss=0.2776, simple_loss=0.3729, pruned_loss=0.09113, over 21571.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3337, pruned_loss=0.09046, over 4272275.74 frames. ], batch size: 414, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:44:29,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.01 vs. limit=5.0 2023-06-22 09:44:38,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-22 09:45:05,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198362.0, ans=0.1 2023-06-22 09:45:09,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 3.471e+02 3.936e+02 4.958e+02 1.171e+03, threshold=7.873e+02, percent-clipped=3.0 2023-06-22 09:45:12,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-22 09:46:00,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1198542.0, ans=0.0 2023-06-22 09:46:00,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1198542.0, ans=0.125 2023-06-22 09:46:11,332 INFO [train.py:996] (3/4) Epoch 7, batch 16800, loss[loss=0.253, simple_loss=0.3168, pruned_loss=0.09459, over 21864.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3378, pruned_loss=0.09082, over 4273684.06 frames. ], batch size: 107, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:46:34,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1198602.0, ans=0.1 2023-06-22 09:47:12,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1198782.0, ans=0.2 2023-06-22 09:47:23,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1198782.0, ans=0.0 2023-06-22 09:47:51,157 INFO [train.py:996] (3/4) Epoch 7, batch 16850, loss[loss=0.2207, simple_loss=0.2852, pruned_loss=0.07816, over 21928.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3351, pruned_loss=0.09118, over 4279371.73 frames. 
], batch size: 316, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:47:54,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1198902.0, ans=0.04949747468305833 2023-06-22 09:48:18,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1198962.0, ans=0.0 2023-06-22 09:48:25,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1198962.0, ans=0.125 2023-06-22 09:48:29,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.467e+02 4.300e+02 5.663e+02 1.182e+03, threshold=8.599e+02, percent-clipped=7.0 2023-06-22 09:48:46,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199022.0, ans=0.1 2023-06-22 09:49:30,245 INFO [train.py:996] (3/4) Epoch 7, batch 16900, loss[loss=0.2124, simple_loss=0.284, pruned_loss=0.07038, over 21654.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3302, pruned_loss=0.09044, over 4280297.90 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:49:48,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1199202.0, ans=0.125 2023-06-22 09:50:20,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1199322.0, ans=0.125 2023-06-22 09:50:22,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-22 09:50:42,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1199382.0, ans=0.1 2023-06-22 09:51:05,718 INFO [train.py:996] (3/4) Epoch 7, batch 16950, loss[loss=0.2372, simple_loss=0.3083, pruned_loss=0.0831, over 21855.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3228, pruned_loss=0.08917, over 4284965.17 frames. ], batch size: 118, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:51:23,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1199502.0, ans=0.025 2023-06-22 09:51:23,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1199502.0, ans=0.2 2023-06-22 09:51:32,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-22 09:51:34,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=8.0 2023-06-22 09:51:35,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1199562.0, ans=0.125 2023-06-22 09:51:37,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1199562.0, ans=0.125 2023-06-22 09:51:45,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.41 vs. 
limit=15.0 2023-06-22 09:51:45,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.915e+02 3.202e+02 3.763e+02 5.382e+02, threshold=6.404e+02, percent-clipped=0.0 2023-06-22 09:52:02,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0 2023-06-22 09:52:08,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1199682.0, ans=0.125 2023-06-22 09:52:08,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1199682.0, ans=0.0 2023-06-22 09:52:50,002 INFO [train.py:996] (3/4) Epoch 7, batch 17000, loss[loss=0.2355, simple_loss=0.3068, pruned_loss=0.0821, over 21358.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3195, pruned_loss=0.08997, over 4285214.85 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:53:18,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1199862.0, ans=0.125 2023-06-22 09:53:21,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1199862.0, ans=0.0 2023-06-22 09:53:50,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1199982.0, ans=0.015 2023-06-22 09:54:24,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1200042.0, ans=0.0 2023-06-22 09:54:38,314 INFO [train.py:996] (3/4) Epoch 7, batch 17050, loss[loss=0.2684, simple_loss=0.347, pruned_loss=0.09495, over 21859.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3234, pruned_loss=0.09152, over 4284513.45 frames. ], batch size: 351, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:54:57,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1200162.0, ans=0.125 2023-06-22 09:55:08,543 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.382e+02 4.158e+02 4.859e+02 8.252e+02, threshold=8.317e+02, percent-clipped=8.0 2023-06-22 09:55:29,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1200282.0, ans=0.125 2023-06-22 09:56:17,460 INFO [train.py:996] (3/4) Epoch 7, batch 17100, loss[loss=0.2583, simple_loss=0.3185, pruned_loss=0.09906, over 21837.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3227, pruned_loss=0.0917, over 4283452.69 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:57:29,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1200642.0, ans=0.125 2023-06-22 09:57:41,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1200642.0, ans=0.05 2023-06-22 09:57:50,015 INFO [train.py:996] (3/4) Epoch 7, batch 17150, loss[loss=0.186, simple_loss=0.2716, pruned_loss=0.05023, over 21759.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3181, pruned_loss=0.09045, over 4290164.09 frames. 
], batch size: 247, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:58:30,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.032e+02 3.543e+02 4.123e+02 6.537e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-22 09:58:33,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-22 09:58:36,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1200822.0, ans=0.2 2023-06-22 09:59:37,089 INFO [train.py:996] (3/4) Epoch 7, batch 17200, loss[loss=0.2272, simple_loss=0.3047, pruned_loss=0.07488, over 20738.00 frames. ], tot_loss[loss=0.25, simple_loss=0.319, pruned_loss=0.09051, over 4291445.90 frames. ], batch size: 608, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:59:40,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1201002.0, ans=0.0 2023-06-22 09:59:50,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1201002.0, ans=0.1 2023-06-22 09:59:50,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1201002.0, ans=0.125 2023-06-22 10:01:15,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1201242.0, ans=0.125 2023-06-22 10:01:20,171 INFO [train.py:996] (3/4) Epoch 7, batch 17250, loss[loss=0.2944, simple_loss=0.3732, pruned_loss=0.1079, over 21520.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3222, pruned_loss=0.0916, over 4279241.14 frames. ], batch size: 131, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:01:22,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-22 10:01:48,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201362.0, ans=0.1 2023-06-22 10:02:01,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.703e+02 3.318e+02 3.860e+02 4.888e+02 8.680e+02, threshold=7.720e+02, percent-clipped=6.0 2023-06-22 10:02:52,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1201542.0, ans=0.125 2023-06-22 10:03:07,124 INFO [train.py:996] (3/4) Epoch 7, batch 17300, loss[loss=0.2983, simple_loss=0.3654, pruned_loss=0.1156, over 21607.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3307, pruned_loss=0.09414, over 4276572.10 frames. 
], batch size: 389, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:03:19,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1201602.0, ans=0.125 2023-06-22 10:03:29,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1201662.0, ans=0.05 2023-06-22 10:03:43,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1201662.0, ans=0.125 2023-06-22 10:03:48,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1201722.0, ans=0.125 2023-06-22 10:04:44,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1201842.0, ans=0.09899494936611666 2023-06-22 10:04:50,679 INFO [train.py:996] (3/4) Epoch 7, batch 17350, loss[loss=0.2257, simple_loss=0.3098, pruned_loss=0.07075, over 21873.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3318, pruned_loss=0.09401, over 4273973.40 frames. ], batch size: 316, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:05:00,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-22 10:05:02,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-22 10:05:22,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1201962.0, ans=0.2 2023-06-22 10:05:31,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-22 10:05:36,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.363e+02 3.779e+02 4.471e+02 7.201e+02, threshold=7.558e+02, percent-clipped=0.0 2023-06-22 10:05:46,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1202022.0, ans=0.1 2023-06-22 10:06:00,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1202082.0, ans=0.0 2023-06-22 10:06:06,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1202082.0, ans=0.125 2023-06-22 10:06:37,403 INFO [train.py:996] (3/4) Epoch 7, batch 17400, loss[loss=0.2147, simple_loss=0.2925, pruned_loss=0.06848, over 21592.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3279, pruned_loss=0.09002, over 4274663.62 frames. ], batch size: 263, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:07:20,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202322.0, ans=0.125 2023-06-22 10:07:23,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1202322.0, ans=0.0 2023-06-22 10:07:32,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. 
limit=15.0 2023-06-22 10:07:35,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1202382.0, ans=0.09899494936611666 2023-06-22 10:08:20,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1202442.0, ans=0.0 2023-06-22 10:08:24,849 INFO [train.py:996] (3/4) Epoch 7, batch 17450, loss[loss=0.1905, simple_loss=0.2621, pruned_loss=0.05944, over 21174.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3272, pruned_loss=0.08812, over 4272900.09 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:08:50,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1202562.0, ans=0.125 2023-06-22 10:08:56,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1202562.0, ans=0.0 2023-06-22 10:09:00,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=15.0 2023-06-22 10:09:02,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.174e+02 3.775e+02 5.488e+02 9.226e+02, threshold=7.551e+02, percent-clipped=5.0 2023-06-22 10:09:13,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-22 10:09:33,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202682.0, ans=0.1 2023-06-22 10:10:06,308 INFO [train.py:996] (3/4) Epoch 7, batch 17500, loss[loss=0.2636, simple_loss=0.3235, pruned_loss=0.1018, over 21650.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3215, pruned_loss=0.08492, over 4271383.38 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:11:05,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202982.0, ans=0.1 2023-06-22 10:11:09,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=15.0 2023-06-22 10:11:41,902 INFO [train.py:996] (3/4) Epoch 7, batch 17550, loss[loss=0.2143, simple_loss=0.3096, pruned_loss=0.05955, over 21644.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3203, pruned_loss=0.08349, over 4269369.55 frames. ], batch size: 230, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:12:11,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1203162.0, ans=0.125 2023-06-22 10:12:14,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.829e+02 3.350e+02 3.891e+02 7.522e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-22 10:12:45,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1203282.0, ans=0.125 2023-06-22 10:12:50,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1203282.0, ans=0.0 2023-06-22 10:13:22,584 INFO [train.py:996] (3/4) Epoch 7, batch 17600, loss[loss=0.2822, simple_loss=0.3542, pruned_loss=0.1051, over 21564.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3241, pruned_loss=0.08415, over 4258121.78 frames. 
], batch size: 389, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:13:51,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-22 10:13:54,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1203522.0, ans=0.2 2023-06-22 10:14:13,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=15.0 2023-06-22 10:15:03,717 INFO [train.py:996] (3/4) Epoch 7, batch 17650, loss[loss=0.178, simple_loss=0.2375, pruned_loss=0.05924, over 21689.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3222, pruned_loss=0.08479, over 4263670.38 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:15:21,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.53 vs. limit=15.0 2023-06-22 10:15:36,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.660e+02 3.234e+02 3.859e+02 4.407e+02 8.519e+02, threshold=7.719e+02, percent-clipped=7.0 2023-06-22 10:16:46,333 INFO [train.py:996] (3/4) Epoch 7, batch 17700, loss[loss=0.2706, simple_loss=0.3538, pruned_loss=0.09367, over 21915.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3152, pruned_loss=0.08211, over 4257161.21 frames. ], batch size: 372, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:16:56,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-22 10:17:17,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1204062.0, ans=0.125 2023-06-22 10:18:18,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1204242.0, ans=0.125 2023-06-22 10:18:20,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1204242.0, ans=0.2 2023-06-22 10:18:29,532 INFO [train.py:996] (3/4) Epoch 7, batch 17750, loss[loss=0.257, simple_loss=0.3319, pruned_loss=0.09108, over 21293.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3224, pruned_loss=0.08551, over 4262034.15 frames. 
], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:18:31,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1204302.0, ans=0.125 2023-06-22 10:18:59,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1204362.0, ans=0.035 2023-06-22 10:18:59,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1204362.0, ans=0.05 2023-06-22 10:19:09,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1204362.0, ans=0.125 2023-06-22 10:19:13,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.318e+02 4.087e+02 5.384e+02 1.002e+03, threshold=8.174e+02, percent-clipped=10.0 2023-06-22 10:19:25,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1204422.0, ans=0.125 2023-06-22 10:19:35,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.02 vs. limit=10.0 2023-06-22 10:20:10,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1204602.0, ans=0.125 2023-06-22 10:20:11,935 INFO [train.py:996] (3/4) Epoch 7, batch 17800, loss[loss=0.2527, simple_loss=0.3475, pruned_loss=0.07894, over 21286.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3218, pruned_loss=0.08426, over 4264554.35 frames. ], batch size: 549, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:20:24,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-22 10:20:35,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1204662.0, ans=0.0 2023-06-22 10:20:47,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1204662.0, ans=0.125 2023-06-22 10:21:15,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1204782.0, ans=0.125 2023-06-22 10:21:36,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1204842.0, ans=0.125 2023-06-22 10:21:39,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1204842.0, ans=0.1 2023-06-22 10:21:55,059 INFO [train.py:996] (3/4) Epoch 7, batch 17850, loss[loss=0.2962, simple_loss=0.3566, pruned_loss=0.1179, over 21715.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3237, pruned_loss=0.0857, over 4266201.80 frames. 
], batch size: 298, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:22:02,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1204902.0, ans=0.125 2023-06-22 10:22:04,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1204902.0, ans=0.2 2023-06-22 10:22:15,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1204962.0, ans=0.2 2023-06-22 10:22:45,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.209e+02 3.990e+02 4.443e+02 8.332e+02, threshold=7.980e+02, percent-clipped=3.0 2023-06-22 10:23:02,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1205082.0, ans=0.0 2023-06-22 10:23:38,637 INFO [train.py:996] (3/4) Epoch 7, batch 17900, loss[loss=0.2504, simple_loss=0.3411, pruned_loss=0.07983, over 21659.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3293, pruned_loss=0.08817, over 4275281.13 frames. ], batch size: 263, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:24:23,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1205262.0, ans=0.125 2023-06-22 10:24:29,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1205322.0, ans=0.125 2023-06-22 10:24:39,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1205322.0, ans=0.0 2023-06-22 10:24:45,948 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:25:20,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1205442.0, ans=0.0 2023-06-22 10:25:24,778 INFO [train.py:996] (3/4) Epoch 7, batch 17950, loss[loss=0.2459, simple_loss=0.3378, pruned_loss=0.07697, over 21471.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.327, pruned_loss=0.08405, over 4278459.95 frames. ], batch size: 471, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:25:47,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1205502.0, ans=0.125 2023-06-22 10:25:54,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1205562.0, ans=0.125 2023-06-22 10:26:02,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 10:26:08,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.180e+02 3.649e+02 4.821e+02 7.234e+02, threshold=7.298e+02, percent-clipped=0.0 2023-06-22 10:26:40,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0
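
Each [train.py:996] record pairs the current batch's loss over its own frames with tot_loss, a running figure over roughly 4.27 million frames; a window of a couple hundred batches of ~21k frames each would span about that many, so the natural reading is a frame-weighted average over recent batches. A hedged sketch of that bookkeeping (the class and window size are assumptions, not the actual train.py code):

    from collections import deque

    class FrameWeightedLoss:
        """Frame-weighted running average of per-batch losses."""

        def __init__(self, max_batches: int = 200):
            self.window = deque(maxlen=max_batches)   # (loss * frames, frames)

        def update(self, loss_value: float, num_frames: float) -> None:
            self.window.append((loss_value * num_frames, num_frames))

        @property
        def tot_frames(self) -> float:
            return sum(f for _, f in self.window)

        @property
        def tot_loss(self) -> float:
            return sum(s for s, _ in self.window) / self.tot_frames

    stats = FrameWeightedLoss()
    stats.update(0.2504, 21659.00)   # batch 17900's loss over its frames
    print(stats.tot_loss, stats.tot_frames)
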
2023-06-22 10:26:46,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1205742.0, ans=0.0 2023-06-22 10:27:07,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1205742.0, ans=0.2 2023-06-22 10:27:10,984 INFO [train.py:996] (3/4) Epoch 7, batch 18000, loss[loss=0.2426, simple_loss=0.3028, pruned_loss=0.09115, over 21758.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.321, pruned_loss=0.0824, over 4277724.34 frames. ], batch size: 371, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:27:10,984 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 10:27:30,136 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.265, simple_loss=0.3646, pruned_loss=0.08269, over 1796401.00 frames. 2023-06-22 10:27:30,137 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 10:27:31,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-22 10:28:16,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1205922.0, ans=0.125 2023-06-22 10:29:01,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1206042.0, ans=0.07 2023-06-22 10:29:04,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1206042.0, ans=0.2 2023-06-22 10:29:09,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1206042.0, ans=0.125 2023-06-22 10:29:12,727 INFO [train.py:996] (3/4) Epoch 7, batch 18050, loss[loss=0.2648, simple_loss=0.3306, pruned_loss=0.09953, over 21375.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3149, pruned_loss=0.08213, over 4278954.67 frames. ], batch size: 211, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:29:20,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1206102.0, ans=0.0 2023-06-22 10:29:50,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1206222.0, ans=0.125 2023-06-22 10:29:52,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 3.561e+02 4.207e+02 5.144e+02 1.104e+03, threshold=8.414e+02, percent-clipped=10.0 2023-06-22 10:29:55,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1206222.0, ans=0.125 2023-06-22 10:30:13,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=22.5 2023-06-22 10:30:54,999 INFO [train.py:996] (3/4) Epoch 7, batch 18100, loss[loss=0.2273, simple_loss=0.3071, pruned_loss=0.07373, over 19984.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3197, pruned_loss=0.08415, over 4272435.96 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 32.0
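
The [train.py:1019] through [train.py:1029] records above show the periodic validation pass: at batch 18000 training pauses, the dev set is scored without gradients (loss=0.265 over 1796401.00 frames), and peak CUDA memory is reported (24463MB). A minimal sketch of such a hook; the interval, device string, and compute_loss signature are illustrative assumptions:

    import torch

    def maybe_validate(model, dev_loader, compute_loss, batch_idx: int,
                       valid_interval: int = 3000, device: str = "cuda:3") -> None:
        if batch_idx % valid_interval != 0:
            return
        print("Computing validation loss")
        model.eval()
        loss_sum, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch)
                loss_sum += loss.item() * num_frames
                frames += num_frames
        print(f"validation: loss={loss_sum / frames:.4g}, over {frames:.2f} frames.")
        peak_mb = torch.cuda.max_memory_allocated(torch.device(device)) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {peak_mb}MB")
        model.train()
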
2023-06-22 10:31:29,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1206462.0, ans=0.125 2023-06-22 10:32:11,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206582.0, ans=0.1 2023-06-22 10:32:16,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1206582.0, ans=0.125 2023-06-22 10:32:18,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206642.0, ans=0.1 2023-06-22 10:32:35,094 INFO [train.py:996] (3/4) Epoch 7, batch 18150, loss[loss=0.2499, simple_loss=0.3149, pruned_loss=0.09244, over 21811.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3215, pruned_loss=0.08404, over 4266071.87 frames. ], batch size: 317, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:33:15,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.134e+02 3.517e+02 4.943e+02 8.965e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 10:34:13,170 INFO [train.py:996] (3/4) Epoch 7, batch 18200, loss[loss=0.2193, simple_loss=0.2859, pruned_loss=0.07636, over 21891.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3153, pruned_loss=0.08432, over 4260199.11 frames. ], batch size: 98, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:34:42,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1207122.0, ans=0.125 2023-06-22 10:34:51,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=15.0 2023-06-22 10:34:51,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207122.0, ans=0.1 2023-06-22 10:35:01,525 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:35:17,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1207182.0, ans=0.0 2023-06-22 10:35:28,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1207242.0, ans=0.0 2023-06-22 10:35:33,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1207242.0, ans=0.0 2023-06-22 10:35:50,297 INFO [train.py:996] (3/4) Epoch 7, batch 18250, loss[loss=0.2448, simple_loss=0.3114, pruned_loss=0.0891, over 21753.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.307, pruned_loss=0.0814, over 4264079.22 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:35:59,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1207302.0, ans=0.125 2023-06-22 10:36:20,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.52 vs.
limit=10.0 2023-06-22 10:36:25,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.178e+02 4.108e+02 6.214e+02 1.567e+03, threshold=8.215e+02, percent-clipped=16.0 2023-06-22 10:37:21,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1207542.0, ans=0.0 2023-06-22 10:37:29,320 INFO [train.py:996] (3/4) Epoch 7, batch 18300, loss[loss=0.2528, simple_loss=0.3439, pruned_loss=0.08087, over 21738.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3082, pruned_loss=0.08223, over 4266738.12 frames. ], batch size: 247, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:08,693 INFO [train.py:996] (3/4) Epoch 7, batch 18350, loss[loss=0.2399, simple_loss=0.34, pruned_loss=0.0699, over 21593.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3135, pruned_loss=0.08138, over 4264690.17 frames. ], batch size: 230, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:18,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207902.0, ans=0.1 2023-06-22 10:39:43,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.179e+02 3.735e+02 4.992e+02 1.231e+03, threshold=7.469e+02, percent-clipped=7.0 2023-06-22 10:39:51,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-22 10:40:12,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208082.0, ans=0.1 2023-06-22 10:40:49,851 INFO [train.py:996] (3/4) Epoch 7, batch 18400, loss[loss=0.194, simple_loss=0.2799, pruned_loss=0.05411, over 21782.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3073, pruned_loss=0.07945, over 4250447.28 frames. ], batch size: 352, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:41:08,124 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:41:42,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-22 10:41:57,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208382.0, ans=0.1 2023-06-22 10:42:04,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1208382.0, ans=0.2 2023-06-22 10:42:21,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1208442.0, ans=0.125 2023-06-22 10:42:29,235 INFO [train.py:996] (3/4) Epoch 7, batch 18450, loss[loss=0.1936, simple_loss=0.2829, pruned_loss=0.05218, over 21274.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3041, pruned_loss=0.07593, over 4240841.31 frames. 
], batch size: 551, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:42:34,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1208502.0, ans=0.125 2023-06-22 10:43:04,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.170e+02 3.772e+02 5.072e+02 1.044e+03, threshold=7.545e+02, percent-clipped=1.0 2023-06-22 10:43:32,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1208682.0, ans=0.125 2023-06-22 10:44:09,088 INFO [train.py:996] (3/4) Epoch 7, batch 18500, loss[loss=0.2283, simple_loss=0.3158, pruned_loss=0.07037, over 21712.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3004, pruned_loss=0.07554, over 4246984.98 frames. ], batch size: 332, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:44:17,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1208802.0, ans=0.125 2023-06-22 10:45:50,051 INFO [train.py:996] (3/4) Epoch 7, batch 18550, loss[loss=0.1948, simple_loss=0.2547, pruned_loss=0.0674, over 17038.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2976, pruned_loss=0.07454, over 4247941.35 frames. ], batch size: 67, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:46:01,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1209102.0, ans=0.125 2023-06-22 10:46:22,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1209222.0, ans=0.0 2023-06-22 10:46:32,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.124e+02 3.693e+02 4.756e+02 1.140e+03, threshold=7.385e+02, percent-clipped=12.0 2023-06-22 10:46:42,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1209282.0, ans=0.125 2023-06-22 10:47:00,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1209282.0, ans=0.125 2023-06-22 10:47:08,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1209342.0, ans=0.2 2023-06-22 10:47:22,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209342.0, ans=0.1 2023-06-22 10:47:30,136 INFO [train.py:996] (3/4) Epoch 7, batch 18600, loss[loss=0.2426, simple_loss=0.3109, pruned_loss=0.08716, over 21698.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2957, pruned_loss=0.07571, over 4223753.60 frames. 
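A pattern worth noting in the per-batch loss[...] and tot_loss[...] fields: to the printed precision, loss always equals 0.5 * simple_loss + pruned_loss (e.g. 0.5 x 0.3158 + 0.07037 = 0.2283 in the batch 18500 entry above). That is the usual pruned-transducer combination: a cheap "simple" loss from a trivial joiner, used to derive lattice pruning bounds, is down-weighted and added to the pruned RNN-T loss evaluated only inside the pruned region. A sketch of just the combination; the 0.5 weight is inferred from the printed numbers, and the k2 calls that produce the two terms are omitted here:

    def combine_transducer_losses(
        simple_loss: float,
        pruned_loss: float,
        simple_loss_scale: float = 0.5,
    ) -> float:
        """Weighted combination printed as `loss` in the per-batch log lines.

        The simple loss (trivial joiner, used only to obtain pruning bounds)
        is down-weighted; the pruned loss (full joiner, evaluated on the
        pruned lattice region) carries full weight.
        """
        return simple_loss_scale * simple_loss + pruned_loss

    # Reproduces the batch 18500 entry above: 0.5 * 0.3158 + 0.07037 = 0.2283
    assert abs(combine_transducer_losses(0.3158, 0.07037) - 0.2283) < 1e-4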
], batch size: 333, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:47:49,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1209462.0, ans=0.0 2023-06-22 10:47:54,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1209462.0, ans=0.07 2023-06-22 10:47:59,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209462.0, ans=0.1 2023-06-22 10:48:22,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1209582.0, ans=0.0 2023-06-22 10:48:24,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209582.0, ans=0.1 2023-06-22 10:48:42,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1209582.0, ans=0.0 2023-06-22 10:48:50,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1209642.0, ans=0.125 2023-06-22 10:49:09,429 INFO [train.py:996] (3/4) Epoch 7, batch 18650, loss[loss=0.2779, simple_loss=0.339, pruned_loss=0.1085, over 21430.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2954, pruned_loss=0.07632, over 4207430.13 frames. ], batch size: 473, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:49:25,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-06-22 10:49:45,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 3.160e+02 3.578e+02 4.366e+02 8.700e+02, threshold=7.156e+02, percent-clipped=2.0 2023-06-22 10:49:47,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-22 10:50:47,140 INFO [train.py:996] (3/4) Epoch 7, batch 18700, loss[loss=0.2287, simple_loss=0.2909, pruned_loss=0.08321, over 21823.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2934, pruned_loss=0.07742, over 4216207.14 frames. ], batch size: 282, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:50:47,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1210002.0, ans=0.125 2023-06-22 10:50:59,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1210002.0, ans=0.125 2023-06-22 10:51:02,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1210062.0, ans=0.125 2023-06-22 10:51:18,465 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:51:26,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1210122.0, ans=0.0 2023-06-22 10:51:50,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1210182.0, ans=0.05 2023-06-22 10:52:26,869 INFO [train.py:996] (3/4) Epoch 7, batch 18750, loss[loss=0.2656, simple_loss=0.3453, pruned_loss=0.0929, over 21382.00 frames. 
], tot_loss[loss=0.2286, simple_loss=0.2959, pruned_loss=0.08066, over 4230246.52 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:52:32,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1210302.0, ans=0.125 2023-06-22 10:53:01,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-22 10:53:03,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.195e+02 3.885e+02 4.969e+02 1.061e+03, threshold=7.770e+02, percent-clipped=4.0 2023-06-22 10:54:02,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1210542.0, ans=0.0 2023-06-22 10:54:05,688 INFO [train.py:996] (3/4) Epoch 7, batch 18800, loss[loss=0.2266, simple_loss=0.3116, pruned_loss=0.07079, over 21757.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3026, pruned_loss=0.08187, over 4242751.94 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:54:12,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1210602.0, ans=0.125 2023-06-22 10:55:44,295 INFO [train.py:996] (3/4) Epoch 7, batch 18850, loss[loss=0.2031, simple_loss=0.3002, pruned_loss=0.05294, over 21600.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3008, pruned_loss=0.07701, over 4255798.23 frames. ], batch size: 389, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:55:49,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1210902.0, ans=0.125 2023-06-22 10:55:59,061 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:56:02,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1210962.0, ans=0.125 2023-06-22 10:56:19,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1211022.0, ans=0.125 2023-06-22 10:56:21,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 3.160e+02 3.995e+02 5.299e+02 8.301e+02, threshold=7.991e+02, percent-clipped=3.0 2023-06-22 10:56:51,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1211082.0, ans=0.125 2023-06-22 10:57:03,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1211142.0, ans=15.0 2023-06-22 10:57:24,809 INFO [train.py:996] (3/4) Epoch 7, batch 18900, loss[loss=0.1666, simple_loss=0.2446, pruned_loss=0.04431, over 21281.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2995, pruned_loss=0.07726, over 4243507.77 frames. 
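The scaling.py:182 ScheduledFloat entries scattered through the log show module hyperparameters (dropout_p, balancer prob, skip rates, whitening limits, scale_min, ...) being re-evaluated against the global batch_count; by batch_count of roughly 1.2e6, as above, nearly all report settled end-of-schedule values such as ans=0.125, ans=0.1 or ans=0.0. A plausible minimal reading is a piecewise-linear schedule over batch_count, clamped at both ends; the class below sketches that idea and is not icefall's actual ScheduledFloat.

    from bisect import bisect_right

    class ScheduledFloatSketch:
        """Piecewise-linear value of a hyperparameter vs. batch_count.

        schedule: (batch_count, value) breakpoints; outside the breakpoints
        the value is clamped to the nearest endpoint.
        """

        def __init__(self, *schedule):
            self.points = sorted(schedule)

        def __call__(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            i = bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

    # e.g. a dropout that anneals from 0.3 to a floor of 0.1 over the first
    # 20k batches; by batch_count=1211202.0 it reports ans=0.1, as above.
    dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
    assert dropout_p(1211202.0) == 0.1

Scheduling regularizers this way lets early training run with heavy dropout and skip rates and anneal them away without ever touching the optimizer state.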
], batch size: 176, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:57:29,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1211202.0, ans=0.125 2023-06-22 10:57:34,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1211202.0, ans=0.125 2023-06-22 10:58:08,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1211322.0, ans=0.1 2023-06-22 10:58:19,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1211382.0, ans=0.125 2023-06-22 10:58:25,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-22 10:58:35,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1211382.0, ans=10.0 2023-06-22 10:59:00,592 INFO [train.py:996] (3/4) Epoch 7, batch 18950, loss[loss=0.2284, simple_loss=0.3035, pruned_loss=0.07665, over 21662.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2991, pruned_loss=0.07978, over 4248601.93 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:59:01,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-22 10:59:38,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 3.300e+02 3.868e+02 4.844e+02 6.994e+02, threshold=7.736e+02, percent-clipped=0.0 2023-06-22 10:59:58,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211682.0, ans=0.1 2023-06-22 11:00:38,063 INFO [train.py:996] (3/4) Epoch 7, batch 19000, loss[loss=0.2845, simple_loss=0.3541, pruned_loss=0.1074, over 21444.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3074, pruned_loss=0.08106, over 4255425.01 frames. ], batch size: 211, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 11:00:47,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-22 11:01:49,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1211982.0, ans=0.0 2023-06-22 11:01:59,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-22 11:02:12,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-22 11:02:18,608 INFO [train.py:996] (3/4) Epoch 7, batch 19050, loss[loss=0.2805, simple_loss=0.3412, pruned_loss=0.1099, over 21841.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3136, pruned_loss=0.08531, over 4259578.97 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:02:44,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=15.0 2023-06-22 11:02:46,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1212162.0, ans=10.0 2023-06-22 11:02:48,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1212222.0, ans=0.0 2023-06-22 11:03:06,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.274e+02 3.680e+02 4.051e+02 6.947e+02, threshold=7.360e+02, percent-clipped=0.0 2023-06-22 11:03:57,454 INFO [train.py:996] (3/4) Epoch 7, batch 19100, loss[loss=0.2125, simple_loss=0.276, pruned_loss=0.07453, over 21864.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3112, pruned_loss=0.086, over 4270870.19 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:04:57,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1212522.0, ans=0.1 2023-06-22 11:05:14,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1212582.0, ans=0.1 2023-06-22 11:05:24,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1212642.0, ans=0.125 2023-06-22 11:05:40,898 INFO [train.py:996] (3/4) Epoch 7, batch 19150, loss[loss=0.2422, simple_loss=0.3392, pruned_loss=0.0726, over 21700.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3155, pruned_loss=0.087, over 4271993.86 frames. ], batch size: 247, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:06:42,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 3.652e+02 4.521e+02 6.039e+02 1.131e+03, threshold=9.042e+02, percent-clipped=10.0 2023-06-22 11:06:50,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-22 11:07:23,657 INFO [train.py:996] (3/4) Epoch 7, batch 19200, loss[loss=0.2552, simple_loss=0.3515, pruned_loss=0.07947, over 21802.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3232, pruned_loss=0.0867, over 4265848.16 frames. ], batch size: 332, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:09:03,365 INFO [train.py:996] (3/4) Epoch 7, batch 19250, loss[loss=0.238, simple_loss=0.3165, pruned_loss=0.07973, over 21607.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3253, pruned_loss=0.0827, over 4265316.93 frames. ], batch size: 471, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:09:39,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1213362.0, ans=0.2 2023-06-22 11:10:04,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 3.337e+02 4.181e+02 5.636e+02 1.044e+03, threshold=8.362e+02, percent-clipped=4.0 2023-06-22 11:10:18,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-22 11:10:23,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1213482.0, ans=0.2 2023-06-22 11:10:43,612 INFO [train.py:996] (3/4) Epoch 7, batch 19300, loss[loss=0.2339, simple_loss=0.3148, pruned_loss=0.07647, over 21679.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.322, pruned_loss=0.08261, over 4277902.56 frames. 
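The scaling.py:962 Whitening entries compare a per-module statistic against a limit (metric=9.44 vs. limit=22.5 and similar above). The metric measures how far the activation covariance is from a multiple of the identity: it is 1.0 for perfectly "white" features and grows as variance concentrates in few directions, so the limit bounds how non-white a module's output may become before the module intervenes. Below is a sketch of one statistic with exactly that behaviour, assuming num_groups=1; the exact normalization in icefall's scaling.py may differ.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels).  Returns
        num_channels * trace(cov @ cov) / trace(cov) ** 2,
        which is >= 1.0 by Cauchy-Schwarz, with equality exactly when the
        channel covariance is proportional to the identity ("white")."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        return float(cov.shape[0] * (cov @ cov).trace() / cov.trace() ** 2)

    white = torch.randn(10000, 256)          # independent channels: metric ~ 1.0
    collapsed = white[:, :1].repeat(1, 256)  # rank-1 activations: metric = 256.0
    print(whitening_metric(white), whitening_metric(collapsed))

The huge metrics in the first minutes of training (hundreds, against limits under 10) versus the single-digit values here show the whitening pressure doing its job as the encoder matures.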
], batch size: 389, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:10:47,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1213602.0, ans=0.07 2023-06-22 11:11:19,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1213662.0, ans=0.0 2023-06-22 11:12:00,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1213782.0, ans=0.2 2023-06-22 11:12:18,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1213842.0, ans=0.125 2023-06-22 11:12:31,341 INFO [train.py:996] (3/4) Epoch 7, batch 19350, loss[loss=0.2185, simple_loss=0.31, pruned_loss=0.0635, over 21560.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.317, pruned_loss=0.0798, over 4279502.20 frames. ], batch size: 441, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:12:33,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1213902.0, ans=0.035 2023-06-22 11:13:20,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.132e+02 3.696e+02 4.468e+02 9.223e+02, threshold=7.391e+02, percent-clipped=2.0 2023-06-22 11:13:22,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=22.5 2023-06-22 11:14:04,201 INFO [train.py:996] (3/4) Epoch 7, batch 19400, loss[loss=0.2055, simple_loss=0.3022, pruned_loss=0.0544, over 19877.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3138, pruned_loss=0.07813, over 4284230.12 frames. ], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:14:12,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1214202.0, ans=0.125 2023-06-22 11:15:11,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1214382.0, ans=0.125 2023-06-22 11:15:19,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1214382.0, ans=0.0 2023-06-22 11:15:29,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1214442.0, ans=0.125 2023-06-22 11:15:43,312 INFO [train.py:996] (3/4) Epoch 7, batch 19450, loss[loss=0.2553, simple_loss=0.3096, pruned_loss=0.1006, over 21859.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3112, pruned_loss=0.08016, over 4290359.14 frames. ], batch size: 98, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:15:45,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1214502.0, ans=0.2 2023-06-22 11:15:47,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-22 11:16:14,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1214562.0, ans=0.2 2023-06-22 11:16:27,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214562.0, ans=0.1 2023-06-22 11:16:37,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1214622.0, ans=0.0 2023-06-22 11:16:38,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.537e+02 3.055e+02 3.774e+02 4.517e+02 1.086e+03, threshold=7.548e+02, percent-clipped=5.0 2023-06-22 11:16:43,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1214622.0, ans=0.125 2023-06-22 11:16:54,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1214682.0, ans=0.025 2023-06-22 11:17:04,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1214742.0, ans=0.015 2023-06-22 11:17:09,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1214742.0, ans=0.125 2023-06-22 11:17:16,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-06-22 11:17:23,328 INFO [train.py:996] (3/4) Epoch 7, batch 19500, loss[loss=0.2499, simple_loss=0.3116, pruned_loss=0.09411, over 21641.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3067, pruned_loss=0.0808, over 4283665.26 frames. ], batch size: 263, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:17:23,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214802.0, ans=0.125 2023-06-22 11:18:08,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1214862.0, ans=0.125 2023-06-22 11:18:41,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.28 vs. limit=15.0 2023-06-22 11:19:05,686 INFO [train.py:996] (3/4) Epoch 7, batch 19550, loss[loss=0.1773, simple_loss=0.2348, pruned_loss=0.05992, over 21901.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3005, pruned_loss=0.07846, over 4268920.28 frames. ], batch size: 107, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:19:12,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1215102.0, ans=0.125 2023-06-22 11:19:55,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 3.073e+02 3.530e+02 4.388e+02 8.690e+02, threshold=7.059e+02, percent-clipped=2.0 2023-06-22 11:20:39,258 INFO [train.py:996] (3/4) Epoch 7, batch 19600, loss[loss=0.2247, simple_loss=0.2926, pruned_loss=0.07838, over 21812.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3025, pruned_loss=0.07958, over 4276299.47 frames. 
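Each batch summary carries two loss groups: loss[...] for the current batch and tot_loss[...] aggregated over recent batches, both annotated with the frame count they cover. The per-batch counts are whole numbers of frames, but the running counts are fractional (over 4283665.26 frames above), which points to an exponentially decayed, frames-weighted sum rather than a plain sliding window. A sketch of such a tracker follows; the decay constant and class name are assumptions.

    class RunningLossTracker:
        """Frames-weighted, exponentially decayed aggregate of batch losses.

        Each update decays the running sums slightly before adding the new
        batch, so the effective window is ~1/decay batches and the reported
        frame count becomes fractional, as in the tot_loss log fields.
        """

        def __init__(self, decay: float = 1.0 / 200):
            self.decay = decay
            self.frames = 0.0
            self.sums = {}  # metric name -> frames-weighted running sum

        def update(self, frames: float, **losses: float) -> None:
            keep = 1.0 - self.decay
            self.frames = keep * self.frames + frames
            for name, value in losses.items():
                self.sums[name] = keep * self.sums.get(name, 0.0) + value * frames

        def __str__(self) -> str:
            body = ", ".join(
                f"{k}={v / self.frames:.4g}" for k, v in self.sums.items()
            )
            return f"tot_loss[{body}, over {self.frames:.2f} frames. ]"

    tot = RunningLossTracker()
    tot.update(21641.0, loss=0.2499, simple_loss=0.3116, pruned_loss=0.09411)
    print(tot)  # averages are per frame, weighted by each batch's frame count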
], batch size: 247, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:20:47,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1215402.0, ans=0.125 2023-06-22 11:21:11,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1215462.0, ans=0.0 2023-06-22 11:21:44,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1215582.0, ans=0.2 2023-06-22 11:21:49,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1215582.0, ans=0.125 2023-06-22 11:22:22,651 INFO [train.py:996] (3/4) Epoch 7, batch 19650, loss[loss=0.3106, simple_loss=0.3552, pruned_loss=0.133, over 21628.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3085, pruned_loss=0.08417, over 4281345.86 frames. ], batch size: 510, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:22:39,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1215702.0, ans=0.125 2023-06-22 11:23:15,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.627e+02 4.077e+02 5.113e+02 8.180e+02, threshold=8.154e+02, percent-clipped=7.0 2023-06-22 11:24:05,572 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:24:16,439 INFO [train.py:996] (3/4) Epoch 7, batch 19700, loss[loss=0.2463, simple_loss=0.3456, pruned_loss=0.07349, over 21185.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3133, pruned_loss=0.08499, over 4282867.36 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:24:16,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1216002.0, ans=0.0 2023-06-22 11:24:29,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1216002.0, ans=0.125 2023-06-22 11:24:48,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1216122.0, ans=0.1 2023-06-22 11:25:18,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1216182.0, ans=0.0 2023-06-22 11:25:58,821 INFO [train.py:996] (3/4) Epoch 7, batch 19750, loss[loss=0.3108, simple_loss=0.4105, pruned_loss=0.1056, over 21868.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3224, pruned_loss=0.08596, over 4278811.89 frames. ], batch size: 372, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:26:21,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1216362.0, ans=0.125 2023-06-22 11:26:44,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.745e+02 3.729e+02 4.611e+02 5.991e+02 1.312e+03, threshold=9.223e+02, percent-clipped=7.0 2023-06-22 11:27:27,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1216542.0, ans=0.0 2023-06-22 11:27:29,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1216542.0, ans=0.125 2023-06-22 11:27:38,675 INFO [train.py:996] (3/4) Epoch 7, batch 19800, loss[loss=0.2093, simple_loss=0.2706, pruned_loss=0.07404, over 21648.00 frames. 
], tot_loss[loss=0.2492, simple_loss=0.3239, pruned_loss=0.08728, over 4271943.51 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:27:51,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.09 vs. limit=10.0 2023-06-22 11:27:52,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1216602.0, ans=0.0 2023-06-22 11:28:49,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1216782.0, ans=0.125 2023-06-22 11:29:21,341 INFO [train.py:996] (3/4) Epoch 7, batch 19850, loss[loss=0.2178, simple_loss=0.2963, pruned_loss=0.0696, over 21775.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3163, pruned_loss=0.08184, over 4268714.70 frames. ], batch size: 316, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:29:55,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-22 11:30:12,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.965e+02 3.588e+02 4.617e+02 1.028e+03, threshold=7.176e+02, percent-clipped=3.0 2023-06-22 11:30:24,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1217082.0, ans=0.125 2023-06-22 11:30:45,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1217142.0, ans=0.0 2023-06-22 11:30:46,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1217142.0, ans=0.125 2023-06-22 11:31:00,270 INFO [train.py:996] (3/4) Epoch 7, batch 19900, loss[loss=0.2152, simple_loss=0.2803, pruned_loss=0.07509, over 21337.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.315, pruned_loss=0.07861, over 4269483.24 frames. ], batch size: 211, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:31:26,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1217262.0, ans=0.2 2023-06-22 11:31:28,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1217262.0, ans=0.0 2023-06-22 11:31:57,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1217322.0, ans=0.125 2023-06-22 11:32:41,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-22 11:32:42,595 INFO [train.py:996] (3/4) Epoch 7, batch 19950, loss[loss=0.1981, simple_loss=0.3187, pruned_loss=0.03877, over 19782.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3112, pruned_loss=0.07902, over 4261971.81 frames. 
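The grad_scale field at the end of each batch summary takes the values 32.0, 16.0 and, later in this epoch, 8.0: the signature of dynamic loss scaling for fp16 training, where the scale is cut as soon as scaled gradients overflow and grows back after a long enough run of clean steps. That is exactly the policy of PyTorch's stock torch.cuda.amp.GradScaler; a minimal training-step sketch, with model, optimizer, batch and loss_fn as placeholders:

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=2.0 ** 5,   # would be printed as grad_scale: 32.0
        growth_factor=2.0,     # double after `growth_interval` clean steps
        backoff_factor=0.5,    # halve immediately on inf/nan gradients
        growth_interval=2000,
    )

    def train_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():            # forward in fp16 on CUDA
            loss = loss_fn(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()              # backward on the scaled loss
        scaler.step(optimizer)                     # skipped if grads overflowed
        scaler.update()                            # adjust scale for next batch
        return loss.detach(), scaler.get_scale()   # get_scale() -> e.g. 16.0

A step whose gradients contain inf/nan is simply dropped, so a halving of grad_scale in the log also marks one silently skipped optimizer step.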
], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:33:10,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1217562.0, ans=0.1 2023-06-22 11:33:38,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1217622.0, ans=0.125 2023-06-22 11:33:39,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.305e+02 4.065e+02 5.440e+02 9.798e+02, threshold=8.130e+02, percent-clipped=10.0 2023-06-22 11:34:06,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=12.0 2023-06-22 11:34:07,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1217742.0, ans=0.125 2023-06-22 11:34:18,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1217742.0, ans=0.0 2023-06-22 11:34:22,877 INFO [train.py:996] (3/4) Epoch 7, batch 20000, loss[loss=0.2481, simple_loss=0.3209, pruned_loss=0.08766, over 21497.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.312, pruned_loss=0.07945, over 4263775.75 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:34:41,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1217802.0, ans=0.1 2023-06-22 11:34:52,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1217862.0, ans=0.125 2023-06-22 11:35:30,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-22 11:35:32,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1217982.0, ans=0.2 2023-06-22 11:35:55,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1218042.0, ans=0.125 2023-06-22 11:36:02,150 INFO [train.py:996] (3/4) Epoch 7, batch 20050, loss[loss=0.2362, simple_loss=0.3131, pruned_loss=0.07967, over 21617.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3132, pruned_loss=0.08204, over 4275439.77 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:36:30,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1218162.0, ans=0.07 2023-06-22 11:36:31,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1218162.0, ans=0.125 2023-06-22 11:36:47,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-22 11:36:57,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2023-06-22 11:37:00,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.182e+02 3.871e+02 4.475e+02 7.153e+02, threshold=7.741e+02, percent-clipped=0.0 2023-06-22 11:37:10,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1218282.0, ans=0.2 2023-06-22 11:37:10,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1218282.0, ans=0.125 2023-06-22 11:37:13,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1218282.0, ans=0.5 2023-06-22 11:37:32,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=12.0 2023-06-22 11:37:49,399 INFO [train.py:996] (3/4) Epoch 7, batch 20100, loss[loss=0.2316, simple_loss=0.2926, pruned_loss=0.08536, over 21600.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3162, pruned_loss=0.08433, over 4274480.44 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:37:49,857 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.074e-02 2023-06-22 11:38:13,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-22 11:38:34,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1218522.0, ans=0.125 2023-06-22 11:38:36,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218522.0, ans=0.1 2023-06-22 11:38:46,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-22 11:39:12,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218642.0, ans=0.1 2023-06-22 11:39:26,421 INFO [train.py:996] (3/4) Epoch 7, batch 20150, loss[loss=0.2846, simple_loss=0.3519, pruned_loss=0.1086, over 21773.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3244, pruned_loss=0.08758, over 4271925.54 frames. ], batch size: 441, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:39:29,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-22 11:39:45,579 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:39:47,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218762.0, ans=0.1 2023-06-22 11:39:49,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1218762.0, ans=0.2 2023-06-22 11:40:20,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1218822.0, ans=0.125 2023-06-22 11:40:28,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 4.003e+02 4.787e+02 6.267e+02 1.040e+03, threshold=9.575e+02, percent-clipped=17.0 2023-06-22 11:41:15,541 INFO [train.py:996] (3/4) Epoch 7, batch 20200, loss[loss=0.256, simple_loss=0.3907, pruned_loss=0.06063, over 19913.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3289, pruned_loss=0.09008, over 4261767.21 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:41:32,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-22 11:41:35,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1219002.0, ans=0.125 2023-06-22 11:41:44,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-22 11:41:58,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1219122.0, ans=0.2 2023-06-22 11:42:04,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1219122.0, ans=0.125 2023-06-22 11:42:30,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-22 11:43:00,754 INFO [train.py:996] (3/4) Epoch 7, batch 20250, loss[loss=0.2699, simple_loss=0.3335, pruned_loss=0.1031, over 21794.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3288, pruned_loss=0.0878, over 4254967.70 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:43:11,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1219302.0, ans=10.0 2023-06-22 11:43:21,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-22 11:43:25,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-22 11:43:32,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-06-22 11:43:41,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1219422.0, ans=0.125 2023-06-22 11:43:51,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219422.0, ans=0.1 2023-06-22 11:43:54,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.109e+02 3.852e+02 4.558e+02 1.289e+03, threshold=7.704e+02, percent-clipped=1.0 2023-06-22 11:43:56,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1219422.0, ans=0.2 2023-06-22 11:44:02,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1219482.0, ans=0.125 2023-06-22 11:44:37,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1219542.0, ans=0.035 2023-06-22 11:44:40,568 INFO [train.py:996] (3/4) Epoch 7, batch 20300, loss[loss=0.2298, simple_loss=0.3144, pruned_loss=0.07257, over 21731.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3253, pruned_loss=0.08406, over 4255055.58 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:44:40,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219602.0, ans=0.1 2023-06-22 11:45:07,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1219662.0, ans=0.125 2023-06-22 11:45:31,947 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:45:47,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1219782.0, ans=0.04949747468305833 2023-06-22 11:46:12,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1219842.0, ans=0.05 2023-06-22 11:46:18,486 INFO [train.py:996] (3/4) Epoch 7, batch 20350, loss[loss=0.2537, simple_loss=0.3145, pruned_loss=0.09645, over 21324.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.326, pruned_loss=0.08517, over 4254620.91 frames. ], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:46:23,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219902.0, ans=0.125 2023-06-22 11:47:02,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-22 11:47:11,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.199e+02 3.639e+02 4.659e+02 8.452e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 11:47:55,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0 2023-06-22 11:47:58,703 INFO [train.py:996] (3/4) Epoch 7, batch 20400, loss[loss=0.2957, simple_loss=0.3635, pruned_loss=0.1139, over 21895.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3293, pruned_loss=0.08854, over 4253321.00 frames. 
], batch size: 371, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:48:10,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220202.0, ans=0.1 2023-06-22 11:48:14,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1220202.0, ans=0.2 2023-06-22 11:48:38,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1220322.0, ans=0.2 2023-06-22 11:49:06,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1220382.0, ans=0.125 2023-06-22 11:49:17,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220442.0, ans=0.1 2023-06-22 11:49:43,950 INFO [train.py:996] (3/4) Epoch 7, batch 20450, loss[loss=0.2858, simple_loss=0.3478, pruned_loss=0.1119, over 21511.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.331, pruned_loss=0.09111, over 4254332.50 frames. ], batch size: 131, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:49:47,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1220502.0, ans=0.125 2023-06-22 11:50:19,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-22 11:50:30,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.380e+02 3.850e+02 4.870e+02 7.513e+02, threshold=7.700e+02, percent-clipped=1.0 2023-06-22 11:51:05,922 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:51:16,538 INFO [train.py:996] (3/4) Epoch 7, batch 20500, loss[loss=0.2426, simple_loss=0.3001, pruned_loss=0.09252, over 21355.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.326, pruned_loss=0.09078, over 4258266.88 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:53:01,599 INFO [train.py:996] (3/4) Epoch 7, batch 20550, loss[loss=0.198, simple_loss=0.316, pruned_loss=0.04004, over 19824.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.319, pruned_loss=0.0887, over 4261164.26 frames. ], batch size: 703, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:53:06,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1221102.0, ans=0.125 2023-06-22 11:53:51,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.265e+02 4.144e+02 5.422e+02 9.318e+02, threshold=8.288e+02, percent-clipped=6.0 2023-06-22 11:54:14,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1221342.0, ans=0.125 2023-06-22 11:54:16,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1221342.0, ans=0.125 2023-06-22 11:54:40,925 INFO [train.py:996] (3/4) Epoch 7, batch 20600, loss[loss=0.2544, simple_loss=0.3241, pruned_loss=0.09237, over 21293.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3212, pruned_loss=0.08733, over 4239526.63 frames. 
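The scaling.py:1052 WithLoss entries report a loss-sum for individual self_attn_weights tensors, almost always 0.000e+00 here with occasional small nonzero values (loss-sum=1.074e-02 a little earlier in this stretch). The shape of the mechanism is an auxiliary penalty attached to an intermediate activation: the tensor passes through unchanged, a scalar penalty flows into the training objective, and its running sum is logged. The helper below sketches that wiring; the margin penalty in the example is purely illustrative and is not what scaling.py actually penalizes.

    import torch

    class AuxLossLogger:
        """Attach a small auxiliary penalty to an intermediate tensor.

        The tensor passes through unchanged; `penalty_fn` produces an extra
        scalar the caller adds to the training loss.  A running sum is kept
        so it can be reported like the `loss-sum=...` log fields.
        """

        def __init__(self, name: str, penalty_fn, scale: float = 1.0):
            self.name = name
            self.penalty_fn = penalty_fn
            self.scale = scale
            self.loss_sum = 0.0

        def __call__(self, x: torch.Tensor):
            penalty = self.scale * self.penalty_fn(x)
            self.loss_sum += float(penalty.detach())
            return x, penalty

    # Illustrative penalty: discourage attention weights from saturating at
    # exactly 0 or 1.  It stays at zero while the weights remain inside the
    # margin, which would reproduce the frequent loss-sum=0.000e+00 entries.
    attn_aux = AuxLossLogger(
        "self_attn_weights",
        lambda w: (w - 0.999).clamp(min=0.0).sum()
        + (0.001 - w).clamp(min=0.0).sum(),
    )
    weights = torch.softmax(torch.randn(4, 8, 8), dim=-1)
    weights, aux = attn_aux(weights)   # add `aux` to the training loss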
], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:55:52,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1221642.0, ans=0.125 2023-06-22 11:56:08,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1221642.0, ans=0.0 2023-06-22 11:56:19,425 INFO [train.py:996] (3/4) Epoch 7, batch 20650, loss[loss=0.2066, simple_loss=0.2795, pruned_loss=0.06681, over 21650.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3182, pruned_loss=0.08746, over 4248652.39 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:56:21,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1221702.0, ans=0.125 2023-06-22 11:56:27,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1221702.0, ans=0.125 2023-06-22 11:56:32,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1221702.0, ans=0.125 2023-06-22 11:56:51,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1221762.0, ans=0.0 2023-06-22 11:56:59,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1221822.0, ans=0.125 2023-06-22 11:57:01,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221822.0, ans=0.1 2023-06-22 11:57:08,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.307e+02 4.027e+02 4.834e+02 1.059e+03, threshold=8.054e+02, percent-clipped=3.0 2023-06-22 11:57:58,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-22 11:57:59,182 INFO [train.py:996] (3/4) Epoch 7, batch 20700, loss[loss=0.266, simple_loss=0.3324, pruned_loss=0.09981, over 20063.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3124, pruned_loss=0.08502, over 4245503.80 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:58:41,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1222122.0, ans=0.125 2023-06-22 11:58:49,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222122.0, ans=0.0 2023-06-22 11:59:20,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1222242.0, ans=0.0 2023-06-22 11:59:41,092 INFO [train.py:996] (3/4) Epoch 7, batch 20750, loss[loss=0.2777, simple_loss=0.4049, pruned_loss=0.07527, over 20855.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3146, pruned_loss=0.08457, over 4245452.80 frames. 
], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:00:07,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1222362.0, ans=0.125 2023-06-22 12:00:31,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1222422.0, ans=0.0 2023-06-22 12:00:33,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1222422.0, ans=0.125 2023-06-22 12:00:36,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.483e+02 4.528e+02 6.877e+02 1.317e+03, threshold=9.056e+02, percent-clipped=16.0 2023-06-22 12:01:26,496 INFO [train.py:996] (3/4) Epoch 7, batch 20800, loss[loss=0.2053, simple_loss=0.2761, pruned_loss=0.06724, over 21833.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3159, pruned_loss=0.08501, over 4251206.23 frames. ], batch size: 318, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:01:52,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1222662.0, ans=0.0 2023-06-22 12:02:45,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1222842.0, ans=10.0 2023-06-22 12:02:56,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1222842.0, ans=0.125 2023-06-22 12:03:02,322 INFO [train.py:996] (3/4) Epoch 7, batch 20850, loss[loss=0.1832, simple_loss=0.2561, pruned_loss=0.05519, over 21572.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3112, pruned_loss=0.08369, over 4252884.57 frames. ], batch size: 263, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:03:37,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1223022.0, ans=0.2 2023-06-22 12:04:01,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.647e+02 5.072e+02 6.568e+02 1.337e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-22 12:04:07,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1223082.0, ans=0.0 2023-06-22 12:04:45,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1223202.0, ans=0.02 2023-06-22 12:04:46,060 INFO [train.py:996] (3/4) Epoch 7, batch 20900, loss[loss=0.2833, simple_loss=0.4066, pruned_loss=0.08005, over 19714.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3112, pruned_loss=0.08358, over 4251983.31 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:05:51,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1223382.0, ans=0.0 2023-06-22 12:06:19,589 INFO [train.py:996] (3/4) Epoch 7, batch 20950, loss[loss=0.2474, simple_loss=0.3172, pruned_loss=0.08878, over 21788.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3079, pruned_loss=0.08063, over 4253120.50 frames. 
], batch size: 414, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:07:15,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.036e+02 3.516e+02 4.387e+02 8.628e+02, threshold=7.032e+02, percent-clipped=0.0 2023-06-22 12:07:47,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1223742.0, ans=0.125 2023-06-22 12:07:57,790 INFO [train.py:996] (3/4) Epoch 7, batch 21000, loss[loss=0.2234, simple_loss=0.2961, pruned_loss=0.07532, over 21870.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3052, pruned_loss=0.08066, over 4262032.52 frames. ], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:07:57,790 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 12:08:15,950 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2689, simple_loss=0.3672, pruned_loss=0.08525, over 1796401.00 frames. 2023-06-22 12:08:15,951 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 12:08:39,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1223862.0, ans=0.2 2023-06-22 12:09:13,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-22 12:09:36,984 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.81 vs. limit=5.0 2023-06-22 12:09:39,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224042.0, ans=0.1 2023-06-22 12:09:39,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1224042.0, ans=0.125 2023-06-22 12:09:54,853 INFO [train.py:996] (3/4) Epoch 7, batch 21050, loss[loss=0.2322, simple_loss=0.2939, pruned_loss=0.08527, over 21613.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3025, pruned_loss=0.08113, over 4264110.64 frames. ], batch size: 282, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:09:55,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1224102.0, ans=0.125 2023-06-22 12:10:49,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.015e+02 3.354e+02 4.094e+02 5.427e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:10:58,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1224282.0, ans=0.04949747468305833 2023-06-22 12:11:33,479 INFO [train.py:996] (3/4) Epoch 7, batch 21100, loss[loss=0.2321, simple_loss=0.2806, pruned_loss=0.09181, over 21241.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.298, pruned_loss=0.0798, over 4258326.21 frames. ], batch size: 471, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:11:44,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1224402.0, ans=0.125 2023-06-22 12:13:04,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1224642.0, ans=0.0 2023-06-22 12:13:07,716 INFO [train.py:996] (3/4) Epoch 7, batch 21150, loss[loss=0.2339, simple_loss=0.3019, pruned_loss=0.08297, over 15031.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2953, pruned_loss=0.07986, over 4251302.41 frames. 
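At batch 21000 above the trainer pauses for validation: the dev loss is reported over 1796401.00 frames, the same fixed frame count as at every other validation pass in this log, so the dev set is traversed in full with frames-weighted averaging, and the CUDA allocator's high-water mark is printed afterwards. A sketch of such a pass, with model, dev_loader and compute_loss as placeholders:

    import torch

    def validate(model, dev_loader, compute_loss, device):
        """Full pass over the dev set: frames-weighted averages + memory mark."""
        model.eval()
        tot = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
        tot_frames = 0.0
        with torch.no_grad():
            for batch in dev_loader:
                losses, frames = compute_loss(model, batch)  # per-batch sums
                for k in tot:
                    tot[k] += float(losses[k])
                tot_frames += frames
        model.train()
        body = ", ".join(f"{k}={v / tot_frames:.4g}" for k, v in tot.items())
        print(f"validation: {body}, over {tot_frames:.2f} frames.")
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")

The memory line is cumulative (max_memory_allocated is never reset here), which is why it only ever creeps upward across the run, 24463MB at this point versus the low 23000s near the start.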
], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:13:34,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-22 12:13:37,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-22 12:13:52,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1224822.0, ans=0.0 2023-06-22 12:14:08,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.198e+02 3.741e+02 4.699e+02 9.376e+02, threshold=7.483e+02, percent-clipped=8.0 2023-06-22 12:14:31,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-22 12:14:46,324 INFO [train.py:996] (3/4) Epoch 7, batch 21200, loss[loss=0.2199, simple_loss=0.2781, pruned_loss=0.08088, over 21892.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2914, pruned_loss=0.07956, over 4245180.39 frames. ], batch size: 107, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:15:05,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-22 12:16:30,834 INFO [train.py:996] (3/4) Epoch 7, batch 21250, loss[loss=0.2618, simple_loss=0.3282, pruned_loss=0.09765, over 21601.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2918, pruned_loss=0.07999, over 4245712.34 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:16:40,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1225302.0, ans=0.0 2023-06-22 12:16:45,597 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:17:27,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.297e+02 3.945e+02 5.021e+02 1.062e+03, threshold=7.890e+02, percent-clipped=7.0 2023-06-22 12:17:34,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1225482.0, ans=0.125 2023-06-22 12:17:53,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1225542.0, ans=0.125 2023-06-22 12:18:03,956 INFO [train.py:996] (3/4) Epoch 7, batch 21300, loss[loss=0.2357, simple_loss=0.3047, pruned_loss=0.08332, over 21912.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.2997, pruned_loss=0.08299, over 4241504.59 frames. ], batch size: 316, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:18:41,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1225662.0, ans=0.125 2023-06-22 12:19:02,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1225722.0, ans=0.125 2023-06-22 12:19:47,786 INFO [train.py:996] (3/4) Epoch 7, batch 21350, loss[loss=0.2224, simple_loss=0.3133, pruned_loss=0.06579, over 21776.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3034, pruned_loss=0.08318, over 4250592.92 frames. 
], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:20:00,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1225902.0, ans=0.07 2023-06-22 12:20:06,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-06-22 12:20:09,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 12:20:45,049 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.191e+02 3.567e+02 4.757e+02 8.464e+02, threshold=7.133e+02, percent-clipped=1.0 2023-06-22 12:21:10,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226142.0, ans=0.1 2023-06-22 12:21:26,933 INFO [train.py:996] (3/4) Epoch 7, batch 21400, loss[loss=0.2183, simple_loss=0.2857, pruned_loss=0.07543, over 20142.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3054, pruned_loss=0.08218, over 4255798.18 frames. ], batch size: 702, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:22:08,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-22 12:22:23,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1226382.0, ans=0.1 2023-06-22 12:22:55,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1226442.0, ans=0.0 2023-06-22 12:23:06,016 INFO [train.py:996] (3/4) Epoch 7, batch 21450, loss[loss=0.2346, simple_loss=0.312, pruned_loss=0.07863, over 20649.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3095, pruned_loss=0.08456, over 4260237.69 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:23:19,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226502.0, ans=0.1 2023-06-22 12:24:01,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1226682.0, ans=0.0 2023-06-22 12:24:03,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1226682.0, ans=10.0 2023-06-22 12:24:04,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.251e+02 3.638e+02 4.479e+02 7.872e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 12:24:18,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1226682.0, ans=0.125 2023-06-22 12:24:45,007 INFO [train.py:996] (3/4) Epoch 7, batch 21500, loss[loss=0.246, simple_loss=0.3029, pruned_loss=0.0946, over 21399.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3086, pruned_loss=0.08548, over 4265978.58 frames. 
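The grad_scale field in the loss records moves only in powers of two: 16.0 through batch 21350, 8.0 by batch 21400 here, then back to 16.0 at batch 21600 and 32.0 later in this section. That is the standard fp16 dynamic loss-scaling pattern (as in torch.cuda.amp.GradScaler): halve the scale when a step sees inf/nan gradients, double it again after a long run of clean steps. A toy version of just that control logic, not the trainer's actual code:

    class DynamicLossScale:
        def __init__(self, scale: float = 16.0, growth_interval: int = 2000):
            self.scale = scale
            self.growth_interval = growth_interval
            self._clean_steps = 0

        def update(self, found_inf: bool) -> None:
            if found_inf:
                self.scale /= 2.0               # overflow: back off immediately
                self._clean_steps = 0
            else:
                self._clean_steps += 1
                if self._clean_steps >= self.growth_interval:
                    self.scale *= 2.0           # long clean run: try a larger scale
                    self._clean_steps = 0

    scaler = DynamicLossScale(scale=16.0)
    scaler.update(found_inf=True)
    print(scaler.scale)   # 8.0, the drop seen between batches 21350 and 21400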
], batch size: 131, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:24:58,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1226802.0, ans=0.0 2023-06-22 12:25:21,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1226862.0, ans=0.2 2023-06-22 12:25:31,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-22 12:26:03,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-22 12:26:20,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-22 12:26:21,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-22 12:26:22,140 INFO [train.py:996] (3/4) Epoch 7, batch 21550, loss[loss=0.191, simple_loss=0.2654, pruned_loss=0.05831, over 21769.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3019, pruned_loss=0.08315, over 4269182.09 frames. ], batch size: 124, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:26:34,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1227102.0, ans=0.0 2023-06-22 12:26:46,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1227162.0, ans=0.125 2023-06-22 12:26:56,684 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:27:04,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1227222.0, ans=0.0 2023-06-22 12:27:22,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.394e+02 4.273e+02 5.102e+02 8.166e+02, threshold=8.546e+02, percent-clipped=3.0 2023-06-22 12:28:03,002 INFO [train.py:996] (3/4) Epoch 7, batch 21600, loss[loss=0.2347, simple_loss=0.2919, pruned_loss=0.08871, over 21805.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2965, pruned_loss=0.08208, over 4277596.59 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:28:19,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1227402.0, ans=0.0 2023-06-22 12:28:23,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1227462.0, ans=0.0 2023-06-22 12:28:25,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1227462.0, ans=0.07 2023-06-22 12:28:33,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1227462.0, ans=0.125 2023-06-22 12:28:43,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1227462.0, ans=0.0 2023-06-22 12:29:44,720 INFO [train.py:996] (3/4) Epoch 7, batch 21650, loss[loss=0.2347, simple_loss=0.3152, pruned_loss=0.07706, over 21231.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2997, pruned_loss=0.08019, over 4273795.89 frames. 
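The Whitening records compare a measured whiteness statistic of a module's activations against a scheduled limit; presumably the associated penalty only engages once the metric exceeds the limit, which is why most logged metrics sit below it. As an assumed stand-in for the metric (not necessarily the exact formula in the scaling.py cited above), a standard whiteness score is trace(C^2) * d / trace(C)^2 for a d x d covariance C: it equals 1 exactly when C is a multiple of the identity and grows as the eigenvalue spectrum becomes lopsided:

    import torch

    def whiteness_metric(x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, num_channels); returns >= 1, == 1 iff cov(x) is isotropic.
        x = x - x.mean(dim=0, keepdim=True)
        cov = x.t() @ x / x.shape[0]
        d = cov.shape[0]
        return (cov @ cov).diagonal().sum() * d / cov.diagonal().sum() ** 2

    white = torch.randn(10000, 256)
    print(whiteness_metric(white))                                   # ~1: already white
    print(whiteness_metric(white * torch.linspace(0.01, 1.0, 256)))  # >> 1: lopsided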
], batch size: 176, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:30:12,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1227762.0, ans=0.2 2023-06-22 12:30:32,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1227822.0, ans=0.2 2023-06-22 12:30:48,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.272e+02 4.072e+02 5.244e+02 1.561e+03, threshold=8.145e+02, percent-clipped=7.0 2023-06-22 12:30:53,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1227882.0, ans=0.0 2023-06-22 12:30:53,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-22 12:31:00,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1227882.0, ans=0.2 2023-06-22 12:31:22,589 INFO [train.py:996] (3/4) Epoch 7, batch 21700, loss[loss=0.2138, simple_loss=0.2787, pruned_loss=0.0745, over 21606.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2987, pruned_loss=0.07746, over 4276173.43 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:32:00,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1228062.0, ans=0.0 2023-06-22 12:32:06,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1228122.0, ans=0.125 2023-06-22 12:32:23,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1228182.0, ans=0.125 2023-06-22 12:32:23,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1228182.0, ans=0.125 2023-06-22 12:32:58,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1228242.0, ans=0.125 2023-06-22 12:33:01,425 INFO [train.py:996] (3/4) Epoch 7, batch 21750, loss[loss=0.2616, simple_loss=0.3041, pruned_loss=0.1095, over 21240.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.296, pruned_loss=0.07748, over 4273753.67 frames. ], batch size: 471, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:33:36,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 12:34:00,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.122e+02 3.637e+02 4.917e+02 1.048e+03, threshold=7.274e+02, percent-clipped=3.0 2023-06-22 12:34:37,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1228542.0, ans=0.0 2023-06-22 12:34:40,467 INFO [train.py:996] (3/4) Epoch 7, batch 21800, loss[loss=0.2514, simple_loss=0.3227, pruned_loss=0.09003, over 21644.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2965, pruned_loss=0.0786, over 4276888.67 frames. 
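In every loss[...] record in this section, the headline loss is recovered by weighting the simple transducer loss by 0.5 and adding the pruned loss at full weight; for batch 21900 just above, 0.5 * 0.2797 + 0.07923 = 0.2191 as logged. A one-line check:

    def combined_loss(simple_loss: float, pruned_loss: float,
                      simple_scale: float = 0.5) -> float:
        # Weighting consistent with every loss[...] record in this log.
        return simple_scale * simple_loss + pruned_loss

    print(round(combined_loss(0.2797, 0.07923), 4))   # 0.2191, batch 21900 above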
], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:34:46,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228602.0, ans=0.1 2023-06-22 12:34:53,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-22 12:35:57,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1228782.0, ans=0.0 2023-06-22 12:35:58,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1228782.0, ans=0.0 2023-06-22 12:36:09,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1228842.0, ans=0.125 2023-06-22 12:36:20,673 INFO [train.py:996] (3/4) Epoch 7, batch 21850, loss[loss=0.259, simple_loss=0.3239, pruned_loss=0.09706, over 21764.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3003, pruned_loss=0.07918, over 4274178.69 frames. ], batch size: 112, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:37:07,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1229022.0, ans=0.07 2023-06-22 12:37:21,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2023-06-22 12:37:24,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.283e+02 3.846e+02 4.671e+02 1.030e+03, threshold=7.692e+02, percent-clipped=3.0 2023-06-22 12:37:34,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1229082.0, ans=0.0 2023-06-22 12:37:51,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 12:38:00,765 INFO [train.py:996] (3/4) Epoch 7, batch 21900, loss[loss=0.2191, simple_loss=0.2797, pruned_loss=0.07923, over 21481.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3027, pruned_loss=0.08117, over 4280156.30 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:38:08,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1229202.0, ans=0.125 2023-06-22 12:38:12,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1229202.0, ans=0.125 2023-06-22 12:38:28,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. 
limit=15.0 2023-06-22 12:39:14,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1229382.0, ans=0.0 2023-06-22 12:39:28,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1229442.0, ans=0.125 2023-06-22 12:39:30,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1229442.0, ans=0.125 2023-06-22 12:39:37,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1229502.0, ans=0.125 2023-06-22 12:39:44,757 INFO [train.py:996] (3/4) Epoch 7, batch 21950, loss[loss=0.2164, simple_loss=0.3297, pruned_loss=0.05161, over 20906.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2978, pruned_loss=0.07941, over 4272115.01 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:40:42,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229682.0, ans=0.1 2023-06-22 12:40:48,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.971e+02 3.645e+02 4.413e+02 9.727e+02, threshold=7.291e+02, percent-clipped=1.0 2023-06-22 12:41:02,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1229742.0, ans=0.125 2023-06-22 12:41:24,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 12:41:24,798 INFO [train.py:996] (3/4) Epoch 7, batch 22000, loss[loss=0.2297, simple_loss=0.2896, pruned_loss=0.08488, over 21963.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2914, pruned_loss=0.07612, over 4269183.16 frames. ], batch size: 103, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:41:35,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1229802.0, ans=0.0 2023-06-22 12:41:43,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-22 12:42:27,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1229982.0, ans=0.0 2023-06-22 12:43:11,540 INFO [train.py:996] (3/4) Epoch 7, batch 22050, loss[loss=0.2655, simple_loss=0.3573, pruned_loss=0.0869, over 21848.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2976, pruned_loss=0.07788, over 4257917.34 frames. ], batch size: 372, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:44:04,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230222.0, ans=0.1 2023-06-22 12:44:09,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-22 12:44:14,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.765e+02 5.011e+02 6.386e+02 1.691e+03, threshold=1.002e+03, percent-clipped=17.0 2023-06-22 12:44:52,488 INFO [train.py:996] (3/4) Epoch 7, batch 22100, loss[loss=0.2862, simple_loss=0.3473, pruned_loss=0.1126, over 21525.00 frames. 
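The tot_loss figures are reported "over N frames" with N hovering around 4.26M rather than growing without bound, so they read as frame-weighted averages over a recent, periodically reset window of batches. A sketch under that assumption:

    class FrameWeightedLoss:
        # Running frame-weighted average in the "tot_loss ... over N frames" style.
        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum += loss * num_frames
            self.frames += num_frames

        def reset(self) -> None:                  # assumed periodic reset
            self.loss_sum, self.frames = 0.0, 0.0

        @property
        def tot_loss(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tracker = FrameWeightedLoss()
    tracker.update(0.2234, 21870.0)   # per-batch numbers from batch 21000 above
    tracker.update(0.2862, 21525.0)   # ... and from batch 22100
    print(round(tracker.tot_loss, 4), tracker.frames)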
], tot_loss[loss=0.2364, simple_loss=0.3076, pruned_loss=0.08263, over 4252576.01 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:45:07,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1230462.0, ans=0.125 2023-06-22 12:46:18,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1230642.0, ans=0.125 2023-06-22 12:46:30,169 INFO [train.py:996] (3/4) Epoch 7, batch 22150, loss[loss=0.2562, simple_loss=0.3236, pruned_loss=0.09441, over 21892.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3095, pruned_loss=0.08371, over 4265480.91 frames. ], batch size: 107, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:46:53,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1230762.0, ans=0.125 2023-06-22 12:47:29,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.676e+02 4.235e+02 5.035e+02 1.205e+03, threshold=8.469e+02, percent-clipped=1.0 2023-06-22 12:47:52,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1230942.0, ans=0.125 2023-06-22 12:48:02,746 INFO [train.py:996] (3/4) Epoch 7, batch 22200, loss[loss=0.2836, simple_loss=0.3533, pruned_loss=0.1069, over 21786.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3135, pruned_loss=0.08562, over 4269529.45 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:48:39,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1231062.0, ans=0.125 2023-06-22 12:49:06,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1231182.0, ans=0.0 2023-06-22 12:49:07,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1231182.0, ans=0.125 2023-06-22 12:49:48,290 INFO [train.py:996] (3/4) Epoch 7, batch 22250, loss[loss=0.2931, simple_loss=0.3518, pruned_loss=0.1172, over 21482.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3207, pruned_loss=0.08731, over 4272261.00 frames. 
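A large share of the scheduled scalars are *_skip_rate values (attention_skip_rate, conv_skip_rate, ff2_skip_rate, ...), all at ans=0.0 by this stage, which suggests sub-modules are skipped stochastically with the scheduled probability early in training — a stochastic-depth-style regularizer that has annealed away by epoch 7. A generic illustration of the idea, not the zipformer's internal implementation:

    import torch
    import torch.nn as nn

    class SkipWrapper(nn.Module):
        # With probability skip_rate (training only), bypass the wrapped module.
        def __init__(self, module: nn.Module, skip_rate: float = 0.0):
            super().__init__()
            self.module = module
            self.skip_rate = skip_rate

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training and torch.rand(()) < self.skip_rate:
                return x                         # skip the sub-module this step
            return x + self.module(x)            # residual application

    layer = SkipWrapper(nn.Linear(256, 256), skip_rate=0.0)  # annealed to 0.0 above
    print(layer(torch.randn(4, 256)).shape)      # torch.Size([4, 256])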
], batch size: 211, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:49:50,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1231302.0, ans=0.0 2023-06-22 12:50:12,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1231362.0, ans=0.0 2023-06-22 12:50:33,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1231422.0, ans=0.0 2023-06-22 12:50:43,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231422.0, ans=0.1 2023-06-22 12:50:46,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231482.0, ans=0.1 2023-06-22 12:50:49,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1231482.0, ans=0.0 2023-06-22 12:50:50,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.545e+02 4.219e+02 5.859e+02 1.258e+03, threshold=8.437e+02, percent-clipped=3.0 2023-06-22 12:51:20,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1231542.0, ans=0.125 2023-06-22 12:51:29,293 INFO [train.py:996] (3/4) Epoch 7, batch 22300, loss[loss=0.2272, simple_loss=0.2869, pruned_loss=0.08379, over 21491.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.321, pruned_loss=0.08933, over 4275717.37 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:52:10,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1231722.0, ans=0.125 2023-06-22 12:52:18,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1231722.0, ans=0.125 2023-06-22 12:52:37,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1231782.0, ans=0.125 2023-06-22 12:52:49,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1231842.0, ans=0.0 2023-06-22 12:52:59,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1231842.0, ans=0.0 2023-06-22 12:53:10,189 INFO [train.py:996] (3/4) Epoch 7, batch 22350, loss[loss=0.2413, simple_loss=0.3014, pruned_loss=0.09063, over 21554.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3184, pruned_loss=0.08997, over 4288024.10 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:53:12,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-22 12:53:20,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1231902.0, ans=0.125 2023-06-22 12:53:42,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. 
limit=22.5 2023-06-22 12:54:17,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.296e+02 3.742e+02 4.441e+02 8.144e+02, threshold=7.483e+02, percent-clipped=0.0 2023-06-22 12:54:46,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1232142.0, ans=0.125 2023-06-22 12:54:50,708 INFO [train.py:996] (3/4) Epoch 7, batch 22400, loss[loss=0.2324, simple_loss=0.2982, pruned_loss=0.08333, over 21698.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3147, pruned_loss=0.08673, over 4285681.91 frames. ], batch size: 112, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:54:53,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 12:55:17,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-22 12:55:23,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1232262.0, ans=0.125 2023-06-22 12:56:01,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1232382.0, ans=0.2 2023-06-22 12:56:03,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1232382.0, ans=0.0 2023-06-22 12:56:30,500 INFO [train.py:996] (3/4) Epoch 7, batch 22450, loss[loss=0.2478, simple_loss=0.302, pruned_loss=0.09677, over 21797.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3095, pruned_loss=0.08656, over 4275482.63 frames. ], batch size: 98, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:56:49,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.11 vs. limit=15.0 2023-06-22 12:57:14,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1232622.0, ans=0.125 2023-06-22 12:57:16,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1232622.0, ans=0.125 2023-06-22 12:57:32,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=8.0 2023-06-22 12:57:34,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.003e+02 3.355e+02 3.883e+02 5.692e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:58:16,665 INFO [train.py:996] (3/4) Epoch 7, batch 22500, loss[loss=0.208, simple_loss=0.2672, pruned_loss=0.07446, over 21932.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3046, pruned_loss=0.08541, over 4274009.18 frames. ], batch size: 113, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:58:17,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-22 12:58:39,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1232862.0, ans=0.125 2023-06-22 12:58:48,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.82 vs. 
limit=15.0 2023-06-22 12:58:52,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-22 12:59:18,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 12:59:29,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232982.0, ans=0.1 2023-06-22 13:00:01,038 INFO [train.py:996] (3/4) Epoch 7, batch 22550, loss[loss=0.2947, simple_loss=0.3512, pruned_loss=0.1191, over 21715.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.308, pruned_loss=0.08532, over 4284024.48 frames. ], batch size: 507, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:00:11,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1233102.0, ans=0.0 2023-06-22 13:00:37,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1233162.0, ans=0.125 2023-06-22 13:00:42,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1233222.0, ans=0.125 2023-06-22 13:01:06,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.865e+02 3.475e+02 4.180e+02 5.606e+02 1.235e+03, threshold=8.360e+02, percent-clipped=11.0 2023-06-22 13:01:31,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1233342.0, ans=0.125 2023-06-22 13:01:39,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1233342.0, ans=0.125 2023-06-22 13:01:45,197 INFO [train.py:996] (3/4) Epoch 7, batch 22600, loss[loss=0.1962, simple_loss=0.264, pruned_loss=0.06415, over 21425.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3116, pruned_loss=0.08574, over 4290443.89 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:02:20,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1233462.0, ans=0.0 2023-06-22 13:03:03,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1233582.0, ans=0.0 2023-06-22 13:03:14,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1233642.0, ans=0.125 2023-06-22 13:03:25,471 INFO [train.py:996] (3/4) Epoch 7, batch 22650, loss[loss=0.2023, simple_loss=0.2649, pruned_loss=0.06983, over 21761.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3088, pruned_loss=0.08521, over 4281609.90 frames. 
], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:03:36,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1233702.0, ans=0.125 2023-06-22 13:04:08,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1233822.0, ans=0.2 2023-06-22 13:04:14,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1233822.0, ans=0.125 2023-06-22 13:04:32,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 3.837e+02 4.775e+02 6.238e+02 8.753e+02, threshold=9.549e+02, percent-clipped=4.0 2023-06-22 13:04:54,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-22 13:05:01,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1233942.0, ans=0.125 2023-06-22 13:05:04,924 INFO [train.py:996] (3/4) Epoch 7, batch 22700, loss[loss=0.2083, simple_loss=0.2761, pruned_loss=0.07029, over 21803.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3044, pruned_loss=0.08479, over 4271874.20 frames. ], batch size: 102, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:05:14,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1234002.0, ans=0.125 2023-06-22 13:05:52,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1234122.0, ans=0.0 2023-06-22 13:06:03,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1234182.0, ans=0.2 2023-06-22 13:06:43,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1234242.0, ans=0.125 2023-06-22 13:06:46,562 INFO [train.py:996] (3/4) Epoch 7, batch 22750, loss[loss=0.2021, simple_loss=0.273, pruned_loss=0.06561, over 20794.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3072, pruned_loss=0.08754, over 4261958.50 frames. ], batch size: 607, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:07:15,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-22 13:07:42,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234482.0, ans=0.125 2023-06-22 13:07:53,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 3.471e+02 4.141e+02 5.452e+02 1.173e+03, threshold=8.282e+02, percent-clipped=2.0 2023-06-22 13:07:53,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1234482.0, ans=0.0 2023-06-22 13:08:25,194 INFO [train.py:996] (3/4) Epoch 7, batch 22800, loss[loss=0.2579, simple_loss=0.3256, pruned_loss=0.09509, over 21236.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3096, pruned_loss=0.0892, over 4273206.85 frames. ], batch size: 143, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:08:35,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=15.0 2023-06-22 13:08:37,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-22 13:08:58,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-22 13:08:59,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-22 13:09:22,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234782.0, ans=0.1 2023-06-22 13:09:53,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234842.0, ans=0.1 2023-06-22 13:10:04,262 INFO [train.py:996] (3/4) Epoch 7, batch 22850, loss[loss=0.2231, simple_loss=0.2869, pruned_loss=0.07968, over 21387.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3055, pruned_loss=0.08832, over 4278271.74 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:10:09,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1234902.0, ans=0.125 2023-06-22 13:10:14,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1234902.0, ans=0.0 2023-06-22 13:10:35,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1234962.0, ans=0.0 2023-06-22 13:11:09,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.461e+02 4.069e+02 5.005e+02 9.619e+02, threshold=8.139e+02, percent-clipped=3.0 2023-06-22 13:11:27,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1235142.0, ans=0.125 2023-06-22 13:11:38,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1235142.0, ans=0.125 2023-06-22 13:11:45,216 INFO [train.py:996] (3/4) Epoch 7, batch 22900, loss[loss=0.2062, simple_loss=0.2701, pruned_loss=0.07113, over 21853.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3058, pruned_loss=0.08725, over 4272770.54 frames. ], batch size: 107, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:11:48,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1235202.0, ans=0.025 2023-06-22 13:12:05,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1235262.0, ans=0.1 2023-06-22 13:12:56,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-22 13:13:29,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1235442.0, ans=0.125 2023-06-22 13:13:32,244 INFO [train.py:996] (3/4) Epoch 7, batch 22950, loss[loss=0.2313, simple_loss=0.348, pruned_loss=0.05728, over 21752.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.319, pruned_loss=0.08595, over 4274071.89 frames. 
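The balancer entries expose per-channel activation constraints: min_positive=0.025 and max_abs=10.0 bounds appear alongside a prob (typically 0.125) that plausibly controls how often the constraint is enforced, i.e. on a random fraction of steps, channels violating the bounds receive a corrective gradient. A measurement-only sketch of that reading, with every name an assumption:

    import torch

    def balancer_violations(x: torch.Tensor, min_positive: float = 0.025,
                            max_abs: float = 10.0):
        # x: (num_frames, num_channels). Flags channels that are almost never
        # positive, or whose mean |activation| exceeds max_abs.
        frac_positive = (x > 0).float().mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        return frac_positive < min_positive, mean_abs > max_abs

    low_pos, big_abs = balancer_violations(torch.randn(1000, 256))
    print(int(low_pos.sum()), int(big_abs.sum()))   # ~0 violations for white noise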
], batch size: 298, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:14:00,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-22 13:14:17,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1235622.0, ans=0.2 2023-06-22 13:14:37,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1235682.0, ans=0.2 2023-06-22 13:14:42,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.271e+02 4.396e+02 6.484e+02 1.017e+03, threshold=8.792e+02, percent-clipped=10.0 2023-06-22 13:14:42,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1235682.0, ans=0.0 2023-06-22 13:15:05,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1235742.0, ans=0.2 2023-06-22 13:15:08,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1235742.0, ans=0.125 2023-06-22 13:15:11,710 INFO [train.py:996] (3/4) Epoch 7, batch 23000, loss[loss=0.2217, simple_loss=0.2891, pruned_loss=0.07719, over 21656.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3171, pruned_loss=0.08321, over 4280284.82 frames. ], batch size: 230, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:15:59,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1235922.0, ans=0.2 2023-06-22 13:16:15,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235982.0, ans=0.1 2023-06-22 13:16:18,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1235982.0, ans=0.0 2023-06-22 13:16:36,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1236042.0, ans=0.2 2023-06-22 13:16:49,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1236042.0, ans=15.0 2023-06-22 13:16:51,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.68 vs. limit=5.0 2023-06-22 13:16:52,135 INFO [train.py:996] (3/4) Epoch 7, batch 23050, loss[loss=0.3191, simple_loss=0.3784, pruned_loss=0.1299, over 21807.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3189, pruned_loss=0.08539, over 4282886.09 frames. 
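The bypass.scale_min entries (ans=0.2 here) fit a learned per-layer bypass: each layer's output is interpolated with its input through a learned scale clamped from below by the scheduled scale_min, so a layer can be damped early in training but never switched off entirely. A guess at that mechanism, with names assumed:

    import torch
    import torch.nn as nn

    class Bypass(nn.Module):
        # y = x + s * (f(x) - x), with the learned s clamped to [scale_min, 1].
        def __init__(self, module: nn.Module, num_channels: int,
                     scale_min: float = 0.2):
            super().__init__()
            self.module = module
            self.scale_min = scale_min
            self.scale = nn.Parameter(torch.ones(num_channels))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            s = self.scale.clamp(min=self.scale_min, max=1.0)
            return x + s * (self.module(x) - x)

    blk = Bypass(nn.Linear(256, 256), num_channels=256)
    print(blk(torch.randn(4, 256)).shape)   # torch.Size([4, 256])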
], batch size: 441, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:17:15,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1236162.0, ans=0.125 2023-06-22 13:17:24,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236162.0, ans=0.1 2023-06-22 13:17:31,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1236162.0, ans=0.0 2023-06-22 13:17:47,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1236222.0, ans=0.0 2023-06-22 13:18:03,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.533e+02 3.591e+02 4.450e+02 5.560e+02 1.062e+03, threshold=8.900e+02, percent-clipped=1.0 2023-06-22 13:18:33,440 INFO [train.py:996] (3/4) Epoch 7, batch 23100, loss[loss=0.1944, simple_loss=0.2581, pruned_loss=0.06532, over 21268.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3158, pruned_loss=0.08531, over 4284357.55 frames. ], batch size: 159, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:18:39,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-22 13:18:56,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-22 13:19:36,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-22 13:19:37,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1236582.0, ans=0.125 2023-06-22 13:19:39,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1236582.0, ans=0.125 2023-06-22 13:20:11,895 INFO [train.py:996] (3/4) Epoch 7, batch 23150, loss[loss=0.2359, simple_loss=0.3076, pruned_loss=0.08217, over 21828.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3127, pruned_loss=0.08506, over 4281950.60 frames. 
], batch size: 414, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:20:16,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1236702.0, ans=0.125 2023-06-22 13:20:25,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236702.0, ans=0.1 2023-06-22 13:20:29,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1236702.0, ans=0.125 2023-06-22 13:20:37,020 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:20:57,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1236822.0, ans=0.125 2023-06-22 13:21:20,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236882.0, ans=0.1 2023-06-22 13:21:21,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.574e+02 4.225e+02 5.615e+02 9.377e+02, threshold=8.449e+02, percent-clipped=1.0 2023-06-22 13:21:50,740 INFO [train.py:996] (3/4) Epoch 7, batch 23200, loss[loss=0.2176, simple_loss=0.291, pruned_loss=0.07204, over 21896.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3118, pruned_loss=0.08602, over 4291430.26 frames. ], batch size: 371, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:22:02,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1237002.0, ans=0.125 2023-06-22 13:23:19,311 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:23:30,174 INFO [train.py:996] (3/4) Epoch 7, batch 23250, loss[loss=0.27, simple_loss=0.3356, pruned_loss=0.1021, over 21702.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3114, pruned_loss=0.08725, over 4294913.00 frames. ], batch size: 389, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:23:32,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1237302.0, ans=0.0 2023-06-22 13:23:38,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-22 13:23:53,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1237362.0, ans=0.0 2023-06-22 13:23:57,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. 
limit=22.5 2023-06-22 13:24:07,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1237362.0, ans=0.125 2023-06-22 13:24:11,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1237422.0, ans=0.025 2023-06-22 13:24:25,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1237422.0, ans=0.1 2023-06-22 13:24:46,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.587e+02 4.602e+02 6.286e+02 1.178e+03, threshold=9.205e+02, percent-clipped=7.0 2023-06-22 13:24:52,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1237482.0, ans=0.125 2023-06-22 13:25:04,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1237542.0, ans=0.125 2023-06-22 13:25:16,746 INFO [train.py:996] (3/4) Epoch 7, batch 23300, loss[loss=0.2011, simple_loss=0.2649, pruned_loss=0.06868, over 21155.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3169, pruned_loss=0.08842, over 4290146.54 frames. ], batch size: 608, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:25:39,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237662.0, ans=0.1 2023-06-22 13:26:33,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1237782.0, ans=0.2 2023-06-22 13:26:52,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 13:26:57,947 INFO [train.py:996] (3/4) Epoch 7, batch 23350, loss[loss=0.2327, simple_loss=0.3144, pruned_loss=0.07546, over 21604.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.321, pruned_loss=0.08792, over 4281380.05 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:28:09,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.411e+02 4.317e+02 5.452e+02 1.291e+03, threshold=8.634e+02, percent-clipped=4.0 2023-06-22 13:28:32,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1238142.0, ans=0.125 2023-06-22 13:28:38,100 INFO [train.py:996] (3/4) Epoch 7, batch 23400, loss[loss=0.2085, simple_loss=0.2825, pruned_loss=0.06727, over 21933.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3141, pruned_loss=0.08379, over 4285406.18 frames. 
], batch size: 316, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:29:06,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238262.0, ans=0.1 2023-06-22 13:29:19,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1238322.0, ans=0.2 2023-06-22 13:29:22,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1238322.0, ans=0.015 2023-06-22 13:29:59,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1238382.0, ans=0.125 2023-06-22 13:30:12,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1238442.0, ans=0.0 2023-06-22 13:30:12,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1238442.0, ans=0.2 2023-06-22 13:30:24,426 INFO [train.py:996] (3/4) Epoch 7, batch 23450, loss[loss=0.2759, simple_loss=0.3375, pruned_loss=0.1071, over 21291.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3151, pruned_loss=0.08578, over 4279230.13 frames. ], batch size: 176, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:30:35,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-22 13:30:40,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1238502.0, ans=0.125 2023-06-22 13:31:03,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:31:34,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.929e+02 4.982e+02 6.736e+02 9.588e+02, threshold=9.965e+02, percent-clipped=2.0 2023-06-22 13:31:44,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1238742.0, ans=0.07 2023-06-22 13:31:52,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.69 vs. limit=15.0 2023-06-22 13:32:06,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238802.0, ans=0.1 2023-06-22 13:32:07,433 INFO [train.py:996] (3/4) Epoch 7, batch 23500, loss[loss=0.2785, simple_loss=0.3339, pruned_loss=0.1115, over 21858.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3168, pruned_loss=0.08687, over 4279700.80 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:33:22,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1239042.0, ans=0.125 2023-06-22 13:33:26,047 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:33:27,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1239042.0, ans=0.0 2023-06-22 13:33:48,029 INFO [train.py:996] (3/4) Epoch 7, batch 23550, loss[loss=0.2049, simple_loss=0.2618, pruned_loss=0.07405, over 21518.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3117, pruned_loss=0.08698, over 4285891.66 frames. 
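The WithLoss lines print the accumulated auxiliary penalty attached to the self_attn_weights modules; loss-sum=0.000e+00 simply means the penalty has not fired in the current reporting window. A minimal accumulate-and-report wrapper, as one assumed reading of the mechanism:

    import torch

    class AuxLossTracker:
        # Accumulate an auxiliary penalty for a named module; report its sum.
        def __init__(self, name: str):
            self.name = name
            self.loss_sum = 0.0

        def add(self, penalty: torch.Tensor) -> None:
            self.loss_sum += float(penalty.detach())

        def report(self) -> str:
            return f"WithLoss: name={self.name}, loss-sum={self.loss_sum:.3e}"

    t = AuxLossTracker("encoder.encoders.4.encoder.layers.1.self_attn_weights")
    t.add(torch.tensor(0.0))
    print(t.report())   # WithLoss: name=..., loss-sum=0.000e+00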
], batch size: 231, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:34:00,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1239102.0, ans=0.125 2023-06-22 13:34:01,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1239102.0, ans=0.125 2023-06-22 13:34:55,616 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.824e+02 3.361e+02 3.874e+02 4.874e+02 9.234e+02, threshold=7.748e+02, percent-clipped=0.0 2023-06-22 13:34:56,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1239282.0, ans=0.125 2023-06-22 13:35:04,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1239342.0, ans=0.5 2023-06-22 13:35:29,830 INFO [train.py:996] (3/4) Epoch 7, batch 23600, loss[loss=0.2483, simple_loss=0.3233, pruned_loss=0.08666, over 21439.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.312, pruned_loss=0.08821, over 4280332.62 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:35:39,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1239402.0, ans=0.2 2023-06-22 13:35:42,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1239402.0, ans=0.125 2023-06-22 13:35:45,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-22 13:36:23,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-22 13:36:41,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239582.0, ans=0.1 2023-06-22 13:37:12,149 INFO [train.py:996] (3/4) Epoch 7, batch 23650, loss[loss=0.2631, simple_loss=0.3324, pruned_loss=0.09691, over 21301.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3125, pruned_loss=0.0862, over 4279572.51 frames. ], batch size: 159, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:37:39,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-22 13:37:49,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239762.0, ans=0.1 2023-06-22 13:38:00,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1239822.0, ans=0.125 2023-06-22 13:38:29,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 3.692e+02 5.064e+02 6.588e+02 1.428e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-22 13:38:53,660 INFO [train.py:996] (3/4) Epoch 7, batch 23700, loss[loss=0.2126, simple_loss=0.3017, pruned_loss=0.0618, over 21626.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3153, pruned_loss=0.08564, over 4283564.09 frames. 
], batch size: 247, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:40:10,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1240182.0, ans=0.125 2023-06-22 13:40:39,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240302.0, ans=0.125 2023-06-22 13:40:40,722 INFO [train.py:996] (3/4) Epoch 7, batch 23750, loss[loss=0.207, simple_loss=0.3066, pruned_loss=0.05369, over 21800.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.317, pruned_loss=0.08564, over 4285263.43 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:40:54,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1240302.0, ans=0.125 2023-06-22 13:40:59,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-22 13:41:29,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1240422.0, ans=0.2 2023-06-22 13:41:47,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.292e+02 4.228e+02 5.463e+02 1.067e+03, threshold=8.456e+02, percent-clipped=1.0 2023-06-22 13:42:23,012 INFO [train.py:996] (3/4) Epoch 7, batch 23800, loss[loss=0.2474, simple_loss=0.3161, pruned_loss=0.08932, over 21359.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3172, pruned_loss=0.08444, over 4273341.18 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:42:33,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1240602.0, ans=0.2 2023-06-22 13:43:51,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.07 vs. limit=6.0 2023-06-22 13:43:56,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1240842.0, ans=0.0 2023-06-22 13:44:04,742 INFO [train.py:996] (3/4) Epoch 7, batch 23850, loss[loss=0.3109, simple_loss=0.3745, pruned_loss=0.1237, over 21427.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3259, pruned_loss=0.08693, over 4272519.63 frames. ], batch size: 471, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:44:50,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1241022.0, ans=0.0 2023-06-22 13:44:55,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241022.0, ans=0.125 2023-06-22 13:45:23,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.822e+02 3.704e+02 4.399e+02 5.519e+02 1.068e+03, threshold=8.797e+02, percent-clipped=5.0 2023-06-22 13:45:47,903 INFO [train.py:996] (3/4) Epoch 7, batch 23900, loss[loss=0.2741, simple_loss=0.3438, pruned_loss=0.1022, over 21815.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3324, pruned_loss=0.08942, over 4278483.68 frames. 
], batch size: 98, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:46:14,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1241262.0, ans=0.0 2023-06-22 13:46:39,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1241322.0, ans=0.025 2023-06-22 13:46:39,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-22 13:47:30,440 INFO [train.py:996] (3/4) Epoch 7, batch 23950, loss[loss=0.2588, simple_loss=0.3225, pruned_loss=0.09757, over 16097.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3256, pruned_loss=0.08893, over 4271782.76 frames. ], batch size: 62, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:47:49,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-22 13:48:07,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1241562.0, ans=0.125 2023-06-22 13:48:11,842 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:48:13,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1241622.0, ans=0.0 2023-06-22 13:48:47,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.707e+02 3.506e+02 4.327e+02 5.535e+02 8.905e+02, threshold=8.653e+02, percent-clipped=1.0 2023-06-22 13:49:00,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-22 13:49:02,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1241742.0, ans=0.125 2023-06-22 13:49:17,837 INFO [train.py:996] (3/4) Epoch 7, batch 24000, loss[loss=0.2831, simple_loss=0.3486, pruned_loss=0.1088, over 21455.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3272, pruned_loss=0.09222, over 4276958.80 frames. ], batch size: 211, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:49:17,838 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 13:49:33,457 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2773, simple_loss=0.3696, pruned_loss=0.09254, over 1796401.00 frames. 2023-06-22 13:49:33,458 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 13:50:26,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1241922.0, ans=0.0 2023-06-22 13:51:10,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1242042.0, ans=0.2 2023-06-22 13:51:13,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1242042.0, ans=0.125 2023-06-22 13:51:14,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 13:51:16,754 INFO [train.py:996] (3/4) Epoch 7, batch 24050, loss[loss=0.2364, simple_loss=0.3261, pruned_loss=0.07331, over 21759.00 frames. 
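Each validation pass above runs over the same fixed dev set (always 1796401.00 frames, so validation losses are directly comparable across epochs) and then reports the GPU high-water mark, evidently via torch.cuda.max_memory_allocated, which tracks the peak allocation since process start (or the last reset). A sketch of just the reporting step:

    import torch

    def report_max_memory(device: int = 0) -> str:
        # Peak GPU memory in MB, formatted like the log lines above.
        if not torch.cuda.is_available():
            return "Maximum memory allocated so far is 0MB"
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return f"Maximum memory allocated so far is {mb}MB"

    print(report_max_memory())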
], tot_loss[loss=0.2563, simple_loss=0.3287, pruned_loss=0.09199, over 4268308.14 frames. ], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:52:19,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-22 13:52:22,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242282.0, ans=0.1 2023-06-22 13:52:35,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.768e+02 5.092e+02 6.295e+02 1.003e+03, threshold=1.018e+03, percent-clipped=2.0 2023-06-22 13:52:58,061 INFO [train.py:996] (3/4) Epoch 7, batch 24100, loss[loss=0.2853, simple_loss=0.3555, pruned_loss=0.1075, over 21724.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3279, pruned_loss=0.09011, over 4268005.05 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:53:33,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1242462.0, ans=0.0 2023-06-22 13:54:17,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1242582.0, ans=0.125 2023-06-22 13:54:38,963 INFO [train.py:996] (3/4) Epoch 7, batch 24150, loss[loss=0.2771, simple_loss=0.3361, pruned_loss=0.1091, over 21274.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3275, pruned_loss=0.09152, over 4272666.31 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:55:57,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.873e+02 3.729e+02 4.537e+02 5.592e+02 8.815e+02, threshold=9.074e+02, percent-clipped=0.0 2023-06-22 13:56:02,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1242942.0, ans=0.05 2023-06-22 13:56:18,601 INFO [train.py:996] (3/4) Epoch 7, batch 24200, loss[loss=0.2593, simple_loss=0.3366, pruned_loss=0.09099, over 21680.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3297, pruned_loss=0.09279, over 4275676.18 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:56:37,174 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:56:45,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1243062.0, ans=0.125 2023-06-22 13:56:47,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1243062.0, ans=0.0 2023-06-22 13:57:14,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1243122.0, ans=0.0 2023-06-22 13:57:48,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1243242.0, ans=0.0 2023-06-22 13:58:03,624 INFO [train.py:996] (3/4) Epoch 7, batch 24250, loss[loss=0.2078, simple_loss=0.2942, pruned_loss=0.0607, over 21341.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3249, pruned_loss=0.08642, over 4269288.09 frames. 
], batch size: 176, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:58:03,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1243302.0, ans=0.125 2023-06-22 13:58:25,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1243362.0, ans=0.125 2023-06-22 13:59:07,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.091e+02 3.731e+02 4.711e+02 7.099e+02, threshold=7.462e+02, percent-clipped=0.0 2023-06-22 13:59:35,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1243542.0, ans=0.0 2023-06-22 13:59:38,159 INFO [train.py:996] (3/4) Epoch 7, batch 24300, loss[loss=0.2577, simple_loss=0.366, pruned_loss=0.07469, over 20769.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3214, pruned_loss=0.08168, over 4269010.39 frames. ], batch size: 607, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:59:52,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1243662.0, ans=0.1 2023-06-22 14:00:30,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1243722.0, ans=0.125 2023-06-22 14:01:17,183 INFO [train.py:996] (3/4) Epoch 7, batch 24350, loss[loss=0.2949, simple_loss=0.3525, pruned_loss=0.1187, over 21813.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3185, pruned_loss=0.08221, over 4273489.22 frames. ], batch size: 441, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:01:52,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1243962.0, ans=0.125 2023-06-22 14:01:56,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-22 14:02:11,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1244022.0, ans=0.0 2023-06-22 14:02:30,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.282e+02 3.867e+02 5.054e+02 1.141e+03, threshold=7.734e+02, percent-clipped=5.0 2023-06-22 14:02:45,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1244142.0, ans=0.2 2023-06-22 14:02:56,600 INFO [train.py:996] (3/4) Epoch 7, batch 24400, loss[loss=0.2253, simple_loss=0.3133, pruned_loss=0.06871, over 21292.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3218, pruned_loss=0.08519, over 4281602.51 frames. ], batch size: 548, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:03:33,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-22 14:03:40,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-22 14:04:42,171 INFO [train.py:996] (3/4) Epoch 7, batch 24450, loss[loss=0.3429, simple_loss=0.4253, pruned_loss=0.1302, over 21431.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3233, pruned_loss=0.08637, over 4278562.53 frames. 
], batch size: 507, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:05:57,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.418e+02 4.354e+02 5.626e+02 1.189e+03, threshold=8.709e+02, percent-clipped=8.0 2023-06-22 14:06:22,877 INFO [train.py:996] (3/4) Epoch 7, batch 24500, loss[loss=0.2244, simple_loss=0.2958, pruned_loss=0.07644, over 21133.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.322, pruned_loss=0.08569, over 4277928.40 frames. ], batch size: 608, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:07:08,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-22 14:07:24,176 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:08:11,131 INFO [train.py:996] (3/4) Epoch 7, batch 24550, loss[loss=0.2792, simple_loss=0.3471, pruned_loss=0.1057, over 21416.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3215, pruned_loss=0.08694, over 4271669.14 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:08:14,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245102.0, ans=0.1 2023-06-22 14:08:22,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1245102.0, ans=0.0 2023-06-22 14:08:33,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1245162.0, ans=0.125 2023-06-22 14:09:25,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.383e+02 4.024e+02 4.711e+02 8.418e+02, threshold=8.048e+02, percent-clipped=0.0 2023-06-22 14:09:27,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1245282.0, ans=0.0 2023-06-22 14:09:51,571 INFO [train.py:996] (3/4) Epoch 7, batch 24600, loss[loss=0.2187, simple_loss=0.2762, pruned_loss=0.08061, over 21355.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3177, pruned_loss=0.08759, over 4268956.83 frames. ], batch size: 194, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:09:57,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1245402.0, ans=0.125 2023-06-22 14:10:44,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1245522.0, ans=0.0 2023-06-22 14:11:01,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1245582.0, ans=0.125 2023-06-22 14:11:32,119 INFO [train.py:996] (3/4) Epoch 7, batch 24650, loss[loss=0.215, simple_loss=0.2864, pruned_loss=0.07181, over 21591.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3104, pruned_loss=0.08633, over 4264935.23 frames. 
], batch size: 414, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:12:06,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1245762.0, ans=0.2 2023-06-22 14:12:24,763 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:12:46,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.556e+02 4.332e+02 6.409e+02 1.192e+03, threshold=8.664e+02, percent-clipped=12.0 2023-06-22 14:13:09,302 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:13:12,565 INFO [train.py:996] (3/4) Epoch 7, batch 24700, loss[loss=0.1985, simple_loss=0.2666, pruned_loss=0.0652, over 21830.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3074, pruned_loss=0.0843, over 4261539.83 frames. ], batch size: 118, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:13:43,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-22 14:14:54,464 INFO [train.py:996] (3/4) Epoch 7, batch 24750, loss[loss=0.193, simple_loss=0.2561, pruned_loss=0.065, over 21607.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3015, pruned_loss=0.08202, over 4264903.66 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:14:56,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1246302.0, ans=0.0 2023-06-22 14:15:09,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-22 14:15:47,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1246422.0, ans=0.125 2023-06-22 14:15:54,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1246482.0, ans=0.125 2023-06-22 14:16:07,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.441e+02 4.025e+02 5.655e+02 9.587e+02, threshold=8.050e+02, percent-clipped=3.0 2023-06-22 14:16:32,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1246602.0, ans=0.0 2023-06-22 14:16:33,206 INFO [train.py:996] (3/4) Epoch 7, batch 24800, loss[loss=0.2247, simple_loss=0.2884, pruned_loss=0.08049, over 21869.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2967, pruned_loss=0.08059, over 4267357.50 frames. ], batch size: 371, lr: 4.21e-03, grad_scale: 32.0 2023-06-22 14:18:14,874 INFO [train.py:996] (3/4) Epoch 7, batch 24850, loss[loss=0.2378, simple_loss=0.3079, pruned_loss=0.08385, over 20114.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2988, pruned_loss=0.08262, over 4269838.86 frames. 
], batch size: 703, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:19:04,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1247022.0, ans=0.035 2023-06-22 14:19:28,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1247082.0, ans=0.2 2023-06-22 14:19:31,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 3.674e+02 4.443e+02 5.678e+02 1.334e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-22 14:19:56,844 INFO [train.py:996] (3/4) Epoch 7, batch 24900, loss[loss=0.2835, simple_loss=0.3481, pruned_loss=0.1094, over 21203.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3001, pruned_loss=0.08328, over 4265755.81 frames. ], batch size: 143, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:20:43,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1247322.0, ans=0.0 2023-06-22 14:21:02,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1247382.0, ans=0.5 2023-06-22 14:21:43,152 INFO [train.py:996] (3/4) Epoch 7, batch 24950, loss[loss=0.2864, simple_loss=0.3465, pruned_loss=0.1132, over 21319.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3084, pruned_loss=0.08803, over 4266795.99 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:21:47,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1247502.0, ans=0.05 2023-06-22 14:21:58,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1247562.0, ans=15.0 2023-06-22 14:22:05,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-22 14:22:24,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1247622.0, ans=0.07 2023-06-22 14:22:32,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1247622.0, ans=0.0 2023-06-22 14:23:02,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.624e+02 4.184e+02 5.539e+02 8.778e+02, threshold=8.368e+02, percent-clipped=0.0 2023-06-22 14:23:26,314 INFO [train.py:996] (3/4) Epoch 7, batch 25000, loss[loss=0.2399, simple_loss=0.3029, pruned_loss=0.08848, over 21639.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3158, pruned_loss=0.08944, over 4259205.18 frames. 
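The loss[...] / tot_loss[...] triplets in these records are consistent with the pruned-transducer objective used by this recipe: the total is a weighted sum of the simple (trivial-joiner) loss and the pruned full-joiner loss, and in every record here the numbers satisfy loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3158 + 0.08944 = 0.2473 for the batch-25000 tot_loss just above). A minimal sketch, assuming the 0.5 weight stays fixed at this point in training:

def total_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Weighted sum of the trivial-joiner ("simple") and pruned full-joiner losses.
    return simple_loss_scale * simple_loss + pruned_loss

# Reproducing the batch-25000 tot_loss record above:
print(total_loss(0.3158, 0.08944))   # -> 0.24734, logged as loss=0.2473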
], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:24:06,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1247862.0, ans=0.2 2023-06-22 14:24:13,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247922.0, ans=0.1 2023-06-22 14:24:43,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1247982.0, ans=0.0 2023-06-22 14:25:04,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1248042.0, ans=0.125 2023-06-22 14:25:10,082 INFO [train.py:996] (3/4) Epoch 7, batch 25050, loss[loss=0.227, simple_loss=0.2856, pruned_loss=0.08418, over 21598.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3091, pruned_loss=0.08744, over 4264094.37 frames. ], batch size: 415, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:25:51,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1248222.0, ans=0.0 2023-06-22 14:26:06,952 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:26:28,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 3.310e+02 3.900e+02 4.690e+02 8.099e+02, threshold=7.799e+02, percent-clipped=0.0 2023-06-22 14:26:51,214 INFO [train.py:996] (3/4) Epoch 7, batch 25100, loss[loss=0.2053, simple_loss=0.2707, pruned_loss=0.06994, over 21377.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.302, pruned_loss=0.08582, over 4265030.56 frames. ], batch size: 211, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:26:55,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-22 14:27:47,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1248582.0, ans=0.125 2023-06-22 14:27:55,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1248582.0, ans=0.125 2023-06-22 14:28:25,014 INFO [train.py:996] (3/4) Epoch 7, batch 25150, loss[loss=0.2169, simple_loss=0.2935, pruned_loss=0.07016, over 21748.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3073, pruned_loss=0.08436, over 4267454.86 frames. ], batch size: 247, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:29:00,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1248762.0, ans=0.0 2023-06-22 14:29:30,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1248882.0, ans=0.2 2023-06-22 14:29:40,289 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:29:48,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.375e+02 4.359e+02 5.711e+02 9.666e+02, threshold=8.717e+02, percent-clipped=5.0 2023-06-22 14:29:51,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1248942.0, ans=0.05 2023-06-22 14:30:06,540 INFO [train.py:996] (3/4) Epoch 7, batch 25200, loss[loss=0.1978, simple_loss=0.2866, pruned_loss=0.0545, over 21242.00 frames. 
], tot_loss[loss=0.2362, simple_loss=0.3084, pruned_loss=0.08195, over 4263265.98 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:30:16,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1249002.0, ans=0.1 2023-06-22 14:30:36,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-22 14:31:07,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1249182.0, ans=0.0 2023-06-22 14:31:15,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1249182.0, ans=0.1 2023-06-22 14:31:36,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1249242.0, ans=0.125 2023-06-22 14:31:45,943 INFO [train.py:996] (3/4) Epoch 7, batch 25250, loss[loss=0.2058, simple_loss=0.2748, pruned_loss=0.06842, over 21734.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3063, pruned_loss=0.0802, over 4263764.56 frames. ], batch size: 334, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:32:23,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1249362.0, ans=0.2 2023-06-22 14:32:51,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1249482.0, ans=0.0 2023-06-22 14:32:54,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1249482.0, ans=0.0 2023-06-22 14:33:04,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.268e+02 4.285e+02 6.443e+02 1.274e+03, threshold=8.570e+02, percent-clipped=9.0 2023-06-22 14:33:19,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-22 14:33:32,048 INFO [train.py:996] (3/4) Epoch 7, batch 25300, loss[loss=0.3214, simple_loss=0.3844, pruned_loss=0.1292, over 21434.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3055, pruned_loss=0.08063, over 4261387.09 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:33:43,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1249602.0, ans=0.0 2023-06-22 14:34:09,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1249662.0, ans=0.0 2023-06-22 14:34:20,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.10 vs. 
limit=10.0 2023-06-22 14:34:23,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1249722.0, ans=0.0 2023-06-22 14:34:24,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1249722.0, ans=0.1 2023-06-22 14:34:40,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1249782.0, ans=0.125 2023-06-22 14:34:40,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1249782.0, ans=0.125 2023-06-22 14:34:50,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1249842.0, ans=0.0 2023-06-22 14:35:12,494 INFO [train.py:996] (3/4) Epoch 7, batch 25350, loss[loss=0.2296, simple_loss=0.2837, pruned_loss=0.08774, over 20279.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3056, pruned_loss=0.07993, over 4246630.51 frames. ], batch size: 703, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:36:25,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.273e+02 3.948e+02 5.188e+02 1.185e+03, threshold=7.895e+02, percent-clipped=3.0 2023-06-22 14:36:40,632 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:36:47,856 INFO [train.py:996] (3/4) Epoch 7, batch 25400, loss[loss=0.2367, simple_loss=0.3061, pruned_loss=0.08367, over 21705.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.302, pruned_loss=0.07926, over 4239521.82 frames. ], batch size: 282, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:36:57,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1250202.0, ans=0.2 2023-06-22 14:37:35,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-22 14:38:33,451 INFO [train.py:996] (3/4) Epoch 7, batch 25450, loss[loss=0.2574, simple_loss=0.3241, pruned_loss=0.0954, over 21733.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3005, pruned_loss=0.08021, over 4236133.21 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:39:09,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1250562.0, ans=0.125 2023-06-22 14:39:14,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1250562.0, ans=0.0 2023-06-22 14:39:47,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.153e+02 3.590e+02 4.840e+02 1.017e+03, threshold=7.180e+02, percent-clipped=5.0 2023-06-22 14:40:07,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1250742.0, ans=10.0 2023-06-22 14:40:14,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1250802.0, ans=0.125 2023-06-22 14:40:16,007 INFO [train.py:996] (3/4) Epoch 7, batch 25500, loss[loss=0.2587, simple_loss=0.3377, pruned_loss=0.08982, over 21615.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.302, pruned_loss=0.0772, over 4250619.67 frames. 
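The scaling.py "Whitening" records above compare a per-module whiteness metric against a scheduled limit (e.g. metric=10.16 vs. limit=15.0); a metric of 1.0 corresponds to activations whose channel covariance has perfectly uniform eigenvalues. One standard way to compute such a metric, shown as a sketch (the exact formula in icefall's scaling.py may differ):

import torch

def whitening_metric(x):
    # x: (num_frames, num_channels) activations for one whitening group.
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]        # channel covariance, symmetric
    num_channels = cov.shape[0]
    # trace(cov^2) * C / trace(cov)^2 == sum(eig^2) * C / sum(eig)^2,
    # which equals 1.0 iff all eigenvalues are equal (fully white).
    return (cov * cov).sum() * num_channels / cov.diagonal().sum() ** 2

x = torch.randn(2000, 256)                # nearly white input
print(float(whitening_metric(x)))         # close to 1.0; a penalty applies above the limit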
], batch size: 389, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:40:17,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1250802.0, ans=0.125 2023-06-22 14:41:56,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1251102.0, ans=0.125 2023-06-22 14:41:57,260 INFO [train.py:996] (3/4) Epoch 7, batch 25550, loss[loss=0.226, simple_loss=0.3248, pruned_loss=0.0636, over 21870.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3089, pruned_loss=0.0778, over 4259823.21 frames. ], batch size: 316, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:42:48,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1251222.0, ans=0.125 2023-06-22 14:42:48,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1251222.0, ans=0.09899494936611666 2023-06-22 14:43:00,589 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:43:10,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1251282.0, ans=0.125 2023-06-22 14:43:16,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.498e+02 4.719e+02 6.187e+02 1.035e+03, threshold=9.438e+02, percent-clipped=13.0 2023-06-22 14:43:33,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1251342.0, ans=0.2 2023-06-22 14:43:38,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1251402.0, ans=0.125 2023-06-22 14:43:39,329 INFO [train.py:996] (3/4) Epoch 7, batch 25600, loss[loss=0.25, simple_loss=0.3296, pruned_loss=0.08525, over 21615.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3131, pruned_loss=0.07902, over 4261635.22 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:43:54,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1251402.0, ans=0.125 2023-06-22 14:44:07,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-22 14:44:08,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1251462.0, ans=0.2 2023-06-22 14:44:23,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1251522.0, ans=0.07 2023-06-22 14:44:31,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1251582.0, ans=0.07 2023-06-22 14:44:49,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1251582.0, ans=0.1 2023-06-22 14:45:20,818 INFO [train.py:996] (3/4) Epoch 7, batch 25650, loss[loss=0.2394, simple_loss=0.2976, pruned_loss=0.09057, over 21605.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3143, pruned_loss=0.08204, over 4253083.71 frames. 
], batch size: 298, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:45:30,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1251702.0, ans=0.0 2023-06-22 14:46:09,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1251822.0, ans=0.1 2023-06-22 14:46:37,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.765e+02 3.700e+02 4.314e+02 5.225e+02 1.015e+03, threshold=8.627e+02, percent-clipped=1.0 2023-06-22 14:47:05,268 INFO [train.py:996] (3/4) Epoch 7, batch 25700, loss[loss=0.2863, simple_loss=0.3514, pruned_loss=0.1107, over 21457.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3119, pruned_loss=0.08298, over 4257667.05 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:47:21,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1252062.0, ans=0.125 2023-06-22 14:47:27,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-22 14:47:28,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1252062.0, ans=0.125 2023-06-22 14:47:36,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-22 14:48:17,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1252182.0, ans=0.1 2023-06-22 14:48:47,653 INFO [train.py:996] (3/4) Epoch 7, batch 25750, loss[loss=0.2942, simple_loss=0.3573, pruned_loss=0.1156, over 21588.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3174, pruned_loss=0.08611, over 4262607.34 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:50:13,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.834e+02 4.976e+02 6.309e+02 1.046e+03, threshold=9.952e+02, percent-clipped=8.0 2023-06-22 14:50:19,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1252542.0, ans=0.0 2023-06-22 14:50:31,418 INFO [train.py:996] (3/4) Epoch 7, batch 25800, loss[loss=0.2754, simple_loss=0.346, pruned_loss=0.1024, over 21715.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3313, pruned_loss=0.09082, over 4264673.22 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:52:13,823 INFO [train.py:996] (3/4) Epoch 7, batch 25850, loss[loss=0.2313, simple_loss=0.2994, pruned_loss=0.08153, over 21732.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3319, pruned_loss=0.08969, over 4265120.42 frames. 
], batch size: 230, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:52:34,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1252902.0, ans=0.2 2023-06-22 14:52:59,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1252962.0, ans=0.2 2023-06-22 14:53:33,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1253082.0, ans=0.0 2023-06-22 14:53:39,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.379e+02 4.198e+02 5.392e+02 1.106e+03, threshold=8.396e+02, percent-clipped=2.0 2023-06-22 14:54:05,887 INFO [train.py:996] (3/4) Epoch 7, batch 25900, loss[loss=0.3123, simple_loss=0.4028, pruned_loss=0.1109, over 21708.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3338, pruned_loss=0.09137, over 4274945.11 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:54:23,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1253202.0, ans=0.2 2023-06-22 14:54:40,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1253262.0, ans=0.125 2023-06-22 14:54:51,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-22 14:54:59,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1253322.0, ans=0.0 2023-06-22 14:55:02,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1253322.0, ans=0.0 2023-06-22 14:55:03,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1253382.0, ans=0.1 2023-06-22 14:55:10,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1253382.0, ans=0.125 2023-06-22 14:55:14,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.33 vs. limit=22.5 2023-06-22 14:55:52,697 INFO [train.py:996] (3/4) Epoch 7, batch 25950, loss[loss=0.2452, simple_loss=0.3154, pruned_loss=0.08751, over 21873.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3398, pruned_loss=0.09436, over 4281791.08 frames. 
], batch size: 107, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:56:18,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1253562.0, ans=0.07 2023-06-22 14:56:27,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1253622.0, ans=0.0 2023-06-22 14:57:01,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1253682.0, ans=0.1 2023-06-22 14:57:11,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.589e+02 4.463e+02 5.427e+02 9.904e+02, threshold=8.927e+02, percent-clipped=3.0 2023-06-22 14:57:23,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1253742.0, ans=0.125 2023-06-22 14:57:37,678 INFO [train.py:996] (3/4) Epoch 7, batch 26000, loss[loss=0.2714, simple_loss=0.3454, pruned_loss=0.09869, over 21353.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3379, pruned_loss=0.09242, over 4274932.09 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:58:22,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1253922.0, ans=0.04949747468305833 2023-06-22 14:58:30,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1253982.0, ans=0.0 2023-06-22 14:58:30,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1253982.0, ans=0.0 2023-06-22 14:58:39,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-22 14:59:11,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1254042.0, ans=0.125 2023-06-22 14:59:14,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1254042.0, ans=0.2 2023-06-22 14:59:14,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-22 14:59:18,542 INFO [train.py:996] (3/4) Epoch 7, batch 26050, loss[loss=0.2424, simple_loss=0.304, pruned_loss=0.09045, over 21482.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3377, pruned_loss=0.09354, over 4273580.79 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:59:25,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254102.0, ans=0.1 2023-06-22 14:59:31,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1254102.0, ans=0.125 2023-06-22 15:00:02,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.41 vs. 
limit=12.0 2023-06-22 15:00:37,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.714e+02 3.622e+02 4.374e+02 5.321e+02 8.486e+02, threshold=8.748e+02, percent-clipped=0.0 2023-06-22 15:00:39,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1254342.0, ans=0.2 2023-06-22 15:00:56,280 INFO [train.py:996] (3/4) Epoch 7, batch 26100, loss[loss=0.2566, simple_loss=0.3147, pruned_loss=0.09928, over 21686.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3322, pruned_loss=0.09215, over 4280605.23 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:01:21,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1254462.0, ans=0.125 2023-06-22 15:01:37,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1254522.0, ans=0.125 2023-06-22 15:02:19,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1254642.0, ans=0.09899494936611666 2023-06-22 15:02:30,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254642.0, ans=0.1 2023-06-22 15:02:36,848 INFO [train.py:996] (3/4) Epoch 7, batch 26150, loss[loss=0.215, simple_loss=0.2938, pruned_loss=0.06804, over 21800.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3283, pruned_loss=0.09195, over 4285236.87 frames. ], batch size: 247, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:02:53,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-22 15:02:59,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1254762.0, ans=0.125 2023-06-22 15:03:06,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1254762.0, ans=15.0 2023-06-22 15:04:04,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.200e+02 3.625e+02 4.544e+02 6.697e+02, threshold=7.250e+02, percent-clipped=0.0 2023-06-22 15:04:19,693 INFO [train.py:996] (3/4) Epoch 7, batch 26200, loss[loss=0.2573, simple_loss=0.3346, pruned_loss=0.08996, over 21753.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3292, pruned_loss=0.09003, over 4285103.99 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:04:20,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1255002.0, ans=0.07 2023-06-22 15:04:42,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1255062.0, ans=0.035 2023-06-22 15:04:55,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255122.0, ans=0.1 2023-06-22 15:05:25,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. 
limit=15.0 2023-06-22 15:05:57,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1255242.0, ans=0.0 2023-06-22 15:05:59,764 INFO [train.py:996] (3/4) Epoch 7, batch 26250, loss[loss=0.241, simple_loss=0.3146, pruned_loss=0.08365, over 21684.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3323, pruned_loss=0.08893, over 4281409.44 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:07:07,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1255482.0, ans=0.125 2023-06-22 15:07:12,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-22 15:07:14,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1255482.0, ans=0.125 2023-06-22 15:07:24,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.543e+02 3.568e+02 4.427e+02 6.029e+02 1.421e+03, threshold=8.855e+02, percent-clipped=13.0 2023-06-22 15:07:38,976 INFO [train.py:996] (3/4) Epoch 7, batch 26300, loss[loss=0.2353, simple_loss=0.3093, pruned_loss=0.08071, over 21841.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3308, pruned_loss=0.08981, over 4285299.00 frames. ], batch size: 124, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:08:52,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1255782.0, ans=22.5 2023-06-22 15:09:03,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-06-22 15:09:17,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-22 15:09:19,812 INFO [train.py:996] (3/4) Epoch 7, batch 26350, loss[loss=0.2756, simple_loss=0.3429, pruned_loss=0.1042, over 21599.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3293, pruned_loss=0.09076, over 4285014.87 frames. ], batch size: 415, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:10:29,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1256082.0, ans=0.125 2023-06-22 15:10:39,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1256082.0, ans=0.1 2023-06-22 15:10:45,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 3.544e+02 4.038e+02 5.377e+02 1.137e+03, threshold=8.075e+02, percent-clipped=4.0 2023-06-22 15:10:49,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1256142.0, ans=0.2 2023-06-22 15:10:53,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-22 15:11:00,706 INFO [train.py:996] (3/4) Epoch 7, batch 26400, loss[loss=0.2203, simple_loss=0.2798, pruned_loss=0.08042, over 21836.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3232, pruned_loss=0.09075, over 4283028.44 frames. 
], batch size: 372, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:12:11,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256382.0, ans=0.1 2023-06-22 15:12:40,470 INFO [train.py:996] (3/4) Epoch 7, batch 26450, loss[loss=0.2444, simple_loss=0.3536, pruned_loss=0.06763, over 21729.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3244, pruned_loss=0.08991, over 4278463.72 frames. ], batch size: 332, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:13:29,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1256562.0, ans=0.125 2023-06-22 15:13:35,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-22 15:13:52,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1256682.0, ans=0.125 2023-06-22 15:14:08,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.688e+02 4.621e+02 8.318e+02 2.033e+03, threshold=9.242e+02, percent-clipped=27.0 2023-06-22 15:14:37,753 INFO [train.py:996] (3/4) Epoch 7, batch 26500, loss[loss=0.244, simple_loss=0.333, pruned_loss=0.07747, over 21680.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3252, pruned_loss=0.08795, over 4270742.06 frames. ], batch size: 389, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:15:37,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 15:16:17,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1257042.0, ans=0.125 2023-06-22 15:16:27,013 INFO [train.py:996] (3/4) Epoch 7, batch 26550, loss[loss=0.1973, simple_loss=0.3167, pruned_loss=0.03898, over 20716.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3199, pruned_loss=0.08465, over 4256206.72 frames. ], batch size: 608, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:17:46,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1257282.0, ans=0.125 2023-06-22 15:17:48,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1257282.0, ans=0.2 2023-06-22 15:17:55,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.702e+02 5.125e+02 7.307e+02 1.235e+03, threshold=1.025e+03, percent-clipped=15.0 2023-06-22 15:18:08,956 INFO [train.py:996] (3/4) Epoch 7, batch 26600, loss[loss=0.2423, simple_loss=0.3051, pruned_loss=0.08971, over 21729.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3205, pruned_loss=0.08149, over 4263467.70 frames. 
], batch size: 112, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:18:09,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1257402.0, ans=0.0 2023-06-22 15:18:10,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1257402.0, ans=0.2 2023-06-22 15:18:17,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1257402.0, ans=0.5 2023-06-22 15:18:39,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257462.0, ans=0.1 2023-06-22 15:19:23,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1257582.0, ans=0.95 2023-06-22 15:19:31,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1257642.0, ans=0.125 2023-06-22 15:19:32,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257642.0, ans=0.1 2023-06-22 15:19:46,580 INFO [train.py:996] (3/4) Epoch 7, batch 26650, loss[loss=0.1619, simple_loss=0.2465, pruned_loss=0.03861, over 21581.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3132, pruned_loss=0.08053, over 4262802.78 frames. ], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:19:56,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1257702.0, ans=0.0 2023-06-22 15:20:05,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1257762.0, ans=0.125 2023-06-22 15:20:31,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1257822.0, ans=0.125 2023-06-22 15:21:12,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.375e+02 4.059e+02 5.279e+02 9.271e+02, threshold=8.118e+02, percent-clipped=0.0 2023-06-22 15:21:25,457 INFO [train.py:996] (3/4) Epoch 7, batch 26700, loss[loss=0.2179, simple_loss=0.2874, pruned_loss=0.07421, over 21813.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3067, pruned_loss=0.07795, over 4266795.13 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:21:30,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1258002.0, ans=0.0 2023-06-22 15:21:38,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-22 15:21:49,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1258062.0, ans=0.125 2023-06-22 15:21:54,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1258062.0, ans=0.125 2023-06-22 15:22:25,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1258122.0, ans=0.125 2023-06-22 15:23:08,479 INFO [train.py:996] (3/4) Epoch 7, batch 26750, loss[loss=0.2924, simple_loss=0.3597, pruned_loss=0.1126, over 21383.00 frames. 
], tot_loss[loss=0.2312, simple_loss=0.3076, pruned_loss=0.07737, over 4275505.79 frames. ], batch size: 507, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:23:56,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1258422.0, ans=0.125 2023-06-22 15:24:04,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1258422.0, ans=0.07 2023-06-22 15:24:11,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1258422.0, ans=0.2 2023-06-22 15:24:37,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.345e+02 4.329e+02 5.688e+02 1.116e+03, threshold=8.657e+02, percent-clipped=5.0 2023-06-22 15:24:50,375 INFO [train.py:996] (3/4) Epoch 7, batch 26800, loss[loss=0.2391, simple_loss=0.3075, pruned_loss=0.08536, over 21860.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3164, pruned_loss=0.08267, over 4273065.43 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:26:07,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1258782.0, ans=15.0 2023-06-22 15:26:37,049 INFO [train.py:996] (3/4) Epoch 7, batch 26850, loss[loss=0.2394, simple_loss=0.2914, pruned_loss=0.09371, over 21823.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3163, pruned_loss=0.08417, over 4273221.54 frames. ], batch size: 107, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:27:12,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-22 15:27:59,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 3.440e+02 3.815e+02 4.750e+02 7.886e+02, threshold=7.630e+02, percent-clipped=0.0 2023-06-22 15:28:17,070 INFO [train.py:996] (3/4) Epoch 7, batch 26900, loss[loss=0.2314, simple_loss=0.2808, pruned_loss=0.09099, over 21481.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3073, pruned_loss=0.08359, over 4277711.00 frames. ], batch size: 195, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:28:17,348 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:29:04,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1259322.0, ans=0.125 2023-06-22 15:29:06,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-22 15:29:07,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1259322.0, ans=0.0 2023-06-22 15:29:56,673 INFO [train.py:996] (3/4) Epoch 7, batch 26950, loss[loss=0.3036, simple_loss=0.384, pruned_loss=0.1116, over 21644.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.307, pruned_loss=0.08335, over 4280029.12 frames. 
], batch size: 414, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:30:00,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1259502.0, ans=0.125 2023-06-22 15:30:49,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1259622.0, ans=0.125 2023-06-22 15:31:18,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.421e+02 4.204e+02 5.403e+02 1.152e+03, threshold=8.409e+02, percent-clipped=7.0 2023-06-22 15:31:31,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1259742.0, ans=0.125 2023-06-22 15:31:34,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1259802.0, ans=0.5 2023-06-22 15:31:40,909 INFO [train.py:996] (3/4) Epoch 7, batch 27000, loss[loss=0.2384, simple_loss=0.3337, pruned_loss=0.0715, over 21638.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3064, pruned_loss=0.08082, over 4280618.94 frames. ], batch size: 442, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:31:40,910 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 15:31:54,470 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6069, 2.8414, 2.7719, 2.6962], device='cuda:3') 2023-06-22 15:31:59,800 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2427, simple_loss=0.3424, pruned_loss=0.07152, over 1796401.00 frames. 2023-06-22 15:31:59,800 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 15:32:13,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1259802.0, ans=0.0 2023-06-22 15:32:24,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1259862.0, ans=0.125 2023-06-22 15:33:02,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1259982.0, ans=0.2 2023-06-22 15:33:38,743 INFO [train.py:996] (3/4) Epoch 7, batch 27050, loss[loss=0.2595, simple_loss=0.3327, pruned_loss=0.09312, over 21575.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3077, pruned_loss=0.07713, over 4280389.68 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:34:19,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1260222.0, ans=0.1 2023-06-22 15:34:25,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1260222.0, ans=0.09899494936611666 2023-06-22 15:35:07,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.078e+02 3.767e+02 4.545e+02 7.806e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-22 15:35:09,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-06-22 15:35:23,459 INFO [train.py:996] (3/4) Epoch 7, batch 27100, loss[loss=0.2067, simple_loss=0.2779, pruned_loss=0.06781, over 21683.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07836, over 4288112.50 frames. 
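During the validation pass above, zipformer.py also prints attn_weights_entropy, one entropy value per attention head (the tensor([1.6069, 2.8414, 2.7719, 2.6962]) line); low entropy indicates a head concentrating on few positions. A sketch of that diagnostic, with shapes assumed for illustration:

import torch

def attn_weights_entropy(attn):
    # attn: (num_heads, tgt_len, src_len) attention weights; each row sums to 1.
    ent = -(attn * (attn + 1.0e-20).log()).sum(dim=-1)   # entropy per (head, position)
    return ent.mean(dim=-1)                              # average entropy per head

attn = torch.softmax(torch.randn(4, 32, 32), dim=-1)
print(attn_weights_entropy(attn))                        # cf. tensor([1.6069, 2.8414, ...]) above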
], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:35:26,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1260402.0, ans=0.0 2023-06-22 15:35:31,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1260402.0, ans=0.04949747468305833 2023-06-22 15:35:40,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260462.0, ans=0.125 2023-06-22 15:35:41,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1260462.0, ans=0.0 2023-06-22 15:36:31,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260582.0, ans=0.125 2023-06-22 15:37:02,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-22 15:37:02,875 INFO [train.py:996] (3/4) Epoch 7, batch 27150, loss[loss=0.2907, simple_loss=0.3722, pruned_loss=0.1046, over 21813.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3199, pruned_loss=0.08142, over 4279324.08 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:37:13,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1260702.0, ans=0.2 2023-06-22 15:38:02,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260882.0, ans=0.125 2023-06-22 15:38:31,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 4.019e+02 5.172e+02 7.242e+02 1.500e+03, threshold=1.034e+03, percent-clipped=23.0 2023-06-22 15:38:42,498 INFO [train.py:996] (3/4) Epoch 7, batch 27200, loss[loss=0.3244, simple_loss=0.3897, pruned_loss=0.1295, over 21461.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3306, pruned_loss=0.08543, over 4279237.05 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:38:47,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1261002.0, ans=0.0 2023-06-22 15:39:35,481 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:39:43,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1261122.0, ans=0.2 2023-06-22 15:40:03,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261182.0, ans=0.1 2023-06-22 15:40:07,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1261242.0, ans=0.95 2023-06-22 15:40:23,216 INFO [train.py:996] (3/4) Epoch 7, batch 27250, loss[loss=0.2476, simple_loss=0.3166, pruned_loss=0.08927, over 20649.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3346, pruned_loss=0.09041, over 4276977.49 frames. 
], batch size: 607, lr: 4.18e-03, grad_scale: 32.0 2023-06-22 15:41:32,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1261482.0, ans=0.2 2023-06-22 15:41:47,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-22 15:41:52,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1261542.0, ans=0.125 2023-06-22 15:41:54,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 3.741e+02 4.375e+02 5.438e+02 9.965e+02, threshold=8.750e+02, percent-clipped=0.0 2023-06-22 15:42:00,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261542.0, ans=0.1 2023-06-22 15:42:08,894 INFO [train.py:996] (3/4) Epoch 7, batch 27300, loss[loss=0.2678, simple_loss=0.3257, pruned_loss=0.105, over 20002.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3361, pruned_loss=0.09199, over 4276560.40 frames. ], batch size: 702, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:42:16,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-22 15:43:18,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1261782.0, ans=0.125 2023-06-22 15:43:48,243 INFO [train.py:996] (3/4) Epoch 7, batch 27350, loss[loss=0.2755, simple_loss=0.341, pruned_loss=0.105, over 21289.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3384, pruned_loss=0.09225, over 4276760.24 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:44:13,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1261962.0, ans=0.0 2023-06-22 15:44:41,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1262022.0, ans=0.125 2023-06-22 15:44:46,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1262022.0, ans=0.0 2023-06-22 15:44:48,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-22 15:45:15,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.868e+02 4.861e+02 6.523e+02 1.170e+03, threshold=9.722e+02, percent-clipped=10.0 2023-06-22 15:45:25,585 INFO [train.py:996] (3/4) Epoch 7, batch 27400, loss[loss=0.2343, simple_loss=0.2977, pruned_loss=0.0854, over 21683.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3328, pruned_loss=0.09106, over 4281053.50 frames. ], batch size: 282, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:45:40,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. 
limit=15.0 2023-06-22 15:46:35,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262382.0, ans=0.1 2023-06-22 15:46:49,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1262442.0, ans=0.0 2023-06-22 15:47:09,263 INFO [train.py:996] (3/4) Epoch 7, batch 27450, loss[loss=0.2717, simple_loss=0.3443, pruned_loss=0.09953, over 21400.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3262, pruned_loss=0.08925, over 4277952.96 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:47:32,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-22 15:48:28,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 3.611e+02 4.155e+02 4.904e+02 8.641e+02, threshold=8.310e+02, percent-clipped=0.0 2023-06-22 15:48:41,010 INFO [train.py:996] (3/4) Epoch 7, batch 27500, loss[loss=0.2685, simple_loss=0.3295, pruned_loss=0.1038, over 21903.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3249, pruned_loss=0.09006, over 4283723.65 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:49:02,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-22 15:49:21,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 15:49:29,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-22 15:50:23,456 INFO [train.py:996] (3/4) Epoch 7, batch 27550, loss[loss=0.2232, simple_loss=0.2799, pruned_loss=0.08328, over 21221.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3192, pruned_loss=0.08679, over 4277946.62 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:50:24,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-22 15:50:36,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1263102.0, ans=0.125 2023-06-22 15:51:46,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1263342.0, ans=0.125 2023-06-22 15:51:46,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1263342.0, ans=0.0 2023-06-22 15:51:47,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.463e+02 4.160e+02 5.154e+02 1.063e+03, threshold=8.319e+02, percent-clipped=3.0 2023-06-22 15:52:01,052 INFO [train.py:996] (3/4) Epoch 7, batch 27600, loss[loss=0.2261, simple_loss=0.2866, pruned_loss=0.08281, over 21374.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3129, pruned_loss=0.08587, over 4276474.25 frames. 
], batch size: 177, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:52:15,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1263462.0, ans=0.0 2023-06-22 15:52:24,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=12.0 2023-06-22 15:52:53,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1263582.0, ans=0.125 2023-06-22 15:53:14,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1263642.0, ans=0.0 2023-06-22 15:53:20,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1263642.0, ans=0.125 2023-06-22 15:53:32,808 INFO [train.py:996] (3/4) Epoch 7, batch 27650, loss[loss=0.2504, simple_loss=0.3205, pruned_loss=0.09014, over 15965.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3071, pruned_loss=0.0853, over 4269496.13 frames. ], batch size: 61, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:53:34,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263702.0, ans=0.1 2023-06-22 15:53:36,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1263702.0, ans=0.0 2023-06-22 15:53:53,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1263762.0, ans=0.07 2023-06-22 15:53:56,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1263762.0, ans=0.0 2023-06-22 15:54:02,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-22 15:54:03,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1263762.0, ans=0.125 2023-06-22 15:54:08,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1263822.0, ans=0.125 2023-06-22 15:54:25,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1263822.0, ans=0.125 2023-06-22 15:54:58,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.224e+02 3.872e+02 5.377e+02 9.163e+02, threshold=7.744e+02, percent-clipped=1.0 2023-06-22 15:55:10,429 INFO [train.py:996] (3/4) Epoch 7, batch 27700, loss[loss=0.2164, simple_loss=0.301, pruned_loss=0.06588, over 21733.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3074, pruned_loss=0.08346, over 4262055.95 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:55:27,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1264002.0, ans=0.125 2023-06-22 15:56:46,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 15:56:53,348 INFO [train.py:996] (3/4) Epoch 7, batch 27750, loss[loss=0.1878, simple_loss=0.2772, pruned_loss=0.04916, over 21678.00 frames. 
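Most of the [scaling.py:182] traffic in this log is ScheduledFloat: hyperparameters (balancer probabilities, skip rates, dropout_p, whitening limits) whose value is a function of the global batch_count, with the current value printed as ans=... . A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; by this point in training (batch_count ~ 1.26e6) most schedules sit on their final constant, which is why the same ans values repeat:

    class ScheduledFloatSketch:
        # A float that depends on the global batch count.  Piecewise-linear
        # interpolation between breakpoints is an assumption made for this
        # sketch, not code quoted from scaling.py.
        def __init__(self, *points):
            self.points = sorted(points)   # (batch_count, value) pairs

        def __call__(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # A dropout-like probability that decays early in training, then holds:
    prob = ScheduledFloatSketch((0, 0.5), (4000, 0.25), (8000, 0.125))
    print(prob(0), prob(2000), prob(1259502))   # 0.5 0.375 0.125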
], tot_loss[loss=0.2388, simple_loss=0.3114, pruned_loss=0.08316, over 4260705.62 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:57:15,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1264362.0, ans=0.125 2023-06-22 15:58:17,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.542e+02 4.380e+02 5.826e+02 1.163e+03, threshold=8.759e+02, percent-clipped=13.0 2023-06-22 15:58:19,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1264542.0, ans=0.0 2023-06-22 15:58:26,074 INFO [train.py:996] (3/4) Epoch 7, batch 27800, loss[loss=0.2096, simple_loss=0.2976, pruned_loss=0.06085, over 21411.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3101, pruned_loss=0.0831, over 4274336.95 frames. ], batch size: 548, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:58:36,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-22 15:58:45,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1264662.0, ans=0.125 2023-06-22 15:59:14,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1264722.0, ans=0.0 2023-06-22 15:59:19,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1264722.0, ans=0.0 2023-06-22 15:59:22,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1264782.0, ans=0.025 2023-06-22 15:59:56,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-06-22 16:00:09,292 INFO [train.py:996] (3/4) Epoch 7, batch 27850, loss[loss=0.2574, simple_loss=0.3345, pruned_loss=0.09012, over 21352.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3102, pruned_loss=0.08424, over 4289931.70 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:00:23,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-22 16:00:37,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1264962.0, ans=0.015 2023-06-22 16:00:54,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1265022.0, ans=0.125 2023-06-22 16:01:21,666 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:01:31,639 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:01:39,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1265142.0, ans=0.0 2023-06-22 16:01:44,416 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.686e+02 3.358e+02 4.006e+02 5.079e+02 1.358e+03, threshold=8.013e+02, percent-clipped=6.0 2023-06-22 16:01:50,946 INFO [train.py:996] (3/4) Epoch 7, batch 27900, loss[loss=0.2891, simple_loss=0.41, pruned_loss=0.08411, over 20801.00 frames. 
], tot_loss[loss=0.2454, simple_loss=0.3194, pruned_loss=0.08574, over 4287795.64 frames. ], batch size: 607, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:02:05,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1265202.0, ans=0.2 2023-06-22 16:02:11,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1265262.0, ans=0.125 2023-06-22 16:03:30,618 INFO [train.py:996] (3/4) Epoch 7, batch 27950, loss[loss=0.2374, simple_loss=0.3232, pruned_loss=0.07577, over 21973.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3178, pruned_loss=0.08195, over 4285952.14 frames. ], batch size: 317, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:04:22,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1265622.0, ans=0.125 2023-06-22 16:04:39,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1265682.0, ans=0.125 2023-06-22 16:05:01,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.331e+02 4.215e+02 5.224e+02 1.025e+03, threshold=8.430e+02, percent-clipped=5.0 2023-06-22 16:05:13,096 INFO [train.py:996] (3/4) Epoch 7, batch 28000, loss[loss=0.2358, simple_loss=0.3105, pruned_loss=0.08052, over 21902.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3154, pruned_loss=0.0792, over 4288480.22 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:05:56,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1265922.0, ans=0.2 2023-06-22 16:06:14,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1265982.0, ans=0.125 2023-06-22 16:06:17,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1265982.0, ans=0.0 2023-06-22 16:06:30,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-22 16:06:52,780 INFO [train.py:996] (3/4) Epoch 7, batch 28050, loss[loss=0.3007, simple_loss=0.3739, pruned_loss=0.1138, over 21549.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3146, pruned_loss=0.08097, over 4294293.48 frames. 
], batch size: 471, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:07:06,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1266102.0, ans=0.125 2023-06-22 16:07:12,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1266162.0, ans=0.0 2023-06-22 16:07:32,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1266222.0, ans=0.125 2023-06-22 16:07:53,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1266282.0, ans=0.125 2023-06-22 16:08:19,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1266342.0, ans=0.125 2023-06-22 16:08:19,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1266342.0, ans=0.125 2023-06-22 16:08:23,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1266342.0, ans=0.0 2023-06-22 16:08:25,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.562e+02 4.496e+02 6.296e+02 1.484e+03, threshold=8.993e+02, percent-clipped=8.0 2023-06-22 16:08:31,022 INFO [train.py:996] (3/4) Epoch 7, batch 28100, loss[loss=0.2299, simple_loss=0.2824, pruned_loss=0.08868, over 21195.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3109, pruned_loss=0.08147, over 4290097.91 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:09:01,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1266462.0, ans=0.02 2023-06-22 16:09:28,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1266522.0, ans=0.05 2023-06-22 16:10:07,979 INFO [train.py:996] (3/4) Epoch 7, batch 28150, loss[loss=0.2395, simple_loss=0.2886, pruned_loss=0.0952, over 21150.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3043, pruned_loss=0.08155, over 4283335.92 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:10:21,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1266702.0, ans=0.95 2023-06-22 16:10:37,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1266762.0, ans=0.0 2023-06-22 16:11:05,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-22 16:11:35,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-22 16:11:39,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.795e+02 3.363e+02 3.938e+02 4.881e+02 1.435e+03, threshold=7.877e+02, percent-clipped=4.0 2023-06-22 16:11:40,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1266942.0, ans=0.125 2023-06-22 16:11:46,175 INFO [train.py:996] (3/4) Epoch 7, batch 28200, loss[loss=0.2256, simple_loss=0.2901, pruned_loss=0.08057, over 21775.00 frames. 
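The grad_scale field in the batch summaries (moving between 8.0, 16.0 and 32.0 in this stretch) is the fp16 loss-scaling factor, and its halving and doubling follows the usual dynamic-loss-scaling rule: back off when a step produces inf/nan gradients, grow again after a long run of clean steps. A generic sketch of that rule in the style of torch.cuda.amp.GradScaler (the backoff/growth constants here are assumptions; icefall's scaler may use different ones):

    def update_grad_scale(scale, found_inf, clean_steps,
                          backoff=0.5, growth=2.0, growth_interval=2000):
        # One update of a dynamic fp16 loss scale.
        # Returns (new_scale, new_clean_step_count).
        if found_inf:
            # Overflow: skip the optimizer step and halve the scale.
            return scale * backoff, 0
        clean_steps += 1
        if clean_steps >= growth_interval:
            # A long run of finite gradients: safe to scale up again.
            return scale * growth, 0
        return scale, clean_steps

    scale, clean = update_grad_scale(32.0, found_inf=True, clean_steps=150)
    print(scale)   # 16.0, matching drops such as 32.0 -> 16.0 between
                   # batches 27000 and 27050 earlier in this log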
], tot_loss[loss=0.235, simple_loss=0.3028, pruned_loss=0.08356, over 4284995.20 frames. ], batch size: 107, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:11:49,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1267002.0, ans=0.2 2023-06-22 16:12:04,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1267002.0, ans=0.1 2023-06-22 16:12:35,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-22 16:13:04,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1267242.0, ans=0.2 2023-06-22 16:13:16,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-06-22 16:13:24,174 INFO [train.py:996] (3/4) Epoch 7, batch 28250, loss[loss=0.2651, simple_loss=0.3173, pruned_loss=0.1065, over 21526.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3066, pruned_loss=0.08597, over 4287449.34 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:13:26,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1267302.0, ans=0.125 2023-06-22 16:13:43,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1267362.0, ans=0.125 2023-06-22 16:14:22,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-22 16:14:52,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1267542.0, ans=0.5 2023-06-22 16:14:56,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.948e+02 3.937e+02 4.825e+02 6.344e+02 1.441e+03, threshold=9.649e+02, percent-clipped=9.0 2023-06-22 16:15:07,518 INFO [train.py:996] (3/4) Epoch 7, batch 28300, loss[loss=0.2283, simple_loss=0.3229, pruned_loss=0.06686, over 21497.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3064, pruned_loss=0.0844, over 4275951.06 frames. ], batch size: 471, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:15:08,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1267602.0, ans=0.125 2023-06-22 16:15:34,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267662.0, ans=0.125 2023-06-22 16:15:54,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=22.5 2023-06-22 16:16:37,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1267842.0, ans=0.125 2023-06-22 16:16:39,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1267842.0, ans=0.125 2023-06-22 16:16:51,632 INFO [train.py:996] (3/4) Epoch 7, batch 28350, loss[loss=0.2065, simple_loss=0.2897, pruned_loss=0.06163, over 21492.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3034, pruned_loss=0.07843, over 4279066.26 frames. ], batch size: 389, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:16:51,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1267902.0, ans=0.125 2023-06-22 16:17:30,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1268022.0, ans=0.2 2023-06-22 16:17:57,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268082.0, ans=0.125 2023-06-22 16:18:16,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.369e+02 4.514e+02 6.821e+02 1.544e+03, threshold=9.028e+02, percent-clipped=6.0 2023-06-22 16:18:28,277 INFO [train.py:996] (3/4) Epoch 7, batch 28400, loss[loss=0.224, simple_loss=0.2886, pruned_loss=0.07967, over 21586.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3004, pruned_loss=0.0785, over 4280111.62 frames. ], batch size: 231, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:18:43,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1268262.0, ans=0.0 2023-06-22 16:18:47,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-22 16:18:48,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268262.0, ans=0.0 2023-06-22 16:19:25,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1268322.0, ans=0.07 2023-06-22 16:19:42,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1268382.0, ans=0.125 2023-06-22 16:20:10,439 INFO [train.py:996] (3/4) Epoch 7, batch 28450, loss[loss=0.2322, simple_loss=0.3061, pruned_loss=0.07918, over 20748.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3054, pruned_loss=0.08213, over 4278317.21 frames. ], batch size: 611, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:20:36,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-22 16:20:47,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268622.0, ans=0.125 2023-06-22 16:21:04,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1268622.0, ans=0.125 2023-06-22 16:21:07,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1268682.0, ans=0.05 2023-06-22 16:21:28,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1268742.0, ans=0.125 2023-06-22 16:21:30,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-22 16:21:44,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.754e+02 4.591e+02 6.046e+02 1.139e+03, threshold=9.182e+02, percent-clipped=3.0 2023-06-22 16:21:49,023 INFO [train.py:996] (3/4) Epoch 7, batch 28500, loss[loss=0.2239, simple_loss=0.2892, pruned_loss=0.07935, over 21259.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.31, pruned_loss=0.08579, over 4288167.76 frames. ], batch size: 608, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:23:13,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1269042.0, ans=0.0 2023-06-22 16:23:33,163 INFO [train.py:996] (3/4) Epoch 7, batch 28550, loss[loss=0.2116, simple_loss=0.2871, pruned_loss=0.06804, over 21873.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3191, pruned_loss=0.08834, over 4288829.76 frames. ], batch size: 98, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:23:39,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1269102.0, ans=0.2 2023-06-22 16:23:46,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1269102.0, ans=0.2 2023-06-22 16:23:52,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1269162.0, ans=0.5 2023-06-22 16:25:07,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.553e+02 4.352e+02 5.602e+02 1.151e+03, threshold=8.703e+02, percent-clipped=1.0 2023-06-22 16:25:10,382 INFO [train.py:996] (3/4) Epoch 7, batch 28600, loss[loss=0.2606, simple_loss=0.3363, pruned_loss=0.09247, over 21565.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3243, pruned_loss=0.09022, over 4282525.19 frames. ], batch size: 112, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:25:18,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1269402.0, ans=0.125 2023-06-22 16:25:56,056 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:25:56,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-22 16:26:25,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269582.0, ans=0.1 2023-06-22 16:26:28,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1269642.0, ans=0.125 2023-06-22 16:26:48,951 INFO [train.py:996] (3/4) Epoch 7, batch 28650, loss[loss=0.3221, simple_loss=0.356, pruned_loss=0.1441, over 21417.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3179, pruned_loss=0.08944, over 4283311.91 frames. ], batch size: 509, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:27:58,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.08 vs. 
limit=15.0 2023-06-22 16:27:59,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269882.0, ans=0.1 2023-06-22 16:28:15,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1269942.0, ans=0.125 2023-06-22 16:28:24,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 4.036e+02 5.079e+02 7.151e+02 1.408e+03, threshold=1.016e+03, percent-clipped=11.0 2023-06-22 16:28:32,060 INFO [train.py:996] (3/4) Epoch 7, batch 28700, loss[loss=0.2278, simple_loss=0.3036, pruned_loss=0.07603, over 21660.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3168, pruned_loss=0.09011, over 4279835.66 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:28:35,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1270002.0, ans=0.125 2023-06-22 16:28:38,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1270002.0, ans=0.0 2023-06-22 16:28:46,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1270062.0, ans=0.0 2023-06-22 16:29:11,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1270122.0, ans=0.1 2023-06-22 16:29:26,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1270122.0, ans=0.0 2023-06-22 16:29:33,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-22 16:29:50,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1270242.0, ans=0.1 2023-06-22 16:30:09,815 INFO [train.py:996] (3/4) Epoch 7, batch 28750, loss[loss=0.2532, simple_loss=0.3177, pruned_loss=0.09439, over 21338.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3176, pruned_loss=0.09042, over 4283382.79 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:30:13,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-22 16:30:24,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.33 vs. 
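The [scaling.py:962] Whitening lines compare a per-module statistic of the activations against a limit; larger metric values mean the (group) covariance of the features is further from a multiple of the identity, i.e. the channels are less "white". One standard statistic with exactly this behavior is mean(diag(C^2)) / mean(diag(C))^2, which equals 1.0 for isotropic features and grows with the eigenvalue spread; the sketch below computes it under that assumption (the exact formula in scaling.py may differ):

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # Eigenvalue-spread statistic of the kind logged as
        # "metric=... vs. limit=...".  Equals 1.0 when each group's
        # centered covariance is a multiple of the identity.
        x = x.reshape(-1, x.shape[-1])                   # (frames, channels)
        frames, channels = x.shape
        cpg = channels // num_groups                     # channels per group
        x = x.reshape(frames, num_groups, cpg).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)              # center each group
        covar = torch.matmul(x.transpose(1, 2), x)       # (groups, cpg, cpg)
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
        mean_diag_of_sq = (covar ** 2).sum() / (num_groups * cpg)
        return (mean_diag_of_sq / (mean_diag ** 2 + 1e-20)).item()

    white = torch.randn(2000, 256)                       # isotropic -> ~1.0
    skewed = white * torch.linspace(0.1, 3.0, 256)       # uneven variances
    print(whitening_metric(white), whitening_metric(skewed))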
limit=10.0 2023-06-22 16:30:37,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1270362.0, ans=0.2 2023-06-22 16:31:06,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1270482.0, ans=0.035 2023-06-22 16:31:24,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1270482.0, ans=0.125 2023-06-22 16:31:43,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1270542.0, ans=0.125 2023-06-22 16:31:45,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 3.311e+02 3.906e+02 4.905e+02 1.219e+03, threshold=7.811e+02, percent-clipped=5.0 2023-06-22 16:31:49,098 INFO [train.py:996] (3/4) Epoch 7, batch 28800, loss[loss=0.3028, simple_loss=0.3632, pruned_loss=0.1212, over 21339.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3204, pruned_loss=0.09093, over 4282434.40 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:32:05,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1270602.0, ans=15.0 2023-06-22 16:33:00,523 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:33:30,948 INFO [train.py:996] (3/4) Epoch 7, batch 28850, loss[loss=0.2423, simple_loss=0.3013, pruned_loss=0.09159, over 21485.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3211, pruned_loss=0.09205, over 4291645.26 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:34:00,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1270962.0, ans=0.125 2023-06-22 16:34:15,189 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:34:32,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1271082.0, ans=0.0 2023-06-22 16:34:36,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1271082.0, ans=0.125 2023-06-22 16:34:36,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1271082.0, ans=0.125 2023-06-22 16:35:03,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 3.986e+02 4.811e+02 7.732e+02 1.672e+03, threshold=9.622e+02, percent-clipped=22.0 2023-06-22 16:35:06,506 INFO [train.py:996] (3/4) Epoch 7, batch 28900, loss[loss=0.2844, simple_loss=0.3514, pruned_loss=0.1087, over 21831.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3257, pruned_loss=0.09411, over 4285905.47 frames. 
], batch size: 118, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:35:44,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1271262.0, ans=0.0 2023-06-22 16:35:47,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1271322.0, ans=0.0 2023-06-22 16:35:54,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1271322.0, ans=15.0 2023-06-22 16:36:17,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271382.0, ans=0.1 2023-06-22 16:36:30,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1271442.0, ans=0.0 2023-06-22 16:36:42,214 INFO [train.py:996] (3/4) Epoch 7, batch 28950, loss[loss=0.2431, simple_loss=0.3412, pruned_loss=0.0725, over 21760.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3254, pruned_loss=0.09236, over 4278354.07 frames. ], batch size: 351, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:36:53,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-22 16:37:43,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271622.0, ans=0.1 2023-06-22 16:37:43,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1271622.0, ans=0.125 2023-06-22 16:37:50,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1271682.0, ans=0.125 2023-06-22 16:38:11,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2023-06-22 16:38:23,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.694e+02 3.738e+02 4.662e+02 6.144e+02 1.231e+03, threshold=9.324e+02, percent-clipped=1.0 2023-06-22 16:38:31,536 INFO [train.py:996] (3/4) Epoch 7, batch 29000, loss[loss=0.2433, simple_loss=0.3517, pruned_loss=0.06747, over 21622.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3277, pruned_loss=0.09089, over 4274837.46 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:38:41,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1271802.0, ans=0.95 2023-06-22 16:38:41,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1271802.0, ans=0.125 2023-06-22 16:39:14,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1271922.0, ans=0.125 2023-06-22 16:39:25,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1271982.0, ans=0.2 2023-06-22 16:40:05,710 INFO [train.py:996] (3/4) Epoch 7, batch 29050, loss[loss=0.2718, simple_loss=0.3361, pruned_loss=0.1037, over 21950.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3263, pruned_loss=0.09088, over 4272092.76 frames. 
], batch size: 333, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:40:16,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272102.0, ans=0.1 2023-06-22 16:40:23,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1272162.0, ans=0.0 2023-06-22 16:40:25,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-22 16:40:58,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1272222.0, ans=0.125 2023-06-22 16:41:32,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1272342.0, ans=0.125 2023-06-22 16:41:38,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1272342.0, ans=0.0 2023-06-22 16:41:39,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.571e+02 4.314e+02 5.526e+02 9.643e+02, threshold=8.628e+02, percent-clipped=2.0 2023-06-22 16:41:42,552 INFO [train.py:996] (3/4) Epoch 7, batch 29100, loss[loss=0.2378, simple_loss=0.282, pruned_loss=0.09682, over 21320.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3177, pruned_loss=0.08862, over 4266054.65 frames. ], batch size: 144, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:41:55,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272402.0, ans=0.1 2023-06-22 16:42:02,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-22 16:42:57,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1272642.0, ans=0.035 2023-06-22 16:43:19,090 INFO [train.py:996] (3/4) Epoch 7, batch 29150, loss[loss=0.2219, simple_loss=0.2904, pruned_loss=0.07673, over 21302.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.317, pruned_loss=0.08741, over 4270082.42 frames. ], batch size: 144, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:44:05,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1272822.0, ans=0.125 2023-06-22 16:44:18,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1272882.0, ans=0.125 2023-06-22 16:44:29,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1272942.0, ans=0.125 2023-06-22 16:44:45,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1272942.0, ans=0.125 2023-06-22 16:44:52,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.751e+02 3.598e+02 4.101e+02 5.513e+02 1.275e+03, threshold=8.201e+02, percent-clipped=6.0 2023-06-22 16:44:55,699 INFO [train.py:996] (3/4) Epoch 7, batch 29200, loss[loss=0.2093, simple_loss=0.2707, pruned_loss=0.07392, over 21423.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3126, pruned_loss=0.08678, over 4262917.11 frames. 
], batch size: 194, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:44:57,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273002.0, ans=0.1 2023-06-22 16:45:15,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1273062.0, ans=0.0 2023-06-22 16:45:33,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1273062.0, ans=0.125 2023-06-22 16:45:54,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-22 16:46:06,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1273242.0, ans=0.0 2023-06-22 16:46:35,590 INFO [train.py:996] (3/4) Epoch 7, batch 29250, loss[loss=0.2222, simple_loss=0.2818, pruned_loss=0.08126, over 16674.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3123, pruned_loss=0.08486, over 4261006.10 frames. ], batch size: 61, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:46:40,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=12.0 2023-06-22 16:46:43,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-22 16:46:44,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1273302.0, ans=0.2 2023-06-22 16:47:20,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1273422.0, ans=0.125 2023-06-22 16:47:25,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-22 16:48:10,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 3.452e+02 4.218e+02 5.407e+02 1.140e+03, threshold=8.437e+02, percent-clipped=6.0 2023-06-22 16:48:13,948 INFO [train.py:996] (3/4) Epoch 7, batch 29300, loss[loss=0.2048, simple_loss=0.265, pruned_loss=0.07232, over 21777.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3123, pruned_loss=0.08327, over 4263637.80 frames. ], batch size: 124, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:48:28,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1273602.0, ans=0.2 2023-06-22 16:49:00,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.28 vs. limit=6.0 2023-06-22 16:49:05,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1273722.0, ans=0.0 2023-06-22 16:49:21,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-22 16:49:31,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1273782.0, ans=0.0 2023-06-22 16:49:52,298 INFO [train.py:996] (3/4) Epoch 7, batch 29350, loss[loss=0.2185, simple_loss=0.2771, pruned_loss=0.07989, over 20007.00 frames. 
], tot_loss[loss=0.2364, simple_loss=0.3081, pruned_loss=0.0824, over 4264544.82 frames. ], batch size: 702, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:50:00,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1273902.0, ans=10.0 2023-06-22 16:50:34,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1274022.0, ans=0.0 2023-06-22 16:50:45,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274022.0, ans=0.1 2023-06-22 16:50:46,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-22 16:51:09,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=22.5 2023-06-22 16:51:29,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.690e+02 4.735e+02 5.967e+02 1.066e+03, threshold=9.470e+02, percent-clipped=8.0 2023-06-22 16:51:30,497 INFO [train.py:996] (3/4) Epoch 7, batch 29400, loss[loss=0.1453, simple_loss=0.1988, pruned_loss=0.04593, over 21811.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3091, pruned_loss=0.08058, over 4274758.85 frames. ], batch size: 118, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:51:45,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1274202.0, ans=0.0 2023-06-22 16:51:45,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-22 16:51:53,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274262.0, ans=0.1 2023-06-22 16:52:12,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1274322.0, ans=0.0 2023-06-22 16:53:09,780 INFO [train.py:996] (3/4) Epoch 7, batch 29450, loss[loss=0.232, simple_loss=0.2946, pruned_loss=0.08468, over 21340.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3094, pruned_loss=0.07982, over 4274824.53 frames. ], batch size: 176, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:53:34,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274562.0, ans=0.1 2023-06-22 16:54:41,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 4.158e+02 5.429e+02 7.361e+02 1.574e+03, threshold=1.086e+03, percent-clipped=7.0 2023-06-22 16:54:43,558 INFO [train.py:996] (3/4) Epoch 7, batch 29500, loss[loss=0.2588, simple_loss=0.3309, pruned_loss=0.09336, over 20662.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3138, pruned_loss=0.08392, over 4283220.82 frames. ], batch size: 607, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:55:26,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. 
limit=15.0 2023-06-22 16:55:43,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1274982.0, ans=0.0 2023-06-22 16:56:21,906 INFO [train.py:996] (3/4) Epoch 7, batch 29550, loss[loss=0.2183, simple_loss=0.2858, pruned_loss=0.07542, over 21629.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3126, pruned_loss=0.08553, over 4289023.04 frames. ], batch size: 263, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:57:03,118 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:57:39,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1275342.0, ans=0.04949747468305833 2023-06-22 16:57:59,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.724e+02 4.638e+02 7.025e+02 1.809e+03, threshold=9.276e+02, percent-clipped=5.0 2023-06-22 16:58:01,070 INFO [train.py:996] (3/4) Epoch 7, batch 29600, loss[loss=0.2668, simple_loss=0.3515, pruned_loss=0.09111, over 21835.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3177, pruned_loss=0.08769, over 4291084.29 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:58:55,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1275522.0, ans=0.125 2023-06-22 16:59:03,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1275582.0, ans=0.0 2023-06-22 16:59:37,930 INFO [train.py:996] (3/4) Epoch 7, batch 29650, loss[loss=0.215, simple_loss=0.3167, pruned_loss=0.05665, over 20799.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3147, pruned_loss=0.08388, over 4282307.01 frames. ], batch size: 608, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 17:00:08,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1275762.0, ans=0.0 2023-06-22 17:01:01,239 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:01:16,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.826e+02 4.944e+02 6.202e+02 1.000e+03, threshold=9.888e+02, percent-clipped=1.0 2023-06-22 17:01:16,754 INFO [train.py:996] (3/4) Epoch 7, batch 29700, loss[loss=0.2709, simple_loss=0.3366, pruned_loss=0.1026, over 21737.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3179, pruned_loss=0.08516, over 4290575.85 frames. ], batch size: 441, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:01:30,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-22 17:01:31,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1276062.0, ans=0.0 2023-06-22 17:01:31,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1276062.0, ans=0.0 2023-06-22 17:01:41,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1276062.0, ans=0.125 2023-06-22 17:02:23,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. 
limit=15.0 2023-06-22 17:02:54,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1276302.0, ans=0.125 2023-06-22 17:02:55,334 INFO [train.py:996] (3/4) Epoch 7, batch 29750, loss[loss=0.2014, simple_loss=0.289, pruned_loss=0.05689, over 21499.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3235, pruned_loss=0.08506, over 4290996.25 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:03:26,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1276362.0, ans=0.125 2023-06-22 17:04:22,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0 2023-06-22 17:04:32,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.419e+02 3.988e+02 5.118e+02 1.049e+03, threshold=7.976e+02, percent-clipped=2.0 2023-06-22 17:04:32,175 INFO [train.py:996] (3/4) Epoch 7, batch 29800, loss[loss=0.2278, simple_loss=0.2971, pruned_loss=0.07925, over 21861.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3252, pruned_loss=0.08605, over 4295383.64 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:04:35,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1276602.0, ans=0.0 2023-06-22 17:05:18,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1276722.0, ans=0.0 2023-06-22 17:06:06,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276842.0, ans=0.1 2023-06-22 17:06:10,672 INFO [train.py:996] (3/4) Epoch 7, batch 29850, loss[loss=0.236, simple_loss=0.3141, pruned_loss=0.07898, over 21487.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3204, pruned_loss=0.08379, over 4290201.70 frames. ], batch size: 131, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:14,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-22 17:07:31,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1277142.0, ans=0.0 2023-06-22 17:07:43,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1277142.0, ans=0.125 2023-06-22 17:07:48,854 INFO [train.py:996] (3/4) Epoch 7, batch 29900, loss[loss=0.2951, simple_loss=0.4056, pruned_loss=0.09226, over 19801.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3197, pruned_loss=0.08555, over 4292971.37 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:50,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 3.325e+02 3.983e+02 5.006e+02 1.426e+03, threshold=7.966e+02, percent-clipped=5.0 2023-06-22 17:08:23,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1277262.0, ans=0.0 2023-06-22 17:08:27,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. 
limit=15.0 2023-06-22 17:08:42,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277322.0, ans=0.1 2023-06-22 17:08:58,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1277382.0, ans=0.05 2023-06-22 17:08:59,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1277382.0, ans=0.125 2023-06-22 17:09:14,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277442.0, ans=0.1 2023-06-22 17:09:14,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-22 17:09:33,571 INFO [train.py:996] (3/4) Epoch 7, batch 29950, loss[loss=0.2951, simple_loss=0.3681, pruned_loss=0.1111, over 21820.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3243, pruned_loss=0.08967, over 4295453.05 frames. ], batch size: 124, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:09:51,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1277502.0, ans=0.2 2023-06-22 17:09:57,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=12.0 2023-06-22 17:09:58,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-22 17:10:49,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1277682.0, ans=0.125 2023-06-22 17:10:54,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1277742.0, ans=0.0 2023-06-22 17:10:56,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-22 17:11:13,177 INFO [train.py:996] (3/4) Epoch 7, batch 30000, loss[loss=0.2134, simple_loss=0.3037, pruned_loss=0.06157, over 21717.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3256, pruned_loss=0.08938, over 4293154.12 frames. ], batch size: 247, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:11:13,177 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 17:11:34,232 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3461, pruned_loss=0.0743, over 1796401.00 frames. 2023-06-22 17:11:34,233 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 17:11:36,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.810e+02 4.424e+02 5.666e+02 1.321e+03, threshold=8.847e+02, percent-clipped=8.0 2023-06-22 17:12:32,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1277982.0, ans=0.125 2023-06-22 17:12:59,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
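As at batch 27000 earlier, training pauses at batch 30000 above to compute a validation loss over the same fixed 1796401-frame dev set and to report peak CUDA memory, i.e. validation runs on a fixed batch interval (every 3000 batches in this log). A skeleton of that interleaving; the compute-loss details and names here are placeholders, not icefall's API:

    import torch

    def maybe_validate(batch_idx, model, valid_loader, loss_fn, interval=3000):
        # Run validation every `interval` batches, as this log does at
        # batches 27000 and 30000.  `loss_fn` is a placeholder returning
        # (loss, num_frames) for one batch.
        if batch_idx == 0 or batch_idx % interval != 0:
            return
        model.eval()
        tot, frames = 0.0, 0
        with torch.no_grad():
            for batch in valid_loader:
                loss, n = loss_fn(model, batch)
                tot += float(loss) * n
                frames += n
        model.train()
        print(f"validation: loss={tot / frames:.4f}, over {frames} frames.")
        if torch.cuda.is_available():
            mb = torch.cuda.max_memory_allocated() // 2**20
            print(f"Maximum memory allocated so far is {mb}MB")

    # Tiny self-contained demo with a dummy model and loss:
    model = torch.nn.Linear(4, 4)
    loader = [torch.randn(8, 4) for _ in range(3)]
    maybe_validate(30000, model, loader,
                   lambda m, b: ((m(b) ** 2).mean(), b.numel()))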
limit=12.0 2023-06-22 17:13:10,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1278042.0, ans=0.2 2023-06-22 17:13:26,173 INFO [train.py:996] (3/4) Epoch 7, batch 30050, loss[loss=0.279, simple_loss=0.4037, pruned_loss=0.07717, over 21155.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3278, pruned_loss=0.08611, over 4281073.06 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:13:41,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1278162.0, ans=0.0 2023-06-22 17:13:50,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278162.0, ans=0.1 2023-06-22 17:15:03,300 INFO [train.py:996] (3/4) Epoch 7, batch 30100, loss[loss=0.2098, simple_loss=0.2701, pruned_loss=0.07474, over 21823.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3252, pruned_loss=0.08465, over 4270952.01 frames. ], batch size: 118, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:15:04,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.663e+02 4.877e+02 6.196e+02 1.469e+03, threshold=9.754e+02, percent-clipped=9.0 2023-06-22 17:15:34,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1278522.0, ans=0.0 2023-06-22 17:16:41,779 INFO [train.py:996] (3/4) Epoch 7, batch 30150, loss[loss=0.2545, simple_loss=0.3229, pruned_loss=0.09304, over 21549.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3211, pruned_loss=0.08625, over 4264228.10 frames. ], batch size: 389, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:17:09,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1278762.0, ans=0.125 2023-06-22 17:18:21,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1278942.0, ans=0.125 2023-06-22 17:18:24,037 INFO [train.py:996] (3/4) Epoch 7, batch 30200, loss[loss=0.2373, simple_loss=0.3182, pruned_loss=0.07818, over 21719.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3237, pruned_loss=0.08439, over 4271803.99 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:18:24,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1279002.0, ans=0.0 2023-06-22 17:18:25,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.55 vs. 
limit=10.0 2023-06-22 17:18:25,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.500e+02 4.327e+02 6.195e+02 1.104e+03, threshold=8.654e+02, percent-clipped=5.0 2023-06-22 17:19:11,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279122.0, ans=0.1 2023-06-22 17:19:11,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1279122.0, ans=0.1 2023-06-22 17:19:18,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1279122.0, ans=0.0 2023-06-22 17:19:28,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-22 17:19:48,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1279242.0, ans=0.125 2023-06-22 17:20:09,051 INFO [train.py:996] (3/4) Epoch 7, batch 30250, loss[loss=0.3358, simple_loss=0.4397, pruned_loss=0.116, over 21330.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3319, pruned_loss=0.08753, over 4273276.48 frames. ], batch size: 549, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:20:22,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1279302.0, ans=0.0 2023-06-22 17:20:51,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-22 17:21:06,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1279422.0, ans=0.125 2023-06-22 17:21:43,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279542.0, ans=0.1 2023-06-22 17:21:48,211 INFO [train.py:996] (3/4) Epoch 7, batch 30300, loss[loss=0.2141, simple_loss=0.2791, pruned_loss=0.07453, over 21516.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3282, pruned_loss=0.08718, over 4276759.60 frames. ], batch size: 414, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:21:49,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 4.197e+02 5.234e+02 7.319e+02 1.495e+03, threshold=1.047e+03, percent-clipped=13.0 2023-06-22 17:21:58,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1279602.0, ans=0.125 2023-06-22 17:22:03,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1279602.0, ans=0.125 2023-06-22 17:22:19,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-22 17:22:46,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1279722.0, ans=0.125 2023-06-22 17:23:33,741 INFO [train.py:996] (3/4) Epoch 7, batch 30350, loss[loss=0.2489, simple_loss=0.3334, pruned_loss=0.08222, over 21796.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.329, pruned_loss=0.08825, over 4281293.22 frames. 
], batch size: 333, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:24:02,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1279962.0, ans=0.2 2023-06-22 17:24:03,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-22 17:24:20,767 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:24:28,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1280082.0, ans=0.0 2023-06-22 17:24:32,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1280082.0, ans=0.125 2023-06-22 17:24:38,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-22 17:24:42,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1280142.0, ans=0.02 2023-06-22 17:24:46,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1280142.0, ans=0.0 2023-06-22 17:24:56,704 INFO [train.py:996] (3/4) Epoch 7, batch 30400, loss[loss=0.2202, simple_loss=0.2721, pruned_loss=0.08418, over 20348.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3231, pruned_loss=0.08701, over 4267055.27 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:24:58,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.001e+02 6.030e+02 8.810e+02 1.556e+03, threshold=1.206e+03, percent-clipped=18.0 2023-06-22 17:25:22,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-22 17:26:21,152 INFO [train.py:996] (3/4) Epoch 7, batch 30450, loss[loss=0.3101, simple_loss=0.4287, pruned_loss=0.0957, over 19755.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.325, pruned_loss=0.08678, over 4206006.44 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:26:48,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1280562.0, ans=0.125 2023-06-22 17:26:59,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1280622.0, ans=0.0 2023-06-22 17:27:13,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1280682.0, ans=0.025 2023-06-22 17:27:17,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280682.0, ans=0.1 2023-06-22 17:27:24,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1280742.0, ans=0.125 2023-06-22 17:27:25,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280742.0, ans=0.1 2023-06-22 17:29:06,157 INFO [train.py:996] (3/4) Epoch 8, batch 0, loss[loss=0.2194, simple_loss=0.2857, pruned_loss=0.07655, over 21228.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.2857, pruned_loss=0.07655, over 21228.00 frames. ], batch size: 160, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:29:06,157 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 17:29:21,698 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2437, simple_loss=0.3524, pruned_loss=0.06749, over 1796401.00 frames. 2023-06-22 17:29:21,699 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 17:29:27,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1280772.0, ans=0.125 2023-06-22 17:29:30,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.318e+02 7.236e+02 1.078e+03 1.767e+03 4.535e+03, threshold=2.157e+03, percent-clipped=44.0 2023-06-22 17:29:37,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1280832.0, ans=0.04949747468305833 2023-06-22 17:29:55,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-22 17:30:04,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-22 17:31:00,093 INFO [train.py:996] (3/4) Epoch 8, batch 50, loss[loss=0.2734, simple_loss=0.358, pruned_loss=0.09435, over 21773.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3302, pruned_loss=0.08788, over 963420.81 frames. ], batch size: 282, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:31:05,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-22 17:31:08,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1281072.0, ans=0.1 2023-06-22 17:31:13,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1281072.0, ans=0.125 2023-06-22 17:31:51,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1281192.0, ans=0.0 2023-06-22 17:32:07,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1281252.0, ans=0.125 2023-06-22 17:32:29,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1281312.0, ans=0.07 2023-06-22 17:32:33,419 INFO [train.py:996] (3/4) Epoch 8, batch 100, loss[loss=0.1808, simple_loss=0.2507, pruned_loss=0.05549, over 21768.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3401, pruned_loss=0.09016, over 1695893.79 frames. ], batch size: 102, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:32:44,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.771e+02 3.609e+02 4.818e+02 6.662e+02 2.202e+03, threshold=9.637e+02, percent-clipped=1.0 2023-06-22 17:32:48,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. 
limit=10.0 2023-06-22 17:32:54,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1281432.0, ans=0.125 2023-06-22 17:34:06,751 INFO [train.py:996] (3/4) Epoch 8, batch 150, loss[loss=0.292, simple_loss=0.3768, pruned_loss=0.1036, over 21491.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3407, pruned_loss=0.08844, over 2248439.11 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:34:15,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1281672.0, ans=0.125 2023-06-22 17:34:58,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-22 17:35:27,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1281912.0, ans=0.0 2023-06-22 17:35:39,241 INFO [train.py:996] (3/4) Epoch 8, batch 200, loss[loss=0.2149, simple_loss=0.2756, pruned_loss=0.07712, over 21895.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3377, pruned_loss=0.0874, over 2701922.78 frames. ], batch size: 107, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:35:49,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.926e+02 4.076e+02 5.203e+02 6.716e+02 1.490e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-22 17:35:59,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1282032.0, ans=0.07 2023-06-22 17:36:49,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1282152.0, ans=0.5 2023-06-22 17:37:18,873 INFO [train.py:996] (3/4) Epoch 8, batch 250, loss[loss=0.247, simple_loss=0.3078, pruned_loss=0.09311, over 21800.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3308, pruned_loss=0.08577, over 3024627.68 frames. ], batch size: 298, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:37:33,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1282332.0, ans=0.1 2023-06-22 17:38:27,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-22 17:38:34,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1282512.0, ans=0.125 2023-06-22 17:38:50,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1282512.0, ans=0.2 2023-06-22 17:38:54,763 INFO [train.py:996] (3/4) Epoch 8, batch 300, loss[loss=0.2571, simple_loss=0.3624, pruned_loss=0.07592, over 19747.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.326, pruned_loss=0.08537, over 3291201.94 frames. 
], batch size: 703, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:39:06,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 4.058e+02 5.447e+02 7.307e+02 1.512e+03, threshold=1.089e+03, percent-clipped=7.0 2023-06-22 17:39:14,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1282632.0, ans=0.125 2023-06-22 17:40:35,695 INFO [train.py:996] (3/4) Epoch 8, batch 350, loss[loss=0.2347, simple_loss=0.2938, pruned_loss=0.08784, over 21171.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.319, pruned_loss=0.08398, over 3516092.43 frames. ], batch size: 608, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:40:54,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-22 17:42:14,804 INFO [train.py:996] (3/4) Epoch 8, batch 400, loss[loss=0.1946, simple_loss=0.2753, pruned_loss=0.05694, over 21657.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3128, pruned_loss=0.08318, over 3684101.39 frames. ], batch size: 247, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:42:21,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1283172.0, ans=0.125 2023-06-22 17:42:25,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.922e+02 3.737e+02 4.920e+02 6.486e+02 1.177e+03, threshold=9.840e+02, percent-clipped=3.0 2023-06-22 17:43:09,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1283292.0, ans=0.0 2023-06-22 17:43:16,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-22 17:43:22,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1283352.0, ans=0.2 2023-06-22 17:43:38,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1283412.0, ans=0.125 2023-06-22 17:43:54,179 INFO [train.py:996] (3/4) Epoch 8, batch 450, loss[loss=0.2828, simple_loss=0.3711, pruned_loss=0.09723, over 21226.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.31, pruned_loss=0.08168, over 3817080.37 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:44:02,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1283472.0, ans=0.0 2023-06-22 17:45:29,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1283712.0, ans=0.2 2023-06-22 17:45:32,278 INFO [train.py:996] (3/4) Epoch 8, batch 500, loss[loss=0.2756, simple_loss=0.3965, pruned_loss=0.07731, over 21249.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3127, pruned_loss=0.08061, over 3921266.67 frames. 
], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:59,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.923e+02 5.554e+02 7.720e+02 1.831e+03, threshold=1.111e+03, percent-clipped=13.0 2023-06-22 17:46:17,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1283892.0, ans=0.125 2023-06-22 17:46:17,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-22 17:46:37,141 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:46:49,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283952.0, ans=0.125 2023-06-22 17:47:05,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1284012.0, ans=0.125 2023-06-22 17:47:15,026 INFO [train.py:996] (3/4) Epoch 8, batch 550, loss[loss=0.4005, simple_loss=0.4789, pruned_loss=0.161, over 21453.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.316, pruned_loss=0.07995, over 4004398.41 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:47:17,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-22 17:47:29,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1284072.0, ans=0.0 2023-06-22 17:47:56,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1284192.0, ans=0.125 2023-06-22 17:48:01,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1284192.0, ans=0.125 2023-06-22 17:48:11,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-22 17:48:46,260 INFO [train.py:996] (3/4) Epoch 8, batch 600, loss[loss=0.2507, simple_loss=0.308, pruned_loss=0.09668, over 22014.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3187, pruned_loss=0.08082, over 4059732.08 frames. ], batch size: 103, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:49:01,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1284372.0, ans=0.0 2023-06-22 17:49:08,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.844e+02 4.934e+02 7.871e+02 2.167e+03, threshold=9.868e+02, percent-clipped=19.0 2023-06-22 17:49:21,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1284432.0, ans=0.125 2023-06-22 17:49:52,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1284552.0, ans=0.2 2023-06-22 17:50:23,340 INFO [train.py:996] (3/4) Epoch 8, batch 650, loss[loss=0.2383, simple_loss=0.3359, pruned_loss=0.07037, over 21401.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3193, pruned_loss=0.08173, over 4103806.45 frames. 
], batch size: 211, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:50:52,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1284732.0, ans=22.5 2023-06-22 17:51:09,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1284792.0, ans=0.125 2023-06-22 17:51:48,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1284912.0, ans=0.015 2023-06-22 17:51:50,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1284912.0, ans=0.125 2023-06-22 17:51:56,364 INFO [train.py:996] (3/4) Epoch 8, batch 700, loss[loss=0.2777, simple_loss=0.338, pruned_loss=0.1087, over 21321.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3215, pruned_loss=0.08294, over 4146615.31 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:52:20,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.941e+02 4.186e+02 5.348e+02 7.319e+02 1.415e+03, threshold=1.070e+03, percent-clipped=6.0 2023-06-22 17:52:52,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1285152.0, ans=0.2 2023-06-22 17:53:29,841 INFO [train.py:996] (3/4) Epoch 8, batch 750, loss[loss=0.2875, simple_loss=0.4176, pruned_loss=0.07874, over 19794.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3222, pruned_loss=0.08375, over 4172780.41 frames. ], batch size: 702, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:53:34,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1285272.0, ans=0.125 2023-06-22 17:54:08,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1285332.0, ans=0.125 2023-06-22 17:54:20,588 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:54:33,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1285452.0, ans=0.0 2023-06-22 17:55:07,697 INFO [train.py:996] (3/4) Epoch 8, batch 800, loss[loss=0.2516, simple_loss=0.2983, pruned_loss=0.1024, over 21496.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3198, pruned_loss=0.08435, over 4200985.47 frames. ], batch size: 508, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:55:35,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 4.052e+02 4.658e+02 6.687e+02 1.387e+03, threshold=9.317e+02, percent-clipped=3.0 2023-06-22 17:55:47,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1285632.0, ans=0.0 2023-06-22 17:55:50,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1285632.0, ans=0.035 2023-06-22 17:55:50,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1285632.0, ans=0.125 2023-06-22 17:56:04,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. 
limit=12.0 2023-06-22 17:56:54,346 INFO [train.py:996] (3/4) Epoch 8, batch 850, loss[loss=0.2097, simple_loss=0.279, pruned_loss=0.07019, over 21822.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3164, pruned_loss=0.08415, over 4217938.29 frames. ], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:57:27,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285932.0, ans=0.1 2023-06-22 17:57:40,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1285992.0, ans=0.0 2023-06-22 17:58:33,076 INFO [train.py:996] (3/4) Epoch 8, batch 900, loss[loss=0.2256, simple_loss=0.2924, pruned_loss=0.07939, over 21324.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.313, pruned_loss=0.08243, over 4232835.32 frames. ], batch size: 159, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:58:35,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1286172.0, ans=0.0 2023-06-22 17:58:43,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1286172.0, ans=0.125 2023-06-22 17:58:43,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286172.0, ans=0.1 2023-06-22 17:58:47,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 3.949e+02 5.086e+02 6.787e+02 1.769e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-22 17:58:59,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1286232.0, ans=0.125 2023-06-22 18:00:05,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286412.0, ans=0.1 2023-06-22 18:00:12,904 INFO [train.py:996] (3/4) Epoch 8, batch 950, loss[loss=0.2951, simple_loss=0.3606, pruned_loss=0.1148, over 21807.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3116, pruned_loss=0.0824, over 4248153.07 frames. ], batch size: 414, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:00:19,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1286472.0, ans=0.125 2023-06-22 18:00:26,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1286472.0, ans=0.125 2023-06-22 18:01:37,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1286712.0, ans=0.05 2023-06-22 18:01:43,592 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:01:51,045 INFO [train.py:996] (3/4) Epoch 8, batch 1000, loss[loss=0.2315, simple_loss=0.294, pruned_loss=0.08454, over 21648.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3125, pruned_loss=0.08364, over 4263647.06 frames. 
], batch size: 263, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:02:05,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 3.565e+02 4.375e+02 6.228e+02 1.305e+03, threshold=8.750e+02, percent-clipped=2.0 2023-06-22 18:03:10,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287012.0, ans=0.125 2023-06-22 18:03:32,508 INFO [train.py:996] (3/4) Epoch 8, batch 1050, loss[loss=0.2332, simple_loss=0.2977, pruned_loss=0.08436, over 21278.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3104, pruned_loss=0.08253, over 4272135.29 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:03:53,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287132.0, ans=0.1 2023-06-22 18:03:56,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1287132.0, ans=0.1 2023-06-22 18:04:24,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1287252.0, ans=0.125 2023-06-22 18:04:32,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-22 18:05:07,910 INFO [train.py:996] (3/4) Epoch 8, batch 1100, loss[loss=0.1853, simple_loss=0.2669, pruned_loss=0.05182, over 21193.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3123, pruned_loss=0.08298, over 4281022.69 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:05:21,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.476e+02 5.815e+02 7.362e+02 1.371e+03, threshold=1.163e+03, percent-clipped=15.0 2023-06-22 18:05:29,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1287432.0, ans=0.2 2023-06-22 18:05:40,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-22 18:05:48,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1287492.0, ans=0.125 2023-06-22 18:05:55,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5 2023-06-22 18:06:48,154 INFO [train.py:996] (3/4) Epoch 8, batch 1150, loss[loss=0.2393, simple_loss=0.3129, pruned_loss=0.08282, over 21488.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.312, pruned_loss=0.08362, over 4281523.56 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:06:52,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.69 vs. 
limit=15.0 2023-06-22 18:07:42,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1287792.0, ans=0.125 2023-06-22 18:07:50,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1287852.0, ans=0.0 2023-06-22 18:08:11,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1287912.0, ans=0.125 2023-06-22 18:08:24,607 INFO [train.py:996] (3/4) Epoch 8, batch 1200, loss[loss=0.2258, simple_loss=0.2797, pruned_loss=0.08589, over 20370.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3138, pruned_loss=0.0834, over 4280605.92 frames. ], batch size: 703, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:08:43,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.897e+02 4.987e+02 7.014e+02 1.089e+03, threshold=9.974e+02, percent-clipped=0.0 2023-06-22 18:08:46,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1288032.0, ans=0.0 2023-06-22 18:08:54,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1288032.0, ans=0.1 2023-06-22 18:08:56,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1288032.0, ans=0.125 2023-06-22 18:09:38,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1288152.0, ans=0.1 2023-06-22 18:10:03,634 INFO [train.py:996] (3/4) Epoch 8, batch 1250, loss[loss=0.2156, simple_loss=0.2975, pruned_loss=0.06688, over 21127.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3149, pruned_loss=0.08405, over 4281964.60 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:10:04,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1288272.0, ans=0.0 2023-06-22 18:10:05,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1288272.0, ans=0.125 2023-06-22 18:10:26,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1288332.0, ans=0.0 2023-06-22 18:10:35,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-22 18:10:51,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1288392.0, ans=0.2 2023-06-22 18:11:24,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1288452.0, ans=0.0 2023-06-22 18:11:34,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1288512.0, ans=0.0 2023-06-22 18:11:37,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. 
limit=15.0 2023-06-22 18:11:41,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288512.0, ans=0.125 2023-06-22 18:11:44,062 INFO [train.py:996] (3/4) Epoch 8, batch 1300, loss[loss=0.2833, simple_loss=0.3442, pruned_loss=0.1112, over 21707.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3154, pruned_loss=0.08362, over 4283920.15 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:11:46,130 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:11:49,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1288572.0, ans=0.2 2023-06-22 18:11:50,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1288572.0, ans=0.2 2023-06-22 18:12:04,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.208e+02 5.615e+02 7.044e+02 1.517e+03, threshold=1.123e+03, percent-clipped=9.0 2023-06-22 18:12:07,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-22 18:12:31,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288692.0, ans=0.1 2023-06-22 18:12:34,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1288692.0, ans=0.125 2023-06-22 18:13:07,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1288812.0, ans=0.125 2023-06-22 18:13:24,638 INFO [train.py:996] (3/4) Epoch 8, batch 1350, loss[loss=0.281, simple_loss=0.3565, pruned_loss=0.1028, over 21615.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3177, pruned_loss=0.0851, over 4286691.19 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:14:00,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1288932.0, ans=0.125 2023-06-22 18:14:00,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.99 vs. limit=10.0 2023-06-22 18:14:01,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1288932.0, ans=0.0 2023-06-22 18:14:21,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1288992.0, ans=0.125 2023-06-22 18:14:48,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1289112.0, ans=0.125 2023-06-22 18:14:59,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289112.0, ans=0.1 2023-06-22 18:15:05,851 INFO [train.py:996] (3/4) Epoch 8, batch 1400, loss[loss=0.2617, simple_loss=0.3258, pruned_loss=0.09877, over 21518.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3169, pruned_loss=0.08508, over 4288292.38 frames. 
], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:15:26,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.742e+02 3.782e+02 4.959e+02 6.793e+02 1.586e+03, threshold=9.917e+02, percent-clipped=6.0 2023-06-22 18:15:47,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1289292.0, ans=0.125 2023-06-22 18:16:19,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1289352.0, ans=0.125 2023-06-22 18:16:22,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1289352.0, ans=0.025 2023-06-22 18:16:30,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1289412.0, ans=0.0 2023-06-22 18:16:40,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-22 18:16:43,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1289412.0, ans=0.0 2023-06-22 18:16:45,936 INFO [train.py:996] (3/4) Epoch 8, batch 1450, loss[loss=0.2145, simple_loss=0.2761, pruned_loss=0.07648, over 21682.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3163, pruned_loss=0.08485, over 4290226.47 frames. ], batch size: 247, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:16:49,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1289472.0, ans=0.125 2023-06-22 18:17:05,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1289532.0, ans=0.0 2023-06-22 18:17:15,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-22 18:17:57,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1289652.0, ans=0.1 2023-06-22 18:18:19,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1289712.0, ans=0.125 2023-06-22 18:18:25,449 INFO [train.py:996] (3/4) Epoch 8, batch 1500, loss[loss=0.2553, simple_loss=0.3103, pruned_loss=0.1001, over 21627.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.318, pruned_loss=0.08579, over 4289451.37 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:18:46,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 3.712e+02 4.836e+02 6.899e+02 1.421e+03, threshold=9.672e+02, percent-clipped=7.0 2023-06-22 18:19:41,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-22 18:19:51,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1290012.0, ans=0.0 2023-06-22 18:19:54,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1290012.0, ans=0.125 2023-06-22 18:19:54,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1290012.0, ans=0.125 2023-06-22 18:20:07,040 INFO [train.py:996] (3/4) Epoch 8, batch 1550, loss[loss=0.2421, simple_loss=0.3398, pruned_loss=0.07226, over 20912.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3141, pruned_loss=0.08312, over 4285882.64 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:20:31,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1290132.0, ans=0.2 2023-06-22 18:20:40,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1290132.0, ans=0.125 2023-06-22 18:21:40,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1290312.0, ans=0.05 2023-06-22 18:21:45,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-22 18:21:48,294 INFO [train.py:996] (3/4) Epoch 8, batch 1600, loss[loss=0.2599, simple_loss=0.366, pruned_loss=0.07695, over 21285.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3134, pruned_loss=0.08288, over 4285542.69 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:21:53,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1290372.0, ans=0.0 2023-06-22 18:22:07,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 18:22:10,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1290432.0, ans=0.125 2023-06-22 18:22:16,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.959e+02 3.911e+02 5.598e+02 7.259e+02 1.641e+03, threshold=1.120e+03, percent-clipped=8.0 2023-06-22 18:22:54,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-22 18:23:36,623 INFO [train.py:996] (3/4) Epoch 8, batch 1650, loss[loss=0.275, simple_loss=0.3425, pruned_loss=0.1037, over 21801.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.314, pruned_loss=0.08328, over 4280977.02 frames. ], batch size: 124, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:24:34,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-22 18:24:40,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-22 18:25:17,557 INFO [train.py:996] (3/4) Epoch 8, batch 1700, loss[loss=0.2422, simple_loss=0.306, pruned_loss=0.08922, over 21853.00 frames. 
], tot_loss[loss=0.2437, simple_loss=0.3177, pruned_loss=0.08487, over 4283644.63 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:25:21,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1290972.0, ans=0.125 2023-06-22 18:25:45,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 3.944e+02 4.852e+02 6.481e+02 1.409e+03, threshold=9.704e+02, percent-clipped=2.0 2023-06-22 18:26:26,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.02 vs. limit=10.0 2023-06-22 18:27:04,026 INFO [train.py:996] (3/4) Epoch 8, batch 1750, loss[loss=0.1826, simple_loss=0.2665, pruned_loss=0.0493, over 21570.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3142, pruned_loss=0.08221, over 4274990.49 frames. ], batch size: 230, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:27:16,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1291272.0, ans=0.2 2023-06-22 18:27:46,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1291392.0, ans=0.0 2023-06-22 18:28:45,995 INFO [train.py:996] (3/4) Epoch 8, batch 1800, loss[loss=0.1916, simple_loss=0.2558, pruned_loss=0.06369, over 21352.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3129, pruned_loss=0.08036, over 4270683.45 frames. ], batch size: 211, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:29:04,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 3.886e+02 4.989e+02 8.763e+02 2.376e+03, threshold=9.977e+02, percent-clipped=20.0 2023-06-22 18:29:06,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1291632.0, ans=0.125 2023-06-22 18:29:17,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 18:29:22,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1291692.0, ans=0.1 2023-06-22 18:29:30,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1291692.0, ans=0.125 2023-06-22 18:29:32,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1291692.0, ans=0.125 2023-06-22 18:30:26,406 INFO [train.py:996] (3/4) Epoch 8, batch 1850, loss[loss=0.2572, simple_loss=0.3484, pruned_loss=0.08305, over 21368.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3134, pruned_loss=0.07955, over 4272727.91 frames. ], batch size: 549, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:30:42,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1291932.0, ans=0.125 2023-06-22 18:32:06,596 INFO [train.py:996] (3/4) Epoch 8, batch 1900, loss[loss=0.2689, simple_loss=0.33, pruned_loss=0.1039, over 21739.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3144, pruned_loss=0.08008, over 4270770.23 frames. 
], batch size: 389, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:32:13,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1292172.0, ans=0.125 2023-06-22 18:32:25,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.836e+02 4.981e+02 6.397e+02 1.530e+03, threshold=9.962e+02, percent-clipped=6.0 2023-06-22 18:32:59,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1292292.0, ans=0.125 2023-06-22 18:32:59,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1292292.0, ans=0.125 2023-06-22 18:33:28,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-22 18:33:31,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1292352.0, ans=0.125 2023-06-22 18:33:47,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0 2023-06-22 18:33:49,920 INFO [train.py:996] (3/4) Epoch 8, batch 1950, loss[loss=0.2399, simple_loss=0.307, pruned_loss=0.08638, over 21874.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3115, pruned_loss=0.0802, over 4275016.02 frames. ], batch size: 373, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:33:51,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1292472.0, ans=0.0 2023-06-22 18:34:00,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1292472.0, ans=0.07 2023-06-22 18:34:19,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1292532.0, ans=0.125 2023-06-22 18:34:39,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1292592.0, ans=0.2 2023-06-22 18:35:04,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-22 18:35:11,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1292652.0, ans=0.125 2023-06-22 18:35:31,188 INFO [train.py:996] (3/4) Epoch 8, batch 2000, loss[loss=0.2022, simple_loss=0.2724, pruned_loss=0.06599, over 21844.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3082, pruned_loss=0.07859, over 4281532.06 frames. ], batch size: 118, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:35:32,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. 
limit=15.0 2023-06-22 18:35:54,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 4.321e+02 6.258e+02 9.587e+02 1.701e+03, threshold=1.252e+03, percent-clipped=22.0 2023-06-22 18:36:21,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1292892.0, ans=0.5 2023-06-22 18:36:44,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1292952.0, ans=0.125 2023-06-22 18:37:13,648 INFO [train.py:996] (3/4) Epoch 8, batch 2050, loss[loss=0.2306, simple_loss=0.3162, pruned_loss=0.07254, over 21632.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.313, pruned_loss=0.08, over 4289645.51 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:38:47,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293312.0, ans=0.1 2023-06-22 18:38:53,102 INFO [train.py:996] (3/4) Epoch 8, batch 2100, loss[loss=0.2635, simple_loss=0.3376, pruned_loss=0.09471, over 21877.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3157, pruned_loss=0.08146, over 4282886.31 frames. ], batch size: 316, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:39:17,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 4.056e+02 5.317e+02 7.512e+02 1.644e+03, threshold=1.063e+03, percent-clipped=5.0 2023-06-22 18:39:57,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1293492.0, ans=0.125 2023-06-22 18:40:32,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1293672.0, ans=0.2 2023-06-22 18:40:33,831 INFO [train.py:996] (3/4) Epoch 8, batch 2150, loss[loss=0.2127, simple_loss=0.2844, pruned_loss=0.07048, over 21723.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3125, pruned_loss=0.08149, over 4278112.00 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:40:37,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1293672.0, ans=0.0 2023-06-22 18:41:19,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293792.0, ans=0.1 2023-06-22 18:41:22,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1293792.0, ans=10.0 2023-06-22 18:41:39,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1293852.0, ans=0.2 2023-06-22 18:41:54,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1293912.0, ans=0.2 2023-06-22 18:42:05,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.19 vs. 
limit=12.0 2023-06-22 18:42:06,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293912.0, ans=0.1 2023-06-22 18:42:07,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1293912.0, ans=0.125 2023-06-22 18:42:10,349 INFO [train.py:996] (3/4) Epoch 8, batch 2200, loss[loss=0.2378, simple_loss=0.3262, pruned_loss=0.07469, over 21645.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3133, pruned_loss=0.08133, over 4267593.69 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:42:20,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293972.0, ans=0.125 2023-06-22 18:42:23,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1293972.0, ans=0.2 2023-06-22 18:42:33,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.009e+02 3.898e+02 4.994e+02 6.578e+02 1.550e+03, threshold=9.987e+02, percent-clipped=10.0 2023-06-22 18:43:37,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1294212.0, ans=0.125 2023-06-22 18:43:50,425 INFO [train.py:996] (3/4) Epoch 8, batch 2250, loss[loss=0.2282, simple_loss=0.3012, pruned_loss=0.07757, over 21613.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3096, pruned_loss=0.08039, over 4272046.93 frames. ], batch size: 442, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:45:04,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1294452.0, ans=0.07 2023-06-22 18:45:29,764 INFO [train.py:996] (3/4) Epoch 8, batch 2300, loss[loss=0.2406, simple_loss=0.296, pruned_loss=0.09258, over 21601.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3054, pruned_loss=0.07937, over 4274552.05 frames. ], batch size: 415, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:45:44,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1294572.0, ans=0.125 2023-06-22 18:45:53,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 4.015e+02 5.277e+02 7.353e+02 1.540e+03, threshold=1.055e+03, percent-clipped=5.0 2023-06-22 18:47:11,792 INFO [train.py:996] (3/4) Epoch 8, batch 2350, loss[loss=0.2401, simple_loss=0.3272, pruned_loss=0.07644, over 21788.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3052, pruned_loss=0.08137, over 4266681.95 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:48:12,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1294992.0, ans=0.1 2023-06-22 18:48:16,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1295052.0, ans=0.125 2023-06-22 18:48:38,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1295112.0, ans=0.0 2023-06-22 18:48:38,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1295112.0, ans=0.125 2023-06-22 18:48:53,757 INFO [train.py:996] (3/4) Epoch 8, batch 2400, loss[loss=0.2494, simple_loss=0.3234, pruned_loss=0.08772, over 21725.00 frames. 
], tot_loss[loss=0.2396, simple_loss=0.311, pruned_loss=0.08411, over 4269640.12 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:49:18,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 4.939e+02 6.884e+02 8.991e+02 1.831e+03, threshold=1.377e+03, percent-clipped=16.0 2023-06-22 18:50:35,302 INFO [train.py:996] (3/4) Epoch 8, batch 2450, loss[loss=0.2165, simple_loss=0.3336, pruned_loss=0.0497, over 20793.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3144, pruned_loss=0.08629, over 4271899.81 frames. ], batch size: 608, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:50:43,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1295472.0, ans=0.125 2023-06-22 18:50:45,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1295472.0, ans=0.125 2023-06-22 18:51:51,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=22.5 2023-06-22 18:51:54,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295712.0, ans=0.1 2023-06-22 18:52:13,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-22 18:52:15,528 INFO [train.py:996] (3/4) Epoch 8, batch 2500, loss[loss=0.2389, simple_loss=0.3258, pruned_loss=0.07596, over 21514.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3149, pruned_loss=0.0848, over 4274953.98 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:52:34,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.409e+02 5.815e+02 8.522e+02 2.143e+03, threshold=1.163e+03, percent-clipped=4.0 2023-06-22 18:53:03,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1295892.0, ans=0.0 2023-06-22 18:53:09,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-22 18:53:10,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295892.0, ans=0.1 2023-06-22 18:53:17,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1295952.0, ans=0.125 2023-06-22 18:53:52,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1296012.0, ans=0.125 2023-06-22 18:53:57,197 INFO [train.py:996] (3/4) Epoch 8, batch 2550, loss[loss=0.2043, simple_loss=0.2728, pruned_loss=0.06789, over 21818.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3127, pruned_loss=0.08456, over 4271578.76 frames. 
], batch size: 317, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:53:57,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1296072.0, ans=0.125 2023-06-22 18:54:02,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296072.0, ans=0.125 2023-06-22 18:55:07,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1296252.0, ans=10.0 2023-06-22 18:55:29,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1296312.0, ans=0.0 2023-06-22 18:55:37,955 INFO [train.py:996] (3/4) Epoch 8, batch 2600, loss[loss=0.2851, simple_loss=0.3504, pruned_loss=0.1099, over 21794.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3136, pruned_loss=0.08599, over 4273063.65 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:55:43,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296372.0, ans=0.1 2023-06-22 18:55:43,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1296372.0, ans=0.04949747468305833 2023-06-22 18:55:57,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.832e+02 4.000e+02 4.999e+02 6.903e+02 1.017e+03, threshold=9.998e+02, percent-clipped=0.0 2023-06-22 18:56:34,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-22 18:56:39,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-06-22 18:57:01,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1296612.0, ans=0.0 2023-06-22 18:57:14,594 INFO [train.py:996] (3/4) Epoch 8, batch 2650, loss[loss=0.2582, simple_loss=0.3262, pruned_loss=0.09511, over 21615.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3143, pruned_loss=0.08701, over 4279053.71 frames. 
], batch size: 131, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:57:24,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296672.0, ans=0.1 2023-06-22 18:57:24,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1296672.0, ans=0.125 2023-06-22 18:57:26,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1296672.0, ans=0.125 2023-06-22 18:57:34,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1296732.0, ans=0.125 2023-06-22 18:57:35,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1296732.0, ans=0.0 2023-06-22 18:58:04,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1296792.0, ans=0.2 2023-06-22 18:58:05,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1296792.0, ans=0.2 2023-06-22 18:58:55,284 INFO [train.py:996] (3/4) Epoch 8, batch 2700, loss[loss=0.1518, simple_loss=0.2008, pruned_loss=0.05139, over 16283.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3129, pruned_loss=0.08498, over 4269541.95 frames. ], batch size: 61, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:58:57,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1296972.0, ans=0.2 2023-06-22 18:58:58,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1296972.0, ans=0.125 2023-06-22 18:59:08,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1296972.0, ans=0.125 2023-06-22 18:59:14,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.148e+02 4.384e+02 5.256e+02 7.143e+02 1.333e+03, threshold=1.051e+03, percent-clipped=8.0 2023-06-22 18:59:18,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297032.0, ans=0.1 2023-06-22 18:59:20,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1297032.0, ans=0.025 2023-06-22 19:00:37,134 INFO [train.py:996] (3/4) Epoch 8, batch 2750, loss[loss=0.2218, simple_loss=0.29, pruned_loss=0.07683, over 21826.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3139, pruned_loss=0.08453, over 4266166.22 frames. 
], batch size: 298, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 19:00:48,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1297272.0, ans=0.04949747468305833 2023-06-22 19:01:10,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297332.0, ans=0.1 2023-06-22 19:01:46,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1297452.0, ans=15.0 2023-06-22 19:01:58,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1297452.0, ans=0.0 2023-06-22 19:02:18,159 INFO [train.py:996] (3/4) Epoch 8, batch 2800, loss[loss=0.2782, simple_loss=0.3387, pruned_loss=0.1089, over 21365.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3204, pruned_loss=0.08674, over 4274283.08 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:02:22,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-22 19:02:56,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.531e+02 5.975e+02 9.124e+02 1.757e+03, threshold=1.195e+03, percent-clipped=17.0 2023-06-22 19:03:13,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1297692.0, ans=0.125 2023-06-22 19:03:48,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1297812.0, ans=0.125 2023-06-22 19:03:54,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-22 19:03:58,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1297812.0, ans=0.0 2023-06-22 19:04:00,773 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:04:01,692 INFO [train.py:996] (3/4) Epoch 8, batch 2850, loss[loss=0.2161, simple_loss=0.2982, pruned_loss=0.06705, over 21759.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3233, pruned_loss=0.08834, over 4278995.71 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:04:51,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1297992.0, ans=0.125 2023-06-22 19:05:25,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1298112.0, ans=0.0 2023-06-22 19:05:36,697 INFO [train.py:996] (3/4) Epoch 8, batch 2900, loss[loss=0.2356, simple_loss=0.3013, pruned_loss=0.08494, over 21482.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3181, pruned_loss=0.08735, over 4274838.92 frames. 
], batch size: 194, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:05:51,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1298172.0, ans=0.1 2023-06-22 19:06:12,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 4.241e+02 6.077e+02 8.455e+02 1.821e+03, threshold=1.215e+03, percent-clipped=6.0 2023-06-22 19:06:13,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1298232.0, ans=0.0 2023-06-22 19:06:25,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298292.0, ans=0.125 2023-06-22 19:06:57,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1298352.0, ans=0.1 2023-06-22 19:07:01,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1298412.0, ans=0.125 2023-06-22 19:07:09,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1298412.0, ans=0.0 2023-06-22 19:07:15,541 INFO [train.py:996] (3/4) Epoch 8, batch 2950, loss[loss=0.2385, simple_loss=0.3273, pruned_loss=0.07486, over 21592.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3181, pruned_loss=0.08634, over 4281723.17 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:07:58,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-22 19:08:05,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1298592.0, ans=0.0 2023-06-22 19:08:56,477 INFO [train.py:996] (3/4) Epoch 8, batch 3000, loss[loss=0.2523, simple_loss=0.3266, pruned_loss=0.08906, over 21546.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3231, pruned_loss=0.08767, over 4288365.34 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:08:56,477 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 19:09:17,904 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2518, simple_loss=0.3464, pruned_loss=0.0786, over 1796401.00 frames. 2023-06-22 19:09:17,904 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 19:09:40,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.528e+02 5.625e+02 8.163e+02 1.642e+03, threshold=1.125e+03, percent-clipped=6.0 2023-06-22 19:10:22,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1298952.0, ans=0.2 2023-06-22 19:10:57,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1299072.0, ans=0.125 2023-06-22 19:10:58,762 INFO [train.py:996] (3/4) Epoch 8, batch 3050, loss[loss=0.2119, simple_loss=0.2815, pruned_loss=0.07116, over 21511.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3211, pruned_loss=0.08553, over 4283627.35 frames. 
], batch size: 194, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:11:33,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1299192.0, ans=0.125 2023-06-22 19:11:55,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299252.0, ans=0.1 2023-06-22 19:12:38,537 INFO [train.py:996] (3/4) Epoch 8, batch 3100, loss[loss=0.2122, simple_loss=0.2988, pruned_loss=0.06282, over 21706.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3213, pruned_loss=0.08446, over 4284889.79 frames. ], batch size: 247, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:12:43,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1299372.0, ans=0.07 2023-06-22 19:13:05,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 3.934e+02 5.608e+02 7.913e+02 1.726e+03, threshold=1.122e+03, percent-clipped=9.0 2023-06-22 19:13:13,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1299492.0, ans=0.2 2023-06-22 19:13:16,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-22 19:13:31,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1299552.0, ans=0.125 2023-06-22 19:14:18,774 INFO [train.py:996] (3/4) Epoch 8, batch 3150, loss[loss=0.2591, simple_loss=0.3362, pruned_loss=0.09104, over 21729.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.323, pruned_loss=0.08523, over 4284777.58 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:14:41,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1299732.0, ans=0.125 2023-06-22 19:14:49,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1299732.0, ans=0.125 2023-06-22 19:16:05,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1299972.0, ans=0.125 2023-06-22 19:16:06,900 INFO [train.py:996] (3/4) Epoch 8, batch 3200, loss[loss=0.2665, simple_loss=0.3303, pruned_loss=0.1013, over 21340.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3253, pruned_loss=0.0865, over 4280110.32 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:16:29,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 3.882e+02 4.322e+02 5.833e+02 1.816e+03, threshold=8.643e+02, percent-clipped=1.0 2023-06-22 19:17:11,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1300152.0, ans=0.2 2023-06-22 19:17:29,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1300212.0, ans=0.125 2023-06-22 19:17:46,239 INFO [train.py:996] (3/4) Epoch 8, batch 3250, loss[loss=0.222, simple_loss=0.2835, pruned_loss=0.08024, over 21683.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3261, pruned_loss=0.08828, over 4282056.48 frames. 
], batch size: 333, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:17:51,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1300272.0, ans=0.125 2023-06-22 19:18:00,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1300332.0, ans=0.0 2023-06-22 19:18:00,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1300332.0, ans=0.125 2023-06-22 19:18:49,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1300452.0, ans=0.125 2023-06-22 19:18:51,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1300452.0, ans=0.0 2023-06-22 19:19:14,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-22 19:19:25,806 INFO [train.py:996] (3/4) Epoch 8, batch 3300, loss[loss=0.2375, simple_loss=0.327, pruned_loss=0.07397, over 21664.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3241, pruned_loss=0.08747, over 4267619.73 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:19:38,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1300572.0, ans=0.125 2023-06-22 19:19:47,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1300632.0, ans=0.125 2023-06-22 19:19:48,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.529e+02 6.033e+02 9.657e+02 1.783e+03, threshold=1.207e+03, percent-clipped=28.0 2023-06-22 19:20:04,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1300632.0, ans=0.125 2023-06-22 19:21:04,773 INFO [train.py:996] (3/4) Epoch 8, batch 3350, loss[loss=0.2502, simple_loss=0.3153, pruned_loss=0.09253, over 21384.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3253, pruned_loss=0.08807, over 4274809.50 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:21:10,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1300872.0, ans=0.125 2023-06-22 19:21:42,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1300932.0, ans=0.0 2023-06-22 19:21:51,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1300992.0, ans=0.125 2023-06-22 19:22:06,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1301052.0, ans=0.5 2023-06-22 19:22:43,276 INFO [train.py:996] (3/4) Epoch 8, batch 3400, loss[loss=0.217, simple_loss=0.2866, pruned_loss=0.07377, over 21663.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3235, pruned_loss=0.0877, over 4277723.16 frames. 
], batch size: 247, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:22:55,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1301172.0, ans=0.2 2023-06-22 19:23:08,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1301232.0, ans=0.0 2023-06-22 19:23:15,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1301232.0, ans=0.2 2023-06-22 19:23:16,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 4.257e+02 5.465e+02 6.871e+02 1.586e+03, threshold=1.093e+03, percent-clipped=5.0 2023-06-22 19:23:20,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-22 19:23:27,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-22 19:23:28,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=15.0 2023-06-22 19:23:31,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1301292.0, ans=10.0 2023-06-22 19:23:50,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1301352.0, ans=0.125 2023-06-22 19:24:11,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1301412.0, ans=0.125 2023-06-22 19:24:24,317 INFO [train.py:996] (3/4) Epoch 8, batch 3450, loss[loss=0.2553, simple_loss=0.3193, pruned_loss=0.09566, over 21820.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3193, pruned_loss=0.08672, over 4282833.80 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:25:03,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1301532.0, ans=0.0 2023-06-22 19:25:29,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1301592.0, ans=0.5 2023-06-22 19:25:39,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-22 19:26:09,199 INFO [train.py:996] (3/4) Epoch 8, batch 3500, loss[loss=0.2972, simple_loss=0.3602, pruned_loss=0.1171, over 21256.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3289, pruned_loss=0.09108, over 4285494.46 frames. 
], batch size: 159, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:26:11,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1301772.0, ans=0.125 2023-06-22 19:26:22,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1301772.0, ans=0.125 2023-06-22 19:26:35,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1301832.0, ans=0.125 2023-06-22 19:26:36,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.220e+02 4.832e+02 6.635e+02 8.517e+02 1.814e+03, threshold=1.327e+03, percent-clipped=16.0 2023-06-22 19:27:42,632 INFO [train.py:996] (3/4) Epoch 8, batch 3550, loss[loss=0.2033, simple_loss=0.2674, pruned_loss=0.06953, over 19857.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3316, pruned_loss=0.0928, over 4285339.95 frames. ], batch size: 703, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:27:57,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1302072.0, ans=0.0 2023-06-22 19:28:03,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1302132.0, ans=0.125 2023-06-22 19:28:14,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302132.0, ans=0.125 2023-06-22 19:28:16,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1302132.0, ans=0.2 2023-06-22 19:29:21,126 INFO [train.py:996] (3/4) Epoch 8, batch 3600, loss[loss=0.2468, simple_loss=0.2992, pruned_loss=0.09722, over 21842.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3248, pruned_loss=0.09228, over 4285850.49 frames. 
], batch size: 98, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:29:30,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302372.0, ans=0.1 2023-06-22 19:29:31,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302372.0, ans=0.125 2023-06-22 19:29:34,896 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:29:34,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1302372.0, ans=0.0 2023-06-22 19:29:48,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.388e+02 6.270e+02 8.797e+02 1.377e+03, threshold=1.254e+03, percent-clipped=1.0 2023-06-22 19:30:12,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1302492.0, ans=0.125 2023-06-22 19:30:15,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1302492.0, ans=0.125 2023-06-22 19:30:25,291 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:30:32,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1302612.0, ans=0.125 2023-06-22 19:30:59,799 INFO [train.py:996] (3/4) Epoch 8, batch 3650, loss[loss=0.2223, simple_loss=0.3102, pruned_loss=0.06717, over 21778.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3231, pruned_loss=0.09085, over 4283439.20 frames. ], batch size: 332, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:31:07,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302672.0, ans=0.1 2023-06-22 19:31:07,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1302672.0, ans=0.125 2023-06-22 19:31:21,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1302732.0, ans=0.0 2023-06-22 19:32:01,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1302852.0, ans=0.125 2023-06-22 19:32:37,353 INFO [train.py:996] (3/4) Epoch 8, batch 3700, loss[loss=0.2498, simple_loss=0.3161, pruned_loss=0.09174, over 21804.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3208, pruned_loss=0.0896, over 4293271.72 frames. ], batch size: 107, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:32:50,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1302972.0, ans=0.0 2023-06-22 19:32:58,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1303032.0, ans=0.125 2023-06-22 19:33:03,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. 
limit=6.0 2023-06-22 19:33:04,875 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:33:05,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.107e+02 5.215e+02 7.517e+02 1.439e+03, threshold=1.043e+03, percent-clipped=3.0 2023-06-22 19:33:14,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1303032.0, ans=0.125 2023-06-22 19:33:33,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1303152.0, ans=0.1 2023-06-22 19:33:55,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-22 19:34:16,533 INFO [train.py:996] (3/4) Epoch 8, batch 3750, loss[loss=0.2267, simple_loss=0.2981, pruned_loss=0.07763, over 21823.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3194, pruned_loss=0.08877, over 4291708.95 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:34:16,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1303272.0, ans=0.125 2023-06-22 19:34:36,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1303332.0, ans=0.025 2023-06-22 19:35:03,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1303392.0, ans=0.125 2023-06-22 19:36:00,327 INFO [train.py:996] (3/4) Epoch 8, batch 3800, loss[loss=0.2662, simple_loss=0.3379, pruned_loss=0.09723, over 21559.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.316, pruned_loss=0.08617, over 4285547.75 frames. ], batch size: 389, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:36:06,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-22 19:36:27,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-22 19:36:28,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.734e+02 6.125e+02 7.875e+02 1.546e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-22 19:36:35,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1303692.0, ans=0.125 2023-06-22 19:36:42,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1303692.0, ans=0.1 2023-06-22 19:37:07,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-22 19:37:36,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-22 19:37:37,320 INFO [train.py:996] (3/4) Epoch 8, batch 3850, loss[loss=0.2112, simple_loss=0.272, pruned_loss=0.07523, over 21635.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3138, pruned_loss=0.08695, over 4286455.32 frames. 
], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:37:59,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1303932.0, ans=0.0 2023-06-22 19:38:21,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-22 19:38:47,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1304112.0, ans=0.125 2023-06-22 19:39:16,044 INFO [train.py:996] (3/4) Epoch 8, batch 3900, loss[loss=0.2356, simple_loss=0.2986, pruned_loss=0.08631, over 21867.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3099, pruned_loss=0.08635, over 4277723.17 frames. ], batch size: 371, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:39:45,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.973e+02 4.645e+02 5.915e+02 7.788e+02 1.896e+03, threshold=1.183e+03, percent-clipped=6.0 2023-06-22 19:40:20,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1304352.0, ans=0.1 2023-06-22 19:40:39,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1304412.0, ans=0.0 2023-06-22 19:40:56,629 INFO [train.py:996] (3/4) Epoch 8, batch 3950, loss[loss=0.2396, simple_loss=0.326, pruned_loss=0.07664, over 21786.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.312, pruned_loss=0.08536, over 4277486.68 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:41:39,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-22 19:41:41,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1304592.0, ans=0.125 2023-06-22 19:42:36,352 INFO [train.py:996] (3/4) Epoch 8, batch 4000, loss[loss=0.2032, simple_loss=0.2618, pruned_loss=0.07229, over 21903.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3052, pruned_loss=0.08127, over 4277051.30 frames. ], batch size: 113, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:43:05,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.102e+02 5.710e+02 7.605e+02 1.219e+03, threshold=1.142e+03, percent-clipped=1.0 2023-06-22 19:43:13,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-22 19:44:21,098 INFO [train.py:996] (3/4) Epoch 8, batch 4050, loss[loss=0.2128, simple_loss=0.2973, pruned_loss=0.06416, over 21792.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3051, pruned_loss=0.0801, over 4276107.77 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:44:23,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305072.0, ans=0.125 2023-06-22 19:44:32,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1305072.0, ans=0.125 2023-06-22 19:46:00,749 INFO [train.py:996] (3/4) Epoch 8, batch 4100, loss[loss=0.2272, simple_loss=0.3079, pruned_loss=0.07319, over 21783.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3072, pruned_loss=0.08123, over 4281979.21 frames. 
], batch size: 332, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:46:14,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1305372.0, ans=0.0 2023-06-22 19:46:26,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.586e+02 4.759e+02 6.058e+02 1.628e+03, threshold=9.517e+02, percent-clipped=6.0 2023-06-22 19:46:31,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1305492.0, ans=0.125 2023-06-22 19:46:49,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1305492.0, ans=0.125 2023-06-22 19:47:08,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1305552.0, ans=0.125 2023-06-22 19:47:19,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1305612.0, ans=0.0 2023-06-22 19:47:22,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-22 19:47:37,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305612.0, ans=0.125 2023-06-22 19:47:40,274 INFO [train.py:996] (3/4) Epoch 8, batch 4150, loss[loss=0.2522, simple_loss=0.3279, pruned_loss=0.08824, over 21593.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3068, pruned_loss=0.07773, over 4289971.66 frames. ], batch size: 414, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:47:47,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1305672.0, ans=0.2 2023-06-22 19:47:54,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-22 19:48:44,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1305852.0, ans=0.05 2023-06-22 19:48:56,909 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:49:12,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1305912.0, ans=0.2 2023-06-22 19:49:16,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305912.0, ans=0.125 2023-06-22 19:49:23,578 INFO [train.py:996] (3/4) Epoch 8, batch 4200, loss[loss=0.3533, simple_loss=0.424, pruned_loss=0.1413, over 21453.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3091, pruned_loss=0.07889, over 4287495.82 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:49:57,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.597e+02 4.423e+02 6.282e+02 9.309e+02 2.210e+03, threshold=1.256e+03, percent-clipped=22.0 2023-06-22 19:50:58,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1306212.0, ans=0.125 2023-06-22 19:51:06,561 INFO [train.py:996] (3/4) Epoch 8, batch 4250, loss[loss=0.2726, simple_loss=0.349, pruned_loss=0.09809, over 21755.00 frames. 
], tot_loss[loss=0.2374, simple_loss=0.3147, pruned_loss=0.08003, over 4274238.29 frames. ], batch size: 124, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:51:28,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1306332.0, ans=0.2 2023-06-22 19:51:46,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1306332.0, ans=0.0 2023-06-22 19:51:53,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1306392.0, ans=0.125 2023-06-22 19:52:29,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-22 19:52:55,167 INFO [train.py:996] (3/4) Epoch 8, batch 4300, loss[loss=0.2322, simple_loss=0.3468, pruned_loss=0.05882, over 21241.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3247, pruned_loss=0.0839, over 4271790.03 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:53:38,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.490e+02 6.473e+02 1.024e+03 2.368e+03, threshold=1.295e+03, percent-clipped=12.0 2023-06-22 19:54:33,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-06-22 19:54:35,535 INFO [train.py:996] (3/4) Epoch 8, batch 4350, loss[loss=0.1832, simple_loss=0.2486, pruned_loss=0.0589, over 21595.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3224, pruned_loss=0.08246, over 4271228.08 frames. ], batch size: 231, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:54:53,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-22 19:55:03,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1306932.0, ans=0.0 2023-06-22 19:55:12,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1306932.0, ans=0.125 2023-06-22 19:55:26,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1306992.0, ans=0.2 2023-06-22 19:55:47,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1307052.0, ans=0.0 2023-06-22 19:56:01,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1307112.0, ans=0.0 2023-06-22 19:56:15,888 INFO [train.py:996] (3/4) Epoch 8, batch 4400, loss[loss=0.2246, simple_loss=0.2869, pruned_loss=0.08115, over 21148.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3168, pruned_loss=0.08193, over 4265559.27 frames. 
], batch size: 143, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:56:43,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307232.0, ans=0.1 2023-06-22 19:56:50,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1307232.0, ans=0.0 2023-06-22 19:56:53,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.391e+02 5.978e+02 7.745e+02 1.639e+03, threshold=1.196e+03, percent-clipped=7.0 2023-06-22 19:56:56,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1307232.0, ans=0.2 2023-06-22 19:57:13,815 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:57:33,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-22 19:57:35,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1307412.0, ans=0.125 2023-06-22 19:58:02,005 INFO [train.py:996] (3/4) Epoch 8, batch 4450, loss[loss=0.2688, simple_loss=0.36, pruned_loss=0.08881, over 21649.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3249, pruned_loss=0.08335, over 4254214.35 frames. ], batch size: 263, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:59:01,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1307652.0, ans=0.2 2023-06-22 19:59:05,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.22 vs. limit=22.5 2023-06-22 19:59:23,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307712.0, ans=0.1 2023-06-22 19:59:23,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307712.0, ans=0.1 2023-06-22 19:59:39,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1307712.0, ans=0.0 2023-06-22 19:59:48,261 INFO [train.py:996] (3/4) Epoch 8, batch 4500, loss[loss=0.3033, simple_loss=0.3765, pruned_loss=0.115, over 21732.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3246, pruned_loss=0.0845, over 4263125.86 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:00:14,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.895e+02 4.151e+02 5.436e+02 7.450e+02 1.876e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-22 20:01:28,207 INFO [train.py:996] (3/4) Epoch 8, batch 4550, loss[loss=0.3303, simple_loss=0.3869, pruned_loss=0.1368, over 21323.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3265, pruned_loss=0.08473, over 4268636.80 frames. 
], batch size: 507, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:01:40,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1308072.0, ans=0.125 2023-06-22 20:02:10,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1308192.0, ans=0.125 2023-06-22 20:02:14,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-06-22 20:02:19,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1308192.0, ans=0.125 2023-06-22 20:03:08,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-22 20:03:08,576 INFO [train.py:996] (3/4) Epoch 8, batch 4600, loss[loss=0.2304, simple_loss=0.3001, pruned_loss=0.08032, over 21165.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3292, pruned_loss=0.08669, over 4271720.91 frames. ], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:03:26,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1308432.0, ans=0.035 2023-06-22 20:03:40,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.710e+02 4.165e+02 5.279e+02 6.740e+02 1.716e+03, threshold=1.056e+03, percent-clipped=6.0 2023-06-22 20:04:06,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1308552.0, ans=0.125 2023-06-22 20:04:47,310 INFO [train.py:996] (3/4) Epoch 8, batch 4650, loss[loss=0.2274, simple_loss=0.2964, pruned_loss=0.07923, over 21726.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3245, pruned_loss=0.08582, over 4280301.34 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:05:03,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308732.0, ans=0.1 2023-06-22 20:05:26,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1308792.0, ans=0.1 2023-06-22 20:05:39,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1308792.0, ans=0.025 2023-06-22 20:06:18,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-22 20:06:22,464 INFO [train.py:996] (3/4) Epoch 8, batch 4700, loss[loss=0.2255, simple_loss=0.2858, pruned_loss=0.08257, over 21691.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3141, pruned_loss=0.08332, over 4278900.73 frames. 
], batch size: 282, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:06:46,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1309032.0, ans=0.125 2023-06-22 20:06:51,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1309032.0, ans=0.0 2023-06-22 20:06:54,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.523e+02 4.183e+02 5.885e+02 1.412e+03, threshold=8.365e+02, percent-clipped=3.0 2023-06-22 20:06:57,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1309092.0, ans=0.2 2023-06-22 20:07:39,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1309152.0, ans=0.125 2023-06-22 20:08:02,200 INFO [train.py:996] (3/4) Epoch 8, batch 4750, loss[loss=0.2297, simple_loss=0.2883, pruned_loss=0.08559, over 21292.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3081, pruned_loss=0.08345, over 4281984.62 frames. ], batch size: 159, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:08:15,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309272.0, ans=0.1 2023-06-22 20:09:36,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1309512.0, ans=0.0 2023-06-22 20:09:42,387 INFO [train.py:996] (3/4) Epoch 8, batch 4800, loss[loss=0.2617, simple_loss=0.36, pruned_loss=0.08172, over 21532.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3095, pruned_loss=0.085, over 4285429.62 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:10:09,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1309632.0, ans=0.125 2023-06-22 20:10:14,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.179e+02 5.198e+02 6.996e+02 1.429e+03, threshold=1.040e+03, percent-clipped=10.0 2023-06-22 20:10:37,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-22 20:11:21,048 INFO [train.py:996] (3/4) Epoch 8, batch 4850, loss[loss=0.2687, simple_loss=0.3309, pruned_loss=0.1033, over 21637.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3083, pruned_loss=0.08289, over 4278274.02 frames. ], batch size: 507, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:12:36,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1310052.0, ans=0.125 2023-06-22 20:13:00,619 INFO [train.py:996] (3/4) Epoch 8, batch 4900, loss[loss=0.2744, simple_loss=0.3619, pruned_loss=0.09341, over 21721.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3099, pruned_loss=0.08386, over 4285099.96 frames. ], batch size: 351, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:13:22,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.43 vs. 
limit=22.5 2023-06-22 20:13:32,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 3.969e+02 5.003e+02 6.919e+02 1.603e+03, threshold=1.001e+03, percent-clipped=6.0 2023-06-22 20:13:37,665 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:14:40,865 INFO [train.py:996] (3/4) Epoch 8, batch 4950, loss[loss=0.1927, simple_loss=0.2903, pruned_loss=0.04757, over 21739.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3136, pruned_loss=0.08224, over 4278573.78 frames. ], batch size: 351, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:14:51,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-22 20:16:19,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1310772.0, ans=0.125 2023-06-22 20:16:25,030 INFO [train.py:996] (3/4) Epoch 8, batch 5000, loss[loss=0.2668, simple_loss=0.3368, pruned_loss=0.09846, over 21841.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3127, pruned_loss=0.07878, over 4276893.38 frames. ], batch size: 371, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:16:27,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1310772.0, ans=0.0 2023-06-22 20:16:51,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.498e+02 3.608e+02 4.622e+02 7.271e+02 1.664e+03, threshold=9.243e+02, percent-clipped=6.0 2023-06-22 20:17:20,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-22 20:17:25,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1310952.0, ans=0.125 2023-06-22 20:17:36,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1311012.0, ans=0.2 2023-06-22 20:17:53,494 INFO [train.py:996] (3/4) Epoch 8, batch 5050, loss[loss=0.2342, simple_loss=0.3085, pruned_loss=0.07995, over 21567.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3123, pruned_loss=0.07971, over 4277839.77 frames. ], batch size: 195, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:18:17,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-22 20:19:28,949 INFO [train.py:996] (3/4) Epoch 8, batch 5100, loss[loss=0.2345, simple_loss=0.2981, pruned_loss=0.08546, over 21835.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3111, pruned_loss=0.08108, over 4280860.48 frames. ], batch size: 282, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:19:29,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.71 vs. 
limit=15.0 2023-06-22 20:19:30,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1311372.0, ans=0.2 2023-06-22 20:19:45,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1311372.0, ans=0.2 2023-06-22 20:20:02,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 3.868e+02 4.783e+02 6.520e+02 1.021e+03, threshold=9.567e+02, percent-clipped=2.0 2023-06-22 20:20:09,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-22 20:21:08,400 INFO [train.py:996] (3/4) Epoch 8, batch 5150, loss[loss=0.2272, simple_loss=0.2957, pruned_loss=0.07932, over 21597.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3103, pruned_loss=0.08161, over 4278208.67 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:21:25,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-22 20:21:29,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1311732.0, ans=0.0 2023-06-22 20:21:56,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-06-22 20:22:34,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1311912.0, ans=0.125 2023-06-22 20:22:41,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-22 20:22:42,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1311912.0, ans=0.1 2023-06-22 20:22:52,481 INFO [train.py:996] (3/4) Epoch 8, batch 5200, loss[loss=0.3004, simple_loss=0.3969, pruned_loss=0.1019, over 21230.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3149, pruned_loss=0.08258, over 4273764.91 frames. ], batch size: 548, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:23:09,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1312032.0, ans=0.1 2023-06-22 20:23:10,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1312032.0, ans=0.125 2023-06-22 20:23:14,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1312032.0, ans=0.125 2023-06-22 20:23:20,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1312032.0, ans=0.125 2023-06-22 20:23:20,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1312032.0, ans=0.0 2023-06-22 20:23:26,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.134e+02 4.423e+02 5.674e+02 8.806e+02 1.736e+03, threshold=1.135e+03, percent-clipped=18.0 2023-06-22 20:24:32,941 INFO [train.py:996] (3/4) Epoch 8, batch 5250, loss[loss=0.2048, simple_loss=0.2825, pruned_loss=0.06359, over 21770.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3192, pruned_loss=0.08102, over 4268786.24 frames. 
], batch size: 112, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:25:08,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1312392.0, ans=0.1 2023-06-22 20:25:14,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1312392.0, ans=0.125 2023-06-22 20:25:42,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1312452.0, ans=0.0 2023-06-22 20:25:50,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-22 20:25:57,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-22 20:26:11,365 INFO [train.py:996] (3/4) Epoch 8, batch 5300, loss[loss=0.2461, simple_loss=0.3066, pruned_loss=0.09285, over 21862.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3179, pruned_loss=0.08178, over 4275865.93 frames. ], batch size: 282, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:26:14,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1312572.0, ans=0.09899494936611666 2023-06-22 20:26:30,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312632.0, ans=0.1 2023-06-22 20:26:44,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.970e+02 3.721e+02 4.525e+02 6.404e+02 1.262e+03, threshold=9.050e+02, percent-clipped=2.0 2023-06-22 20:27:03,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1312692.0, ans=0.125 2023-06-22 20:27:40,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1312812.0, ans=0.2 2023-06-22 20:27:49,127 INFO [train.py:996] (3/4) Epoch 8, batch 5350, loss[loss=0.2404, simple_loss=0.3006, pruned_loss=0.0901, over 21821.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3179, pruned_loss=0.08433, over 4276977.10 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:29:00,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-22 20:29:03,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-22 20:29:07,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1313112.0, ans=0.125 2023-06-22 20:29:22,746 INFO [train.py:996] (3/4) Epoch 8, batch 5400, loss[loss=0.2659, simple_loss=0.4085, pruned_loss=0.06168, over 19740.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3169, pruned_loss=0.08527, over 4277633.82 frames. 
], batch size: 702, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:30:01,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.001e+02 4.337e+02 6.656e+02 9.891e+02 1.935e+03, threshold=1.331e+03, percent-clipped=29.0 2023-06-22 20:30:36,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 20:30:36,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-22 20:30:41,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1313412.0, ans=0.125 2023-06-22 20:31:03,440 INFO [train.py:996] (3/4) Epoch 8, batch 5450, loss[loss=0.2155, simple_loss=0.3023, pruned_loss=0.06436, over 21383.00 frames. ], tot_loss[loss=0.241, simple_loss=0.317, pruned_loss=0.08253, over 4275980.45 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:31:04,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-06-22 20:31:08,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1313472.0, ans=0.2 2023-06-22 20:31:08,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1313472.0, ans=0.125 2023-06-22 20:31:46,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1313532.0, ans=0.02 2023-06-22 20:31:53,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1313592.0, ans=0.125 2023-06-22 20:32:43,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313712.0, ans=0.125 2023-06-22 20:32:46,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1313712.0, ans=0.125 2023-06-22 20:32:49,356 INFO [train.py:996] (3/4) Epoch 8, batch 5500, loss[loss=0.1987, simple_loss=0.2924, pruned_loss=0.05246, over 21667.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3199, pruned_loss=0.07955, over 4274149.42 frames. 
], batch size: 247, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:32:54,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313772.0, ans=0.125 2023-06-22 20:33:11,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1313772.0, ans=0.0 2023-06-22 20:33:18,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1313832.0, ans=0.125 2023-06-22 20:33:24,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1313832.0, ans=0.05 2023-06-22 20:33:31,686 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.855e+02 4.320e+02 6.103e+02 1.036e+03 2.497e+03, threshold=1.221e+03, percent-clipped=15.0 2023-06-22 20:33:55,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1313952.0, ans=0.0 2023-06-22 20:34:40,439 INFO [train.py:996] (3/4) Epoch 8, batch 5550, loss[loss=0.2101, simple_loss=0.2991, pruned_loss=0.06055, over 21672.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3199, pruned_loss=0.07707, over 4271243.62 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:34:50,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1314072.0, ans=0.125 2023-06-22 20:34:57,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314132.0, ans=0.1 2023-06-22 20:34:57,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1314132.0, ans=0.5 2023-06-22 20:35:45,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-22 20:36:20,963 INFO [train.py:996] (3/4) Epoch 8, batch 5600, loss[loss=0.2113, simple_loss=0.2944, pruned_loss=0.06413, over 21057.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3183, pruned_loss=0.07439, over 4277111.13 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:36:39,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314432.0, ans=0.1 2023-06-22 20:36:58,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 4.144e+02 6.431e+02 9.394e+02 1.823e+03, threshold=1.286e+03, percent-clipped=11.0 2023-06-22 20:37:06,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1314492.0, ans=0.125 2023-06-22 20:37:29,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1314552.0, ans=0.125 2023-06-22 20:37:55,135 INFO [train.py:996] (3/4) Epoch 8, batch 5650, loss[loss=0.328, simple_loss=0.3748, pruned_loss=0.1406, over 21688.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3219, pruned_loss=0.07747, over 4284427.43 frames. 
], batch size: 507, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:38:53,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1314852.0, ans=0.0 2023-06-22 20:39:19,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1314912.0, ans=0.125 2023-06-22 20:39:19,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1314912.0, ans=0.05 2023-06-22 20:39:34,653 INFO [train.py:996] (3/4) Epoch 8, batch 5700, loss[loss=0.2322, simple_loss=0.2929, pruned_loss=0.08571, over 21275.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3211, pruned_loss=0.07979, over 4279641.47 frames. ], batch size: 608, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:39:59,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1315032.0, ans=0.125 2023-06-22 20:40:12,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.552e+02 6.267e+02 8.726e+02 1.736e+03, threshold=1.253e+03, percent-clipped=4.0 2023-06-22 20:40:41,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1315152.0, ans=0.125 2023-06-22 20:40:52,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=15.0 2023-06-22 20:41:01,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-22 20:41:19,533 INFO [train.py:996] (3/4) Epoch 8, batch 5750, loss[loss=0.2045, simple_loss=0.2958, pruned_loss=0.05656, over 21638.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.318, pruned_loss=0.07754, over 4272610.85 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:42:22,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-22 20:42:23,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1315452.0, ans=0.04949747468305833 2023-06-22 20:42:59,380 INFO [train.py:996] (3/4) Epoch 8, batch 5800, loss[loss=0.261, simple_loss=0.3539, pruned_loss=0.08406, over 21767.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3151, pruned_loss=0.07529, over 4262440.16 frames. ], batch size: 332, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:43:03,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1315572.0, ans=0.125 2023-06-22 20:43:36,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1315632.0, ans=0.2 2023-06-22 20:43:41,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=12.0 2023-06-22 20:43:42,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.530e+02 3.870e+02 5.356e+02 7.893e+02 2.349e+03, threshold=1.071e+03, percent-clipped=9.0 2023-06-22 20:43:45,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1315692.0, ans=0.125 2023-06-22 20:43:45,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1315692.0, ans=0.0 2023-06-22 20:44:32,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1315812.0, ans=0.125 2023-06-22 20:44:35,125 INFO [train.py:996] (3/4) Epoch 8, batch 5850, loss[loss=0.1836, simple_loss=0.2796, pruned_loss=0.04379, over 21388.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3124, pruned_loss=0.07064, over 4271164.94 frames. ], batch size: 211, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:44:51,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-22 20:45:08,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1315932.0, ans=0.0 2023-06-22 20:45:23,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1315992.0, ans=0.2 2023-06-22 20:45:52,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1316112.0, ans=0.0 2023-06-22 20:46:01,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1316112.0, ans=0.125 2023-06-22 20:46:07,374 INFO [train.py:996] (3/4) Epoch 8, batch 5900, loss[loss=0.2203, simple_loss=0.2935, pruned_loss=0.07354, over 21279.00 frames. ], tot_loss[loss=0.217, simple_loss=0.3043, pruned_loss=0.06489, over 4277579.75 frames. 
], batch size: 159, lr: 3.81e-03, grad_scale: 16.0
2023-06-22 20:46:37,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1316232.0, ans=0.125
2023-06-22 20:46:43,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1316232.0, ans=0.1
2023-06-22 20:46:48,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.361e+02 4.544e+02 6.338e+02 1.644e+03, threshold=9.088e+02, percent-clipped=4.0
2023-06-22 20:47:01,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1316292.0, ans=0.1
2023-06-22 20:47:14,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1316352.0, ans=0.125
2023-06-22 20:47:17,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1316352.0, ans=0.125
2023-06-22 20:47:19,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1316352.0, ans=0.125
2023-06-22 20:47:32,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1316412.0, ans=0.0
2023-06-22 20:47:39,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1316412.0, ans=0.0
2023-06-22 20:47:42,260 INFO [train.py:996] (3/4) Epoch 8, batch 5950, loss[loss=0.2336, simple_loss=0.2905, pruned_loss=0.0883, over 21833.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3022, pruned_loss=0.0682, over 4273218.87 frames. ], batch size: 98, lr: 3.81e-03, grad_scale: 16.0
2023-06-22 20:47:48,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1316472.0, ans=0.125
2023-06-22 20:48:09,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0
2023-06-22 20:48:23,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1316592.0, ans=0.125
2023-06-22 20:49:03,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1316712.0, ans=0.125
2023-06-22 20:49:03,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1316712.0, ans=0.1
2023-06-22 20:49:19,223 INFO [train.py:996] (3/4) Epoch 8, batch 6000, loss[loss=0.252, simple_loss=0.3011, pruned_loss=0.1014, over 21393.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2992, pruned_loss=0.07238, over 4262758.68 frames. ], batch size: 473, lr: 3.81e-03, grad_scale: 32.0
2023-06-22 20:49:19,223 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-22 20:49:32,144 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.1449, 2.0061, 1.7980, 2.7031], device='cuda:3')
2023-06-22 20:49:40,898 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2636, simple_loss=0.3606, pruned_loss=0.08334, over 1796401.00 frames.
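Two of the recurring diagnostics above are worth unpacking. The optim.py:471 records summarize recent per-batch gradient norms as five points (min, 25%, median, 75%, max), and in the records here the logged threshold works out to Clipping_scale times the median (2.0 x 4.544e+02 = 9.088e+02 in the entry just above), with percent-clipped giving the share of batches whose gradient norm exceeded it. The zipformer.py:1728 record emitted while computing validation loss reports one entropy value per attention head of that self-attention module. The sketch below shows how statistics of this kind can be computed; the helper names, the use of torch.quantile, and the average-over-positions reduction are illustrative assumptions, not the icefall implementation.

    import torch

    def grad_norm_clipping_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
        # grad_norms: 1-D tensor of recent per-batch gradient norms.
        # Five summary points, matching the "grad-norm quartiles" printout
        # (min, 25%, median, 75%, max); hypothetical reconstruction.
        q = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # clipping_scale x median
        percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
        return q, threshold, percent_clipped

    def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
        # attn: (num_heads, tgt_len, src_len); each row is a softmax distribution.
        eps = 1.0e-20
        ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per query position
        return ent.mean(dim=-1)  # one averaged value per head

    # Example: four heads over random softmax rows yields a 4-element tensor,
    # the same shape as the tensor([1.1449, 2.0061, 1.7980, 2.7031]) logged above.
    w = torch.softmax(torch.randn(4, 10, 10), dim=-1)
    print(attn_weights_entropy(w))

Read together, a jump in threshold alongside a high percent-clipped (as in the 18.0 and 29.0 readings earlier in this epoch) flags batches with unusually large gradients, while low per-head attention entropy would flag heads that have collapsed onto a few positions.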
2023-06-22 20:49:40,899 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-22 20:50:04,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1316832.0, ans=0.0
2023-06-22 20:50:09,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0
2023-06-22 20:50:16,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.164e+02 5.189e+02 7.580e+02 1.356e+03, threshold=1.038e+03, percent-clipped=17.0
2023-06-22 20:51:09,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0
2023-06-22 20:51:14,230 INFO [train.py:996] (3/4) Epoch 8, batch 6050, loss[loss=0.159, simple_loss=0.2405, pruned_loss=0.03881, over 21454.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2945, pruned_loss=0.07398, over 4267912.71 frames. ], batch size: 195, lr: 3.81e-03, grad_scale: 32.0
2023-06-22 20:51:47,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1317132.0, ans=0.05
2023-06-22 20:51:55,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1317192.0, ans=0.0
2023-06-22 20:52:09,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1317252.0, ans=0.2
2023-06-22 20:52:36,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1317312.0, ans=0.125
2023-06-22 20:52:37,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1317312.0, ans=0.2
2023-06-22 20:52:50,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1317372.0, ans=0.125
2023-06-22 20:52:51,282 INFO [train.py:996] (3/4) Epoch 8, batch 6100, loss[loss=0.2427, simple_loss=0.3294, pruned_loss=0.07794, over 21507.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.293, pruned_loss=0.07198, over 4273716.32 frames. ], batch size: 471, lr: 3.81e-03, grad_scale: 16.0
2023-06-22 20:53:06,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1317432.0, ans=0.2
2023-06-22 20:53:07,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1317432.0, ans=0.125
2023-06-22 20:53:22,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.64 vs.
limit=22.5 2023-06-22 20:53:29,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.450e+02 4.266e+02 5.544e+02 1.374e+03, threshold=8.532e+02, percent-clipped=4.0 2023-06-22 20:53:40,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317492.0, ans=0.125 2023-06-22 20:53:46,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317552.0, ans=0.125 2023-06-22 20:54:01,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1317612.0, ans=0.125 2023-06-22 20:54:13,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1317612.0, ans=0.0 2023-06-22 20:54:28,759 INFO [train.py:996] (3/4) Epoch 8, batch 6150, loss[loss=0.2172, simple_loss=0.2854, pruned_loss=0.07454, over 21867.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.297, pruned_loss=0.07513, over 4280407.32 frames. ], batch size: 98, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:54:40,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1317672.0, ans=0.125 2023-06-22 20:56:06,736 INFO [train.py:996] (3/4) Epoch 8, batch 6200, loss[loss=0.2382, simple_loss=0.3076, pruned_loss=0.08445, over 21524.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3025, pruned_loss=0.07601, over 4285817.46 frames. ], batch size: 195, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:56:33,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1318032.0, ans=0.0 2023-06-22 20:56:41,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1318092.0, ans=0.2 2023-06-22 20:56:44,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.357e+02 5.420e+02 8.092e+02 2.121e+03, threshold=1.084e+03, percent-clipped=22.0 2023-06-22 20:57:36,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1318212.0, ans=0.2 2023-06-22 20:57:45,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-22 20:57:47,406 INFO [train.py:996] (3/4) Epoch 8, batch 6250, loss[loss=0.1948, simple_loss=0.2912, pruned_loss=0.04917, over 21391.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3085, pruned_loss=0.07578, over 4285731.39 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:57:57,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1318272.0, ans=0.0 2023-06-22 20:57:58,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-22 20:59:22,307 INFO [train.py:996] (3/4) Epoch 8, batch 6300, loss[loss=0.2466, simple_loss=0.3142, pruned_loss=0.08947, over 21594.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3111, pruned_loss=0.0743, over 4278224.68 frames. 
], batch size: 212, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:59:24,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1318572.0, ans=0.0 2023-06-22 20:59:57,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1318692.0, ans=0.125 2023-06-22 20:59:59,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-22 20:59:59,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.288e+02 6.355e+02 8.462e+02 1.476e+03, threshold=1.271e+03, percent-clipped=15.0 2023-06-22 21:00:21,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-22 21:00:25,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1318752.0, ans=0.2 2023-06-22 21:00:32,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1318752.0, ans=0.2 2023-06-22 21:01:04,419 INFO [train.py:996] (3/4) Epoch 8, batch 6350, loss[loss=0.2898, simple_loss=0.3545, pruned_loss=0.1126, over 21558.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3135, pruned_loss=0.07862, over 4284951.93 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:01:15,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1318872.0, ans=0.125 2023-06-22 21:01:15,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1318872.0, ans=0.2 2023-06-22 21:01:23,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1318932.0, ans=0.0 2023-06-22 21:02:43,817 INFO [train.py:996] (3/4) Epoch 8, batch 6400, loss[loss=0.2459, simple_loss=0.3222, pruned_loss=0.08485, over 21960.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3206, pruned_loss=0.08403, over 4284888.40 frames. ], batch size: 372, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:03:14,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1319232.0, ans=0.125 2023-06-22 21:03:16,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=12.0 2023-06-22 21:03:31,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 4.404e+02 5.449e+02 7.418e+02 1.410e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-22 21:04:21,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.95 vs. limit=22.5 2023-06-22 21:04:21,984 INFO [train.py:996] (3/4) Epoch 8, batch 6450, loss[loss=0.2167, simple_loss=0.2975, pruned_loss=0.06791, over 21736.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3236, pruned_loss=0.0838, over 4284680.14 frames. 
], batch size: 282, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:04:27,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1319472.0, ans=0.95 2023-06-22 21:05:10,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1319592.0, ans=0.035 2023-06-22 21:05:44,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1319712.0, ans=0.2 2023-06-22 21:05:59,779 INFO [train.py:996] (3/4) Epoch 8, batch 6500, loss[loss=0.2034, simple_loss=0.2615, pruned_loss=0.07268, over 21235.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.318, pruned_loss=0.08336, over 4285365.41 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:06:33,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1319832.0, ans=0.125 2023-06-22 21:06:45,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1319892.0, ans=0.125 2023-06-22 21:06:48,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.971e+02 4.929e+02 6.597e+02 9.364e+02 1.745e+03, threshold=1.319e+03, percent-clipped=16.0 2023-06-22 21:06:58,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=12.0 2023-06-22 21:07:28,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1320012.0, ans=0.1 2023-06-22 21:07:44,849 INFO [train.py:996] (3/4) Epoch 8, batch 6550, loss[loss=0.2283, simple_loss=0.3016, pruned_loss=0.07745, over 21818.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3165, pruned_loss=0.0807, over 4276456.69 frames. ], batch size: 282, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:08:13,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-22 21:08:22,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-22 21:08:58,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-22 21:09:17,827 INFO [train.py:996] (3/4) Epoch 8, batch 6600, loss[loss=0.1929, simple_loss=0.2575, pruned_loss=0.06418, over 21802.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3101, pruned_loss=0.07987, over 4276343.23 frames. 
], batch size: 98, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:09:41,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1320432.0, ans=0.125 2023-06-22 21:09:43,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1320432.0, ans=0.125 2023-06-22 21:09:57,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.721e+02 4.019e+02 5.750e+02 8.785e+02 1.668e+03, threshold=1.150e+03, percent-clipped=5.0 2023-06-22 21:10:53,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1320612.0, ans=0.0 2023-06-22 21:10:55,812 INFO [train.py:996] (3/4) Epoch 8, batch 6650, loss[loss=0.1851, simple_loss=0.2435, pruned_loss=0.06337, over 20883.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3021, pruned_loss=0.0781, over 4274779.17 frames. ], batch size: 608, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:12:00,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1320852.0, ans=0.0 2023-06-22 21:12:21,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1320912.0, ans=0.0 2023-06-22 21:12:23,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1320912.0, ans=0.95 2023-06-22 21:12:29,163 INFO [train.py:996] (3/4) Epoch 8, batch 6700, loss[loss=0.3077, simple_loss=0.3561, pruned_loss=0.1297, over 21448.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2975, pruned_loss=0.07844, over 4273189.46 frames. ], batch size: 509, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:13:08,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.668e+02 3.822e+02 4.568e+02 6.367e+02 1.164e+03, threshold=9.137e+02, percent-clipped=1.0 2023-06-22 21:13:16,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1321092.0, ans=0.125 2023-06-22 21:13:37,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1321152.0, ans=0.0 2023-06-22 21:13:45,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1321212.0, ans=0.125 2023-06-22 21:13:48,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1321212.0, ans=0.125 2023-06-22 21:14:02,856 INFO [train.py:996] (3/4) Epoch 8, batch 6750, loss[loss=0.2454, simple_loss=0.3075, pruned_loss=0.09165, over 21818.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2948, pruned_loss=0.07825, over 4271026.31 frames. ], batch size: 333, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:14:40,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321392.0, ans=0.1 2023-06-22 21:14:57,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1321392.0, ans=22.5 2023-06-22 21:15:37,244 INFO [train.py:996] (3/4) Epoch 8, batch 6800, loss[loss=0.2591, simple_loss=0.3097, pruned_loss=0.1043, over 21260.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2959, pruned_loss=0.08021, over 4281526.29 frames. 
], batch size: 159, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:15:54,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1321572.0, ans=0.1 2023-06-22 21:15:59,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1321632.0, ans=0.0 2023-06-22 21:16:07,729 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:16:10,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1321632.0, ans=0.0 2023-06-22 21:16:15,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1321692.0, ans=0.125 2023-06-22 21:16:16,527 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.656e+02 6.234e+02 8.844e+02 1.935e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-22 21:16:40,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1321752.0, ans=0.125 2023-06-22 21:17:04,207 INFO [train.py:996] (3/4) Epoch 8, batch 6850, loss[loss=0.2058, simple_loss=0.2706, pruned_loss=0.07048, over 21770.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2961, pruned_loss=0.08035, over 4276957.38 frames. ], batch size: 300, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:17:52,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1321992.0, ans=0.2 2023-06-22 21:17:54,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-22 21:18:05,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1322052.0, ans=0.125 2023-06-22 21:18:08,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1322052.0, ans=0.125 2023-06-22 21:18:29,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1322112.0, ans=0.125 2023-06-22 21:18:48,718 INFO [train.py:996] (3/4) Epoch 8, batch 6900, loss[loss=0.1969, simple_loss=0.2652, pruned_loss=0.06432, over 21820.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2987, pruned_loss=0.08044, over 4282790.46 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:18:56,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1322172.0, ans=0.125 2023-06-22 21:19:11,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322232.0, ans=0.1 2023-06-22 21:19:20,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=15.0 2023-06-22 21:19:34,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 4.516e+02 6.241e+02 9.290e+02 1.863e+03, threshold=1.248e+03, percent-clipped=14.0 2023-06-22 21:20:00,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1322352.0, ans=0.2 2023-06-22 21:20:33,615 INFO [train.py:996] (3/4) Epoch 8, batch 6950, loss[loss=0.2575, simple_loss=0.3309, pruned_loss=0.09203, over 21691.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3016, pruned_loss=0.07728, over 4287519.27 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:20:51,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1322532.0, ans=0.125 2023-06-22 21:21:09,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1322592.0, ans=0.125 2023-06-22 21:21:25,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-22 21:22:12,754 INFO [train.py:996] (3/4) Epoch 8, batch 7000, loss[loss=0.2044, simple_loss=0.2689, pruned_loss=0.07, over 21737.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3027, pruned_loss=0.07858, over 4294984.79 frames. ], batch size: 317, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:22:54,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.125e+02 4.702e+02 7.118e+02 1.401e+03, threshold=9.403e+02, percent-clipped=1.0 2023-06-22 21:23:26,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322952.0, ans=0.1 2023-06-22 21:23:45,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1323012.0, ans=0.0 2023-06-22 21:23:51,474 INFO [train.py:996] (3/4) Epoch 8, batch 7050, loss[loss=0.2445, simple_loss=0.3285, pruned_loss=0.08022, over 21608.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3007, pruned_loss=0.07835, over 4279141.89 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:24:51,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-22 21:25:01,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1323252.0, ans=0.2 2023-06-22 21:25:31,387 INFO [train.py:996] (3/4) Epoch 8, batch 7100, loss[loss=0.2624, simple_loss=0.3399, pruned_loss=0.09249, over 21694.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3061, pruned_loss=0.07999, over 4285759.63 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:25:33,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. limit=10.0 2023-06-22 21:26:12,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.059e+02 5.159e+02 6.413e+02 1.166e+03, threshold=1.032e+03, percent-clipped=5.0 2023-06-22 21:26:28,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. 
limit=15.0 2023-06-22 21:26:42,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1323552.0, ans=0.125 2023-06-22 21:26:47,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1323552.0, ans=0.2 2023-06-22 21:26:52,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1323552.0, ans=0.0 2023-06-22 21:26:58,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1323612.0, ans=0.125 2023-06-22 21:27:15,334 INFO [train.py:996] (3/4) Epoch 8, batch 7150, loss[loss=0.2729, simple_loss=0.3397, pruned_loss=0.103, over 21351.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3042, pruned_loss=0.0782, over 4284247.38 frames. ], batch size: 549, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:27:23,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1323672.0, ans=0.05 2023-06-22 21:27:40,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1323732.0, ans=0.125 2023-06-22 21:27:40,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1323732.0, ans=0.0 2023-06-22 21:27:43,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1323732.0, ans=0.125 2023-06-22 21:28:04,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1323792.0, ans=0.125 2023-06-22 21:28:05,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1323792.0, ans=0.125 2023-06-22 21:28:32,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1323912.0, ans=0.0 2023-06-22 21:28:37,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1323912.0, ans=0.035 2023-06-22 21:28:54,619 INFO [train.py:996] (3/4) Epoch 8, batch 7200, loss[loss=0.2306, simple_loss=0.3029, pruned_loss=0.07911, over 21835.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3059, pruned_loss=0.07965, over 4279172.21 frames. ], batch size: 98, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:28:58,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1323972.0, ans=0.125 2023-06-22 21:29:33,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1324092.0, ans=0.0 2023-06-22 21:29:35,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.797e+02 4.643e+02 6.367e+02 8.701e+02 1.653e+03, threshold=1.273e+03, percent-clipped=12.0 2023-06-22 21:29:52,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1324152.0, ans=0.125 2023-06-22 21:30:27,778 INFO [train.py:996] (3/4) Epoch 8, batch 7250, loss[loss=0.2251, simple_loss=0.2782, pruned_loss=0.086, over 21332.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3037, pruned_loss=0.08106, over 4273221.79 frames. 
], batch size: 473, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:30:48,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1324332.0, ans=0.0 2023-06-22 21:31:23,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1324392.0, ans=0.2 2023-06-22 21:31:23,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1324392.0, ans=0.125 2023-06-22 21:32:03,420 INFO [train.py:996] (3/4) Epoch 8, batch 7300, loss[loss=0.2128, simple_loss=0.2738, pruned_loss=0.07586, over 21519.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2981, pruned_loss=0.07997, over 4272706.53 frames. ], batch size: 391, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:32:07,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1324572.0, ans=0.07 2023-06-22 21:32:48,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324692.0, ans=0.1 2023-06-22 21:32:49,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.041e+02 5.063e+02 7.441e+02 1.428e+03, threshold=1.013e+03, percent-clipped=2.0 2023-06-22 21:33:42,995 INFO [train.py:996] (3/4) Epoch 8, batch 7350, loss[loss=0.2581, simple_loss=0.3211, pruned_loss=0.09752, over 21319.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.295, pruned_loss=0.07976, over 4264954.37 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:33:48,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1324872.0, ans=0.125 2023-06-22 21:33:58,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-22 21:34:39,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-22 21:34:48,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1325052.0, ans=0.125 2023-06-22 21:35:10,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1325112.0, ans=0.125 2023-06-22 21:35:18,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325112.0, ans=0.1 2023-06-22 21:35:24,097 INFO [train.py:996] (3/4) Epoch 8, batch 7400, loss[loss=0.2093, simple_loss=0.2661, pruned_loss=0.07624, over 20707.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3, pruned_loss=0.08298, over 4264795.96 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:36:10,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 4.086e+02 4.943e+02 6.567e+02 1.302e+03, threshold=9.886e+02, percent-clipped=5.0 2023-06-22 21:36:43,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1325412.0, ans=0.0 2023-06-22 21:36:47,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-06-22 21:36:56,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1325472.0, ans=0.0 2023-06-22 21:36:58,154 INFO [train.py:996] (3/4) Epoch 8, batch 7450, loss[loss=0.2299, simple_loss=0.2951, pruned_loss=0.08231, over 21634.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2992, pruned_loss=0.08188, over 4272286.94 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:37:00,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1325472.0, ans=0.125 2023-06-22 21:38:18,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1325652.0, ans=0.125 2023-06-22 21:38:24,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1325712.0, ans=0.0 2023-06-22 21:38:38,903 INFO [train.py:996] (3/4) Epoch 8, batch 7500, loss[loss=0.2325, simple_loss=0.3165, pruned_loss=0.07423, over 21221.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3038, pruned_loss=0.08279, over 4274603.09 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:39:35,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.495e+02 6.774e+02 8.927e+02 1.705e+03, threshold=1.355e+03, percent-clipped=18.0 2023-06-22 21:39:46,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1325952.0, ans=0.125 2023-06-22 21:39:52,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1325952.0, ans=0.125 2023-06-22 21:39:57,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1325952.0, ans=0.125 2023-06-22 21:40:05,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1326012.0, ans=0.125 2023-06-22 21:40:11,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-22 21:40:23,990 INFO [train.py:996] (3/4) Epoch 8, batch 7550, loss[loss=0.2466, simple_loss=0.3465, pruned_loss=0.07333, over 20026.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3113, pruned_loss=0.08189, over 4260841.44 frames. 
], batch size: 702, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:40:49,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1326132.0, ans=0.125 2023-06-22 21:40:51,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1326132.0, ans=0.125 2023-06-22 21:41:43,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1326312.0, ans=10.0 2023-06-22 21:41:48,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1326312.0, ans=0.125 2023-06-22 21:41:49,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1326312.0, ans=0.125 2023-06-22 21:41:50,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326312.0, ans=0.1 2023-06-22 21:41:58,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1326312.0, ans=0.125 2023-06-22 21:42:00,772 INFO [train.py:996] (3/4) Epoch 8, batch 7600, loss[loss=0.2565, simple_loss=0.317, pruned_loss=0.09798, over 21586.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3115, pruned_loss=0.08138, over 4271099.83 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:42:01,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1326372.0, ans=0.0 2023-06-22 21:42:02,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1326372.0, ans=0.025 2023-06-22 21:42:10,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326372.0, ans=0.1 2023-06-22 21:42:25,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1326432.0, ans=0.2 2023-06-22 21:42:50,584 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.183e+02 4.607e+02 6.581e+02 9.726e+02 1.530e+03, threshold=1.316e+03, percent-clipped=7.0 2023-06-22 21:42:50,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1326492.0, ans=0.125 2023-06-22 21:43:10,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-06-22 21:43:26,448 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:43:38,606 INFO [train.py:996] (3/4) Epoch 8, batch 7650, loss[loss=0.2253, simple_loss=0.2971, pruned_loss=0.07673, over 21760.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3106, pruned_loss=0.08177, over 4281857.40 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:43:39,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=15.0 2023-06-22 21:43:50,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1326672.0, ans=0.125 2023-06-22 21:43:57,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1326672.0, ans=0.125 2023-06-22 21:44:11,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1326732.0, ans=22.5 2023-06-22 21:44:34,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1326792.0, ans=0.0 2023-06-22 21:44:36,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1326792.0, ans=0.125 2023-06-22 21:44:42,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-06-22 21:44:49,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1326852.0, ans=0.125 2023-06-22 21:44:50,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1326852.0, ans=0.0 2023-06-22 21:44:57,809 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:45:18,099 INFO [train.py:996] (3/4) Epoch 8, batch 7700, loss[loss=0.2807, simple_loss=0.343, pruned_loss=0.1092, over 21408.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3131, pruned_loss=0.08478, over 4282596.24 frames. ], batch size: 131, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:45:30,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 21:45:50,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1327032.0, ans=0.125 2023-06-22 21:46:04,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.847e+02 4.781e+02 6.112e+02 1.345e+03, threshold=9.563e+02, percent-clipped=1.0 2023-06-22 21:46:17,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1327152.0, ans=0.125 2023-06-22 21:46:58,537 INFO [train.py:996] (3/4) Epoch 8, batch 7750, loss[loss=0.284, simple_loss=0.3858, pruned_loss=0.09106, over 21864.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.318, pruned_loss=0.08479, over 4282736.67 frames. ], batch size: 372, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:47:43,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1327392.0, ans=0.2 2023-06-22 21:47:59,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=15.0 2023-06-22 21:48:00,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1327452.0, ans=0.0 2023-06-22 21:48:36,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1327572.0, ans=0.125 2023-06-22 21:48:42,259 INFO [train.py:996] (3/4) Epoch 8, batch 7800, loss[loss=0.1793, simple_loss=0.2321, pruned_loss=0.06326, over 21703.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3201, pruned_loss=0.08521, over 4271911.68 frames. ], batch size: 124, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:48:44,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1327572.0, ans=0.125 2023-06-22 21:49:19,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.676e+02 6.316e+02 9.000e+02 2.015e+03, threshold=1.263e+03, percent-clipped=20.0 2023-06-22 21:49:36,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1327752.0, ans=0.125 2023-06-22 21:49:56,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1327812.0, ans=0.125 2023-06-22 21:50:02,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327812.0, ans=0.1 2023-06-22 21:50:03,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1327812.0, ans=0.125 2023-06-22 21:50:14,999 INFO [train.py:996] (3/4) Epoch 8, batch 7850, loss[loss=0.2333, simple_loss=0.291, pruned_loss=0.08782, over 21539.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3155, pruned_loss=0.08467, over 4261970.38 frames. ], batch size: 391, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:50:16,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-22 21:50:16,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.83 vs. limit=5.0 2023-06-22 21:50:46,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1327932.0, ans=0.0 2023-06-22 21:51:01,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1327992.0, ans=0.2 2023-06-22 21:51:10,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1328052.0, ans=0.0 2023-06-22 21:51:14,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1328052.0, ans=0.125 2023-06-22 21:51:49,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1328112.0, ans=0.125 2023-06-22 21:52:00,779 INFO [train.py:996] (3/4) Epoch 8, batch 7900, loss[loss=0.2223, simple_loss=0.3495, pruned_loss=0.04751, over 19825.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3102, pruned_loss=0.08354, over 4262924.07 frames. ], batch size: 702, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:52:03,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-06-22 21:52:44,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.830e+02 4.891e+02 7.312e+02 1.897e+03, threshold=9.781e+02, percent-clipped=5.0 2023-06-22 21:53:05,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1328352.0, ans=0.125 2023-06-22 21:53:07,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-22 21:53:08,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1328352.0, ans=0.125 2023-06-22 21:53:08,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1328352.0, ans=10.0 2023-06-22 21:53:30,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1328412.0, ans=0.0 2023-06-22 21:53:41,616 INFO [train.py:996] (3/4) Epoch 8, batch 7950, loss[loss=0.2426, simple_loss=0.3221, pruned_loss=0.0816, over 21429.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3147, pruned_loss=0.08318, over 4267397.54 frames. ], batch size: 194, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:53:52,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.33 vs. limit=22.5 2023-06-22 21:55:03,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1328652.0, ans=0.0 2023-06-22 21:55:06,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1328712.0, ans=0.125 2023-06-22 21:55:24,221 INFO [train.py:996] (3/4) Epoch 8, batch 8000, loss[loss=0.2103, simple_loss=0.2577, pruned_loss=0.08145, over 20033.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3182, pruned_loss=0.08592, over 4269932.03 frames. ], batch size: 704, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:55:41,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1328772.0, ans=0.2 2023-06-22 21:55:59,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1328832.0, ans=0.1 2023-06-22 21:56:25,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.345e+02 4.453e+02 5.766e+02 9.258e+02 3.143e+03, threshold=1.153e+03, percent-clipped=22.0 2023-06-22 21:56:40,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1328952.0, ans=0.125 2023-06-22 21:57:10,673 INFO [train.py:996] (3/4) Epoch 8, batch 8050, loss[loss=0.2223, simple_loss=0.3042, pruned_loss=0.07027, over 21787.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3219, pruned_loss=0.08614, over 4271305.57 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:57:29,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1329072.0, ans=0.125 2023-06-22 21:58:32,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. 
limit=22.5 2023-06-22 21:58:56,231 INFO [train.py:996] (3/4) Epoch 8, batch 8100, loss[loss=0.2327, simple_loss=0.2999, pruned_loss=0.08274, over 21784.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3206, pruned_loss=0.08715, over 4277134.96 frames. ], batch size: 247, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:59:39,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1329492.0, ans=0.0 2023-06-22 21:59:47,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.065e+02 4.714e+02 7.153e+02 1.179e+03 2.402e+03, threshold=1.431e+03, percent-clipped=27.0 2023-06-22 22:00:43,217 INFO [train.py:996] (3/4) Epoch 8, batch 8150, loss[loss=0.2337, simple_loss=0.3349, pruned_loss=0.06623, over 21792.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3296, pruned_loss=0.08902, over 4277464.79 frames. ], batch size: 352, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:00:45,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1329672.0, ans=0.125 2023-06-22 22:01:10,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-22 22:01:12,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-22 22:01:23,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1329792.0, ans=0.1 2023-06-22 22:01:26,493 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:01:51,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 22:01:58,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1329912.0, ans=10.0 2023-06-22 22:02:22,141 INFO [train.py:996] (3/4) Epoch 8, batch 8200, loss[loss=0.1892, simple_loss=0.2448, pruned_loss=0.06677, over 21167.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3218, pruned_loss=0.08671, over 4272899.13 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:02:24,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1329972.0, ans=0.0 2023-06-22 22:02:34,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.41 vs. limit=10.0 2023-06-22 22:03:02,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.983e+02 6.789e+02 1.065e+03 2.564e+03, threshold=1.358e+03, percent-clipped=14.0 2023-06-22 22:03:43,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1330212.0, ans=0.125 2023-06-22 22:04:02,054 INFO [train.py:996] (3/4) Epoch 8, batch 8250, loss[loss=0.2006, simple_loss=0.2841, pruned_loss=0.05853, over 21334.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3199, pruned_loss=0.08585, over 4278956.98 frames. 
], batch size: 131, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:04:28,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-22 22:05:38,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1330512.0, ans=0.2 2023-06-22 22:05:41,318 INFO [train.py:996] (3/4) Epoch 8, batch 8300, loss[loss=0.2395, simple_loss=0.3304, pruned_loss=0.0743, over 21206.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3178, pruned_loss=0.08329, over 4274428.23 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:05:44,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-22 22:06:15,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1330692.0, ans=0.125 2023-06-22 22:06:27,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.942e+02 4.371e+02 5.179e+02 7.811e+02 1.980e+03, threshold=1.036e+03, percent-clipped=4.0 2023-06-22 22:07:10,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1330812.0, ans=0.125 2023-06-22 22:07:17,069 INFO [train.py:996] (3/4) Epoch 8, batch 8350, loss[loss=0.2201, simple_loss=0.3022, pruned_loss=0.06897, over 21668.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3165, pruned_loss=0.08106, over 4273841.82 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:07:17,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1330872.0, ans=0.125 2023-06-22 22:07:43,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1330932.0, ans=0.125 2023-06-22 22:08:19,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1331052.0, ans=0.2 2023-06-22 22:08:43,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-22 22:08:56,590 INFO [train.py:996] (3/4) Epoch 8, batch 8400, loss[loss=0.2348, simple_loss=0.3075, pruned_loss=0.08108, over 21403.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3139, pruned_loss=0.0783, over 4276310.57 frames. ], batch size: 194, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:09:02,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1331172.0, ans=0.035 2023-06-22 22:09:19,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1331232.0, ans=0.125 2023-06-22 22:09:22,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1331232.0, ans=0.125 2023-06-22 22:09:35,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.911e+02 4.503e+02 6.126e+02 1.860e+03, threshold=9.006e+02, percent-clipped=8.0 2023-06-22 22:10:32,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. 
limit=22.5 2023-06-22 22:10:34,278 INFO [train.py:996] (3/4) Epoch 8, batch 8450, loss[loss=0.2272, simple_loss=0.2942, pruned_loss=0.08013, over 21735.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3105, pruned_loss=0.0781, over 4281924.89 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:10:37,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1331472.0, ans=0.0 2023-06-22 22:10:40,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-22 22:10:58,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-22 22:12:12,530 INFO [train.py:996] (3/4) Epoch 8, batch 8500, loss[loss=0.2192, simple_loss=0.2759, pruned_loss=0.08123, over 21531.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3067, pruned_loss=0.07921, over 4280047.01 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:12:37,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1331832.0, ans=0.04949747468305833 2023-06-22 22:12:58,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.125e+02 4.200e+02 5.679e+02 8.112e+02 1.673e+03, threshold=1.136e+03, percent-clipped=13.0 2023-06-22 22:13:43,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. limit=10.0 2023-06-22 22:13:44,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1332012.0, ans=0.125 2023-06-22 22:13:54,924 INFO [train.py:996] (3/4) Epoch 8, batch 8550, loss[loss=0.2762, simple_loss=0.3801, pruned_loss=0.08612, over 21250.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3135, pruned_loss=0.08262, over 4286494.78 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:14:13,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332132.0, ans=0.1 2023-06-22 22:15:13,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1332252.0, ans=0.0 2023-06-22 22:15:35,655 INFO [train.py:996] (3/4) Epoch 8, batch 8600, loss[loss=0.3056, simple_loss=0.3666, pruned_loss=0.1223, over 21309.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3207, pruned_loss=0.08531, over 4282047.12 frames. 
], batch size: 143, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:15:50,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1332432.0, ans=0.0 2023-06-22 22:16:04,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1332432.0, ans=0.0 2023-06-22 22:16:32,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.201e+02 4.143e+02 4.841e+02 5.659e+02 1.807e+03, threshold=9.683e+02, percent-clipped=7.0 2023-06-22 22:16:47,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1332552.0, ans=0.125 2023-06-22 22:16:57,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-22 22:17:15,218 INFO [train.py:996] (3/4) Epoch 8, batch 8650, loss[loss=0.2022, simple_loss=0.2858, pruned_loss=0.0593, over 21106.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3256, pruned_loss=0.08514, over 4285242.67 frames. ], batch size: 143, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:17:42,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1332732.0, ans=0.125 2023-06-22 22:18:53,768 INFO [train.py:996] (3/4) Epoch 8, batch 8700, loss[loss=0.199, simple_loss=0.278, pruned_loss=0.05998, over 15281.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.318, pruned_loss=0.08216, over 4270533.60 frames. ], batch size: 61, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:18:54,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1332972.0, ans=0.125 2023-06-22 22:19:35,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333092.0, ans=0.125 2023-06-22 22:19:38,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1333092.0, ans=0.125 2023-06-22 22:19:48,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.793e+02 4.063e+02 5.741e+02 9.934e+02 1.995e+03, threshold=1.148e+03, percent-clipped=26.0 2023-06-22 22:20:06,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1333152.0, ans=0.125 2023-06-22 22:20:32,228 INFO [train.py:996] (3/4) Epoch 8, batch 8750, loss[loss=0.2387, simple_loss=0.3111, pruned_loss=0.08318, over 21878.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3148, pruned_loss=0.08271, over 4271979.01 frames. ], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:21:43,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1333452.0, ans=0.125 2023-06-22 22:21:45,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-22 22:21:49,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1333452.0, ans=0.125 2023-06-22 22:22:16,514 INFO [train.py:996] (3/4) Epoch 8, batch 8800, loss[loss=0.3406, simple_loss=0.4085, pruned_loss=0.1364, over 21443.00 frames. 
], tot_loss[loss=0.2489, simple_loss=0.326, pruned_loss=0.08588, over 4271096.98 frames. ], batch size: 507, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:22:18,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1333572.0, ans=0.125 2023-06-22 22:22:54,231 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:23:03,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1333692.0, ans=0.125 2023-06-22 22:23:05,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1333692.0, ans=0.2 2023-06-22 22:23:07,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.331e+02 4.564e+02 6.230e+02 9.935e+02 2.348e+03, threshold=1.246e+03, percent-clipped=15.0 2023-06-22 22:23:35,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-22 22:23:50,302 INFO [train.py:996] (3/4) Epoch 8, batch 8850, loss[loss=0.2238, simple_loss=0.3228, pruned_loss=0.0624, over 21295.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3325, pruned_loss=0.08827, over 4272777.62 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:23:55,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1333872.0, ans=0.125 2023-06-22 22:24:00,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1333872.0, ans=0.125 2023-06-22 22:24:00,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1333872.0, ans=0.125 2023-06-22 22:24:22,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1333932.0, ans=0.2 2023-06-22 22:25:24,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334172.0, ans=0.125 2023-06-22 22:25:26,317 INFO [train.py:996] (3/4) Epoch 8, batch 8900, loss[loss=0.215, simple_loss=0.2966, pruned_loss=0.06674, over 21754.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3254, pruned_loss=0.08667, over 4265993.98 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:26:20,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.551e+02 5.330e+02 7.944e+02 2.391e+03, threshold=1.066e+03, percent-clipped=3.0 2023-06-22 22:27:00,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1334412.0, ans=0.5 2023-06-22 22:27:11,650 INFO [train.py:996] (3/4) Epoch 8, batch 8950, loss[loss=0.2276, simple_loss=0.2929, pruned_loss=0.08115, over 21596.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3256, pruned_loss=0.08511, over 4267478.88 frames. 
], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:27:35,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1334532.0, ans=0.2 2023-06-22 22:27:42,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1334532.0, ans=0.025 2023-06-22 22:27:56,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1334592.0, ans=0.125 2023-06-22 22:27:58,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1334592.0, ans=0.125 2023-06-22 22:28:20,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334652.0, ans=0.125 2023-06-22 22:28:36,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1334712.0, ans=0.125 2023-06-22 22:28:50,966 INFO [train.py:996] (3/4) Epoch 8, batch 9000, loss[loss=0.2391, simple_loss=0.2955, pruned_loss=0.0913, over 21617.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3192, pruned_loss=0.08455, over 4270993.52 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:28:50,967 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-22 22:29:12,148 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2658, simple_loss=0.3603, pruned_loss=0.0856, over 1796401.00 frames. 2023-06-22 22:29:12,149 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-22 22:29:19,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1334772.0, ans=0.0 2023-06-22 22:29:36,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1334832.0, ans=0.125 2023-06-22 22:29:52,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1334892.0, ans=0.2 2023-06-22 22:30:00,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.159e+02 6.404e+02 9.275e+02 1.956e+03, threshold=1.281e+03, percent-clipped=15.0 2023-06-22 22:30:19,620 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:30:29,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1335012.0, ans=0.125 2023-06-22 22:30:41,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-22 22:30:43,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1335012.0, ans=0.125 2023-06-22 22:30:51,404 INFO [train.py:996] (3/4) Epoch 8, batch 9050, loss[loss=0.2126, simple_loss=0.2954, pruned_loss=0.06491, over 21331.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3149, pruned_loss=0.08091, over 4270018.30 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:31:20,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. 
limit=15.0 2023-06-22 22:32:09,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335252.0, ans=0.1 2023-06-22 22:32:13,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1335252.0, ans=0.0 2023-06-22 22:32:14,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1335252.0, ans=0.2 2023-06-22 22:32:24,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1335312.0, ans=0.2 2023-06-22 22:32:26,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1335312.0, ans=0.125 2023-06-22 22:32:32,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-22 22:32:33,985 INFO [train.py:996] (3/4) Epoch 8, batch 9100, loss[loss=0.2212, simple_loss=0.3158, pruned_loss=0.06326, over 21586.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3208, pruned_loss=0.08312, over 4273397.74 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:33:15,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.14 vs. limit=6.0 2023-06-22 22:33:32,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.374e+02 5.511e+02 8.272e+02 1.713e+03, threshold=1.102e+03, percent-clipped=4.0 2023-06-22 22:34:01,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1335612.0, ans=0.2 2023-06-22 22:34:06,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1335612.0, ans=0.035 2023-06-22 22:34:15,745 INFO [train.py:996] (3/4) Epoch 8, batch 9150, loss[loss=0.291, simple_loss=0.3854, pruned_loss=0.09824, over 21578.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3253, pruned_loss=0.08086, over 4271472.73 frames. ], batch size: 471, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:34:29,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1335672.0, ans=0.0 2023-06-22 22:34:40,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1335732.0, ans=0.0 2023-06-22 22:35:22,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-22 22:35:42,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1335912.0, ans=0.125 2023-06-22 22:35:56,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1335912.0, ans=0.125 2023-06-22 22:36:01,113 INFO [train.py:996] (3/4) Epoch 8, batch 9200, loss[loss=0.2576, simple_loss=0.3433, pruned_loss=0.08599, over 21268.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3261, pruned_loss=0.07942, over 4274798.87 frames. 
], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:36:16,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1336032.0, ans=0.0 2023-06-22 22:36:40,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1336032.0, ans=0.125 2023-06-22 22:36:59,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.939e+02 4.377e+02 5.436e+02 8.538e+02 1.737e+03, threshold=1.087e+03, percent-clipped=12.0 2023-06-22 22:37:22,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1336212.0, ans=0.125 2023-06-22 22:37:40,977 INFO [train.py:996] (3/4) Epoch 8, batch 9250, loss[loss=0.2422, simple_loss=0.3646, pruned_loss=0.05987, over 19712.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3268, pruned_loss=0.08278, over 4277690.70 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:38:46,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-22 22:39:02,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1336512.0, ans=0.125 2023-06-22 22:39:16,201 INFO [train.py:996] (3/4) Epoch 8, batch 9300, loss[loss=0.2319, simple_loss=0.28, pruned_loss=0.0919, over 21343.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3224, pruned_loss=0.08392, over 4271563.44 frames. ], batch size: 177, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:39:16,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1336572.0, ans=0.0 2023-06-22 22:39:19,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1336572.0, ans=10.0 2023-06-22 22:39:25,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1336572.0, ans=0.125 2023-06-22 22:40:11,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.867e+02 5.198e+02 7.448e+02 1.175e+03 2.635e+03, threshold=1.490e+03, percent-clipped=31.0 2023-06-22 22:40:13,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1336752.0, ans=0.2 2023-06-22 22:40:20,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1336752.0, ans=0.0 2023-06-22 22:40:27,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1336752.0, ans=0.0 2023-06-22 22:40:51,247 INFO [train.py:996] (3/4) Epoch 8, batch 9350, loss[loss=0.2685, simple_loss=0.3383, pruned_loss=0.09928, over 21484.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3278, pruned_loss=0.08511, over 4265397.94 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:41:44,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=12.0 2023-06-22 22:42:36,750 INFO [train.py:996] (3/4) Epoch 8, batch 9400, loss[loss=0.2203, simple_loss=0.2899, pruned_loss=0.07539, over 21762.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3278, pruned_loss=0.08518, over 4273391.26 frames. 
], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:42:53,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1337172.0, ans=0.125 2023-06-22 22:43:20,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1337292.0, ans=0.1 2023-06-22 22:43:32,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.546e+02 6.111e+02 8.751e+02 2.078e+03, threshold=1.222e+03, percent-clipped=3.0 2023-06-22 22:43:42,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1337352.0, ans=0.125 2023-06-22 22:43:57,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 22:44:16,702 INFO [train.py:996] (3/4) Epoch 8, batch 9450, loss[loss=0.1929, simple_loss=0.2609, pruned_loss=0.06241, over 21672.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3201, pruned_loss=0.08458, over 4269780.83 frames. ], batch size: 282, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:44:17,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1337472.0, ans=0.125 2023-06-22 22:44:53,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-22 22:45:07,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1337592.0, ans=0.125 2023-06-22 22:45:08,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-22 22:45:54,745 INFO [train.py:996] (3/4) Epoch 8, batch 9500, loss[loss=0.2175, simple_loss=0.2798, pruned_loss=0.07761, over 22008.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3109, pruned_loss=0.08164, over 4260352.56 frames. ], batch size: 375, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:46:08,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. 
limit=15.0 2023-06-22 22:46:11,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1337772.0, ans=0.0 2023-06-22 22:46:16,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1337832.0, ans=0.5 2023-06-22 22:46:17,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1337832.0, ans=0.0 2023-06-22 22:46:32,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1337892.0, ans=0.125 2023-06-22 22:46:45,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1337892.0, ans=0.2 2023-06-22 22:46:50,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.640e+02 7.713e+02 1.096e+03 2.487e+03, threshold=1.543e+03, percent-clipped=16.0 2023-06-22 22:47:05,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-22 22:47:34,313 INFO [train.py:996] (3/4) Epoch 8, batch 9550, loss[loss=0.2549, simple_loss=0.3286, pruned_loss=0.09058, over 21745.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3156, pruned_loss=0.0839, over 4265971.73 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:47:42,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1338072.0, ans=0.07 2023-06-22 22:47:57,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1338132.0, ans=0.0 2023-06-22 22:49:14,078 INFO [train.py:996] (3/4) Epoch 8, batch 9600, loss[loss=0.2316, simple_loss=0.3008, pruned_loss=0.08123, over 21686.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3172, pruned_loss=0.08514, over 4270978.14 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:49:22,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1338372.0, ans=0.125 2023-06-22 22:49:57,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1338492.0, ans=0.0 2023-06-22 22:50:00,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338492.0, ans=0.1 2023-06-22 22:50:03,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.133e+02 5.747e+02 7.464e+02 1.666e+03, threshold=1.149e+03, percent-clipped=1.0 2023-06-22 22:50:12,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1338552.0, ans=0.125 2023-06-22 22:50:29,012 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:50:49,764 INFO [train.py:996] (3/4) Epoch 8, batch 9650, loss[loss=0.2726, simple_loss=0.3344, pruned_loss=0.1054, over 21723.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3169, pruned_loss=0.08478, over 4278088.34 frames. 
], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:51:45,863 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:51:58,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1338852.0, ans=0.125 2023-06-22 22:52:17,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1338912.0, ans=0.125 2023-06-22 22:52:28,539 INFO [train.py:996] (3/4) Epoch 8, batch 9700, loss[loss=0.2589, simple_loss=0.3309, pruned_loss=0.09344, over 21766.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3206, pruned_loss=0.0851, over 4276623.40 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:52:41,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338972.0, ans=0.1 2023-06-22 22:53:03,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339092.0, ans=0.1 2023-06-22 22:53:04,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1339092.0, ans=0.0 2023-06-22 22:53:18,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.568e+02 6.321e+02 8.796e+02 1.656e+03, threshold=1.264e+03, percent-clipped=3.0 2023-06-22 22:53:57,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1339212.0, ans=0.125 2023-06-22 22:54:05,573 INFO [train.py:996] (3/4) Epoch 8, batch 9750, loss[loss=0.2403, simple_loss=0.3578, pruned_loss=0.06137, over 20833.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3149, pruned_loss=0.08314, over 4271211.56 frames. ], batch size: 608, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:54:07,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1339272.0, ans=0.125 2023-06-22 22:55:07,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1339452.0, ans=10.0 2023-06-22 22:55:11,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1339512.0, ans=0.2 2023-06-22 22:55:20,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1339512.0, ans=0.05 2023-06-22 22:55:42,074 INFO [train.py:996] (3/4) Epoch 8, batch 9800, loss[loss=0.2339, simple_loss=0.3025, pruned_loss=0.08262, over 21882.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3143, pruned_loss=0.08365, over 4270929.78 frames. ], batch size: 371, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:55:50,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1339572.0, ans=0.125 2023-06-22 22:56:31,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 3.646e+02 4.309e+02 6.187e+02 1.699e+03, threshold=8.618e+02, percent-clipped=3.0 2023-06-22 22:56:59,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. 
limit=15.0 2023-06-22 22:57:19,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-22 22:57:19,959 INFO [train.py:996] (3/4) Epoch 8, batch 9850, loss[loss=0.223, simple_loss=0.2818, pruned_loss=0.08213, over 21677.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3104, pruned_loss=0.08379, over 4272120.74 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:57:39,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-22 22:58:06,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1339992.0, ans=10.0 2023-06-22 22:58:12,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1340052.0, ans=0.95 2023-06-22 22:58:39,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1340112.0, ans=0.1 2023-06-22 22:58:51,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1340112.0, ans=0.0 2023-06-22 22:58:54,068 INFO [train.py:996] (3/4) Epoch 8, batch 9900, loss[loss=0.2787, simple_loss=0.3483, pruned_loss=0.1045, over 21693.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3063, pruned_loss=0.08245, over 4265314.07 frames. ], batch size: 351, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:59:39,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1340292.0, ans=0.125 2023-06-22 22:59:45,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.510e+02 5.793e+02 9.115e+02 1.830e+03, threshold=1.159e+03, percent-clipped=29.0 2023-06-22 22:59:47,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1340352.0, ans=0.125 2023-06-22 23:00:25,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-22 23:00:33,416 INFO [train.py:996] (3/4) Epoch 8, batch 9950, loss[loss=0.2654, simple_loss=0.3563, pruned_loss=0.08725, over 19960.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3099, pruned_loss=0.08517, over 4261150.23 frames. ], batch size: 703, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:00:44,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1340472.0, ans=0.2 2023-06-22 23:01:16,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1340592.0, ans=0.125 2023-06-22 23:01:25,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1340652.0, ans=0.125 2023-06-22 23:02:13,484 INFO [train.py:996] (3/4) Epoch 8, batch 10000, loss[loss=0.2259, simple_loss=0.2966, pruned_loss=0.07765, over 21741.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3044, pruned_loss=0.08292, over 4265534.66 frames. 
], batch size: 352, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:02:46,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1340832.0, ans=0.0 2023-06-22 23:03:01,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1340892.0, ans=0.2 2023-06-22 23:03:05,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.495e+02 6.092e+02 8.521e+02 2.124e+03, threshold=1.218e+03, percent-clipped=12.0 2023-06-22 23:03:11,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1340952.0, ans=0.2 2023-06-22 23:03:54,440 INFO [train.py:996] (3/4) Epoch 8, batch 10050, loss[loss=0.1806, simple_loss=0.2676, pruned_loss=0.04685, over 16792.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3071, pruned_loss=0.08361, over 4270323.07 frames. ], batch size: 60, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:04:21,765 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:04:32,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341192.0, ans=0.1 2023-06-22 23:04:45,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1341192.0, ans=0.0 2023-06-22 23:05:32,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1341372.0, ans=0.125 2023-06-22 23:05:33,490 INFO [train.py:996] (3/4) Epoch 8, batch 10100, loss[loss=0.2027, simple_loss=0.2549, pruned_loss=0.07521, over 20211.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3039, pruned_loss=0.0814, over 4274868.74 frames. ], batch size: 703, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:05:49,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1341372.0, ans=0.125 2023-06-22 23:06:36,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 23:06:40,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.964e+02 4.513e+02 5.773e+02 8.039e+02 1.456e+03, threshold=1.155e+03, percent-clipped=7.0 2023-06-22 23:06:48,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341552.0, ans=0.1 2023-06-22 23:07:16,816 INFO [train.py:996] (3/4) Epoch 8, batch 10150, loss[loss=0.2633, simple_loss=0.3265, pruned_loss=0.1001, over 21434.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3111, pruned_loss=0.08454, over 4271297.29 frames. 
], batch size: 194, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:07:20,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1341672.0, ans=0.125 2023-06-22 23:07:33,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:08:01,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1341792.0, ans=0.2 2023-06-22 23:08:39,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=15.0 2023-06-22 23:08:50,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1341912.0, ans=0.05 2023-06-22 23:08:55,274 INFO [train.py:996] (3/4) Epoch 8, batch 10200, loss[loss=0.23, simple_loss=0.3099, pruned_loss=0.07503, over 21699.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.311, pruned_loss=0.08274, over 4265184.24 frames. ], batch size: 298, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:09:52,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342092.0, ans=0.1 2023-06-22 23:09:59,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.693e+02 4.053e+02 5.323e+02 7.160e+02 1.292e+03, threshold=1.065e+03, percent-clipped=4.0 2023-06-22 23:10:23,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-22 23:10:35,041 INFO [train.py:996] (3/4) Epoch 8, batch 10250, loss[loss=0.2903, simple_loss=0.365, pruned_loss=0.1078, over 21832.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3083, pruned_loss=0.07888, over 4259864.28 frames. ], batch size: 124, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:12:14,709 INFO [train.py:996] (3/4) Epoch 8, batch 10300, loss[loss=0.253, simple_loss=0.3354, pruned_loss=0.08533, over 21900.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3108, pruned_loss=0.07829, over 4262218.99 frames. ], batch size: 372, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:12:57,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1342632.0, ans=0.0 2023-06-22 23:13:19,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342752.0, ans=0.1 2023-06-22 23:13:20,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.156e+02 6.277e+02 8.296e+02 2.131e+03, threshold=1.255e+03, percent-clipped=15.0 2023-06-22 23:13:36,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1342752.0, ans=0.0 2023-06-22 23:13:42,685 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:14:00,719 INFO [train.py:996] (3/4) Epoch 8, batch 10350, loss[loss=0.2141, simple_loss=0.2916, pruned_loss=0.06827, over 21835.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3129, pruned_loss=0.07869, over 4257862.89 frames. 
], batch size: 317, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:14:38,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1342932.0, ans=0.1 2023-06-22 23:15:43,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1343112.0, ans=0.125 2023-06-22 23:15:51,348 INFO [train.py:996] (3/4) Epoch 8, batch 10400, loss[loss=0.2067, simple_loss=0.2746, pruned_loss=0.06943, over 21402.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3068, pruned_loss=0.07746, over 4257916.66 frames. ], batch size: 194, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:16:45,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.781e+02 6.358e+02 9.315e+02 2.129e+03, threshold=1.272e+03, percent-clipped=10.0 2023-06-22 23:16:58,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1343412.0, ans=0.125 2023-06-22 23:16:58,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1343412.0, ans=0.04949747468305833 2023-06-22 23:17:31,393 INFO [train.py:996] (3/4) Epoch 8, batch 10450, loss[loss=0.2464, simple_loss=0.3225, pruned_loss=0.08519, over 21656.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3095, pruned_loss=0.08016, over 4266553.69 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:19:09,663 INFO [train.py:996] (3/4) Epoch 8, batch 10500, loss[loss=0.2301, simple_loss=0.2909, pruned_loss=0.08462, over 21297.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3087, pruned_loss=0.07996, over 4271094.64 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:19:11,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343772.0, ans=0.1 2023-06-22 23:19:14,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1343772.0, ans=0.125 2023-06-22 23:19:15,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1343772.0, ans=0.125 2023-06-22 23:19:27,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1343832.0, ans=0.0 2023-06-22 23:19:27,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-22 23:19:38,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1343832.0, ans=0.0 2023-06-22 23:20:03,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.164e+02 4.877e+02 7.502e+02 1.116e+03 2.000e+03, threshold=1.500e+03, percent-clipped=17.0 2023-06-22 23:20:51,449 INFO [train.py:996] (3/4) Epoch 8, batch 10550, loss[loss=0.2245, simple_loss=0.2936, pruned_loss=0.07771, over 21199.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.305, pruned_loss=0.07935, over 4259081.15 frames. 
], batch size: 548, lr: 3.77e-03, grad_scale: 16.0
2023-06-22 23:21:24,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1344132.0, ans=0.125
2023-06-22 23:21:31,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1344192.0, ans=0.0
2023-06-22 23:21:33,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1344192.0, ans=0.125
2023-06-22 23:21:41,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1344192.0, ans=0.05
2023-06-22 23:22:16,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1344252.0, ans=0.125
2023-06-22 23:22:18,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0
2023-06-22 23:22:43,176 INFO [train.py:996] (3/4) Epoch 8, batch 10600, loss[loss=0.2714, simple_loss=0.3644, pruned_loss=0.08921, over 21624.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3026, pruned_loss=0.07885, over 4252100.04 frames. ], batch size: 414, lr: 3.77e-03, grad_scale: 16.0
2023-06-22 23:23:49,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.057e+02 5.623e+02 8.035e+02 1.796e+03, threshold=1.125e+03, percent-clipped=5.0
2023-06-22 23:24:05,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1344552.0, ans=0.0
2023-06-22 23:24:23,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344612.0, ans=0.1
2023-06-22 23:24:30,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1344612.0, ans=0.125
2023-06-22 23:24:35,498 INFO [train.py:996] (3/4) Epoch 8, batch 10650, loss[loss=0.1939, simple_loss=0.2779, pruned_loss=0.05496, over 21699.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3046, pruned_loss=0.0773, over 4252965.20 frames. ], batch size: 298, lr: 3.77e-03, grad_scale: 16.0
2023-06-22 23:24:53,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1344672.0, ans=0.0
2023-06-22 23:25:17,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0
2023-06-22 23:25:37,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1344852.0, ans=0.125
2023-06-22 23:25:39,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1344852.0, ans=0.125
2023-06-22 23:26:30,717 INFO [train.py:996] (3/4) Epoch 8, batch 10700, loss[loss=0.198, simple_loss=0.2686, pruned_loss=0.06368, over 21403.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3035, pruned_loss=0.07772, over 4251431.37 frames. ], batch size: 194, lr: 3.77e-03, grad_scale: 16.0
2023-06-22 23:26:45,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1345032.0, ans=0.125
2023-06-22 23:26:56,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1345032.0, ans=0.125
2023-06-22 23:27:00,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1345032.0, ans=0.0
2023-06-22 23:27:28,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1345152.0, ans=0.125
2023-06-22 23:27:36,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 5.189e+02 6.885e+02 9.178e+02 1.741e+03, threshold=1.377e+03, percent-clipped=11.0
2023-06-22 23:27:47,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1345152.0, ans=0.125
2023-06-22 23:27:49,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1345152.0, ans=0.2
2023-06-22 23:27:53,967 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-22 23:28:13,086 INFO [train.py:996] (3/4) Epoch 8, batch 10750, loss[loss=0.3088, simple_loss=0.3997, pruned_loss=0.1089, over 21685.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3141, pruned_loss=0.08215, over 4258399.43 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0
2023-06-22 23:28:23,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1345272.0, ans=0.125
2023-06-22 23:28:24,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1345272.0, ans=0.0
2023-06-22 23:28:31,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1345332.0, ans=0.0
2023-06-22 23:28:48,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345332.0, ans=0.1
2023-06-22 23:29:42,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345512.0, ans=0.1
2023-06-22 23:29:55,283 INFO [train.py:996] (3/4) Epoch 8, batch 10800, loss[loss=0.2384, simple_loss=0.3159, pruned_loss=0.08046, over 19989.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3187, pruned_loss=0.08213, over 4262556.05 frames. ], batch size: 702, lr: 3.77e-03, grad_scale: 32.0
2023-06-22 23:31:04,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.247e+02 4.824e+02 6.509e+02 9.810e+02 2.428e+03, threshold=1.302e+03, percent-clipped=4.0
2023-06-22 23:31:39,579 INFO [train.py:996] (3/4) Epoch 8, batch 10850, loss[loss=0.2149, simple_loss=0.2902, pruned_loss=0.0698, over 21703.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3183, pruned_loss=0.0825, over 4268912.13 frames. ], batch size: 333, lr: 3.77e-03, grad_scale: 32.0
2023-06-22 23:31:52,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0
2023-06-22 23:33:01,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1346112.0, ans=0.125
2023-06-22 23:33:03,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346112.0, ans=0.1
2023-06-22 23:33:15,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346112.0, ans=0.1
2023-06-22 23:33:19,394 INFO [train.py:996] (3/4) Epoch 8, batch 10900, loss[loss=0.2046, simple_loss=0.2871, pruned_loss=0.06105, over 21398.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3108, pruned_loss=0.08028, over 4246799.53 frames. ], batch size: 194, lr: 3.77e-03, grad_scale: 32.0
2023-06-22 23:33:25,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1346172.0, ans=0.125
2023-06-22 23:34:19,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1346292.0, ans=0.2
2023-06-22 23:34:23,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 3.987e+02 5.547e+02 7.924e+02 1.642e+03, threshold=1.109e+03, percent-clipped=4.0
2023-06-22 23:35:00,085 INFO [train.py:996] (3/4) Epoch 8, batch 10950, loss[loss=0.2135, simple_loss=0.2814, pruned_loss=0.07281, over 21576.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3069, pruned_loss=0.07823, over 4248725.09 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:35:14,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0
2023-06-22 23:35:51,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1346592.0, ans=0.04949747468305833
2023-06-22 23:35:54,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1346592.0, ans=0.125
2023-06-22 23:36:05,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1346652.0, ans=0.0
2023-06-22 23:36:16,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1346712.0, ans=0.125
2023-06-22 23:36:38,544 INFO [train.py:996] (3/4) Epoch 8, batch 11000, loss[loss=0.2588, simple_loss=0.3219, pruned_loss=0.09789, over 21793.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3052, pruned_loss=0.07928, over 4263581.26 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:36:41,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0
2023-06-22 23:36:42,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1346772.0, ans=0.035
2023-06-22 23:36:42,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1346772.0, ans=0.0
2023-06-22 23:36:44,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0
2023-06-22 23:37:38,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1346952.0, ans=0.0
2023-06-22 23:37:43,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.878e+02 3.827e+02 4.499e+02 6.468e+02 1.217e+03, threshold=8.999e+02, percent-clipped=2.0
2023-06-22 23:38:15,768 INFO [train.py:996] (3/4) Epoch 8, batch 11050, loss[loss=0.2044, simple_loss=0.2678, pruned_loss=0.07049, over 21756.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3034, pruned_loss=0.08106, over 4273863.84 frames. ], batch size: 300, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:38:27,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1347072.0, ans=0.0
2023-06-22 23:38:28,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0
2023-06-22 23:38:36,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1347132.0, ans=0.125
2023-06-22 23:39:02,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1347192.0, ans=10.0
2023-06-22 23:39:09,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5
2023-06-22 23:39:26,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1347252.0, ans=0.125
2023-06-22 23:39:27,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347252.0, ans=0.1
2023-06-22 23:39:54,430 INFO [train.py:996] (3/4) Epoch 8, batch 11100, loss[loss=0.2366, simple_loss=0.3103, pruned_loss=0.08141, over 21756.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.303, pruned_loss=0.08037, over 4279816.56 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:40:02,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1347372.0, ans=0.0
2023-06-22 23:40:37,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1347492.0, ans=0.05
2023-06-22 23:40:48,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1347492.0, ans=0.2
2023-06-22 23:40:48,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1347492.0, ans=0.0
2023-06-22 23:41:00,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.372e+02 5.317e+02 7.818e+02 1.562e+03, threshold=1.063e+03, percent-clipped=13.0
2023-06-22 23:41:01,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347552.0, ans=0.1
2023-06-22 23:41:03,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1347552.0, ans=0.0
2023-06-22 23:41:34,681 INFO [train.py:996] (3/4) Epoch 8, batch 11150, loss[loss=0.2313, simple_loss=0.3104, pruned_loss=0.07612, over 21306.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3019, pruned_loss=0.08012, over 4270396.37 frames. ], batch size: 144, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:42:14,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1347732.0, ans=0.5
2023-06-22 23:42:35,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1347852.0, ans=0.125
2023-06-22 23:42:44,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1347852.0, ans=0.0
2023-06-22 23:42:48,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1347852.0, ans=0.125
2023-06-22 23:43:08,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1347912.0, ans=0.0
2023-06-22 23:43:15,306 INFO [train.py:996] (3/4) Epoch 8, batch 11200, loss[loss=0.2049, simple_loss=0.2726, pruned_loss=0.06861, over 21873.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3005, pruned_loss=0.07947, over 4270750.67 frames. ], batch size: 373, lr: 3.76e-03, grad_scale: 32.0
2023-06-22 23:44:03,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0
2023-06-22 23:44:09,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0
2023-06-22 23:44:19,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.266e+02 5.477e+02 7.611e+02 1.407e+03, threshold=1.095e+03, percent-clipped=4.0
2023-06-22 23:44:53,121 INFO [train.py:996] (3/4) Epoch 8, batch 11250, loss[loss=0.2547, simple_loss=0.3318, pruned_loss=0.08874, over 21774.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3006, pruned_loss=0.07951, over 4262290.00 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0
2023-06-22 23:45:38,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0
2023-06-22 23:45:56,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1348452.0, ans=0.125
2023-06-22 23:46:31,389 INFO [train.py:996] (3/4) Epoch 8, batch 11300, loss[loss=0.2192, simple_loss=0.3053, pruned_loss=0.06653, over 21753.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3008, pruned_loss=0.07969, over 4269175.24 frames. ], batch size: 391, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:46:35,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0
2023-06-22 23:47:10,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1348632.0, ans=0.125
2023-06-22 23:47:39,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 3.884e+02 4.784e+02 6.961e+02 1.768e+03, threshold=9.568e+02, percent-clipped=7.0
2023-06-22 23:47:49,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1348752.0, ans=0.0
2023-06-22 23:48:10,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1348872.0, ans=0.125
2023-06-22 23:48:11,943 INFO [train.py:996] (3/4) Epoch 8, batch 11350, loss[loss=0.281, simple_loss=0.3565, pruned_loss=0.1027, over 21736.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3012, pruned_loss=0.07949, over 4262356.30 frames. ], batch size: 332, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:48:40,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0
2023-06-22 23:48:49,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1348932.0, ans=0.0
2023-06-22 23:48:52,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1348992.0, ans=0.125
2023-06-22 23:49:02,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348992.0, ans=0.1
2023-06-22 23:49:19,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1349052.0, ans=0.0
2023-06-22 23:49:41,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1349112.0, ans=0.125
2023-06-22 23:49:54,084 INFO [train.py:996] (3/4) Epoch 8, batch 11400, loss[loss=0.2064, simple_loss=0.3047, pruned_loss=0.05401, over 20706.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3085, pruned_loss=0.08235, over 4268379.61 frames. ], batch size: 607, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:50:10,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1349172.0, ans=0.2
2023-06-22 23:50:36,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.90 vs. limit=15.0
2023-06-22 23:51:07,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.448e+02 6.055e+02 8.360e+02 1.667e+03, threshold=1.211e+03, percent-clipped=10.0
2023-06-22 23:51:39,976 INFO [train.py:996] (3/4) Epoch 8, batch 11450, loss[loss=0.2341, simple_loss=0.3117, pruned_loss=0.07827, over 21599.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3091, pruned_loss=0.08103, over 4268562.32 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:52:35,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1349592.0, ans=0.0
2023-06-22 23:53:16,106 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-22 23:53:17,205 INFO [train.py:996] (3/4) Epoch 8, batch 11500, loss[loss=0.2506, simple_loss=0.3442, pruned_loss=0.07855, over 21858.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3122, pruned_loss=0.08214, over 4266129.81 frames. ], batch size: 371, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:53:17,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1349772.0, ans=0.0
2023-06-22 23:53:24,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1349772.0, ans=0.0
2023-06-22 23:53:29,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0
2023-06-22 23:53:37,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1349832.0, ans=0.0
2023-06-22 23:53:59,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5
2023-06-22 23:54:05,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1349892.0, ans=0.125
2023-06-22 23:54:22,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.405e+02 5.880e+02 8.915e+02 1.909e+03, threshold=1.176e+03, percent-clipped=7.0
2023-06-22 23:54:25,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1349952.0, ans=0.0
2023-06-22 23:54:38,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1350012.0, ans=0.2
2023-06-22 23:55:04,812 INFO [train.py:996] (3/4) Epoch 8, batch 11550, loss[loss=0.3004, simple_loss=0.4012, pruned_loss=0.09983, over 21748.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3186, pruned_loss=0.08274, over 4267753.76 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 16.0
2023-06-22 23:55:27,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1350132.0, ans=0.0
2023-06-22 23:55:44,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0
2023-06-22 23:55:45,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1350192.0, ans=0.125
2023-06-22 23:56:11,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1350252.0, ans=0.2
2023-06-22 23:56:45,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1350372.0, ans=0.2
2023-06-22 23:56:45,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1350372.0, ans=0.125
2023-06-22 23:56:46,824 INFO [train.py:996] (3/4) Epoch 8, batch 11600, loss[loss=0.2478, simple_loss=0.3417, pruned_loss=0.07695, over 21647.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3306, pruned_loss=0.08443, over 4270677.17 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 32.0
2023-06-22 23:57:11,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1350432.0, ans=0.125
2023-06-22 23:57:14,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5
2023-06-22 23:57:49,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.035e+02 5.071e+02 7.210e+02 9.611e+02 2.245e+03, threshold=1.442e+03, percent-clipped=13.0
2023-06-22 23:57:50,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5
2023-06-22 23:58:27,168 INFO [train.py:996] (3/4) Epoch 8, batch 11650, loss[loss=0.332, simple_loss=0.4475, pruned_loss=0.1082, over 21189.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3368, pruned_loss=0.08473, over 4268859.64 frames. ], batch size: 549, lr: 3.76e-03, grad_scale: 32.0
2023-06-22 23:59:28,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1350852.0, ans=0.025
2023-06-22 23:59:40,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.78 vs. limit=10.0
2023-06-23 00:00:05,889 INFO [train.py:996] (3/4) Epoch 8, batch 11700, loss[loss=0.186, simple_loss=0.2472, pruned_loss=0.06238, over 21606.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3285, pruned_loss=0.08402, over 4274750.73 frames. ], batch size: 231, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:00:43,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1351092.0, ans=0.1
2023-06-23 00:01:08,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.484e+02 5.547e+02 7.973e+02 1.731e+03, threshold=1.109e+03, percent-clipped=2.0
2023-06-23 00:01:08,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1351152.0, ans=0.0
2023-06-23 00:01:32,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1351212.0, ans=0.125
2023-06-23 00:01:45,119 INFO [train.py:996] (3/4) Epoch 8, batch 11750, loss[loss=0.2494, simple_loss=0.3136, pruned_loss=0.09257, over 21579.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3187, pruned_loss=0.08367, over 4271510.56 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:02:11,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1351332.0, ans=0.125
2023-06-23 00:02:57,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1351452.0, ans=0.0
2023-06-23 00:03:23,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0
2023-06-23 00:03:24,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1351572.0, ans=0.125
2023-06-23 00:03:31,063 INFO [train.py:996] (3/4) Epoch 8, batch 11800, loss[loss=0.3358, simple_loss=0.4136, pruned_loss=0.129, over 21406.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3217, pruned_loss=0.08554, over 4273362.30 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:03:47,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1351632.0, ans=0.125
2023-06-23 00:04:32,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1351752.0, ans=0.0
2023-06-23 00:04:33,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.952e+02 6.755e+02 1.112e+03 2.056e+03, threshold=1.351e+03, percent-clipped=25.0
2023-06-23 00:04:54,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1351812.0, ans=0.0
2023-06-23 00:04:55,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1351812.0, ans=0.0
2023-06-23 00:05:11,118 INFO [train.py:996] (3/4) Epoch 8, batch 11850, loss[loss=0.2402, simple_loss=0.3262, pruned_loss=0.07704, over 21595.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3224, pruned_loss=0.08419, over 4275251.96 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:05:17,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1351872.0, ans=0.125
2023-06-23 00:05:24,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1351872.0, ans=0.0
2023-06-23 00:06:42,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1352112.0, ans=0.125
2023-06-23 00:06:52,029 INFO [train.py:996] (3/4) Epoch 8, batch 11900, loss[loss=0.2364, simple_loss=0.3191, pruned_loss=0.07686, over 21570.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3215, pruned_loss=0.08153, over 4277075.03 frames. ], batch size: 389, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:06:52,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1352172.0, ans=0.2
2023-06-23 00:07:14,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1352232.0, ans=0.0
2023-06-23 00:07:22,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1352232.0, ans=0.125
2023-06-23 00:07:47,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0
2023-06-23 00:08:07,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.101e+02 5.216e+02 6.925e+02 1.642e+03, threshold=1.043e+03, percent-clipped=1.0
2023-06-23 00:08:11,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1352352.0, ans=0.125
2023-06-23 00:08:16,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352352.0, ans=0.1
2023-06-23 00:08:23,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5
2023-06-23 00:08:28,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0
2023-06-23 00:08:35,006 INFO [train.py:996] (3/4) Epoch 8, batch 11950, loss[loss=0.1937, simple_loss=0.2667, pruned_loss=0.06037, over 21823.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3192, pruned_loss=0.07815, over 4275437.08 frames. ], batch size: 102, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:09:39,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352592.0, ans=0.1
2023-06-23 00:09:44,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1352652.0, ans=0.2
2023-06-23 00:09:45,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1352652.0, ans=0.02
2023-06-23 00:10:05,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1352712.0, ans=0.125
2023-06-23 00:10:13,528 INFO [train.py:996] (3/4) Epoch 8, batch 12000, loss[loss=0.2507, simple_loss=0.3016, pruned_loss=0.09987, over 21227.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.313, pruned_loss=0.07687, over 4270834.83 frames. ], batch size: 160, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:10:13,529 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 00:10:32,701 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2606, simple_loss=0.356, pruned_loss=0.08257, over 1796401.00 frames.
2023-06-23 00:10:32,702 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-23 00:11:39,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 4.103e+02 5.711e+02 8.012e+02 1.968e+03, threshold=1.142e+03, percent-clipped=13.0
2023-06-23 00:11:40,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0
2023-06-23 00:12:11,456 INFO [train.py:996] (3/4) Epoch 8, batch 12050, loss[loss=0.2452, simple_loss=0.3073, pruned_loss=0.09152, over 21399.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3115, pruned_loss=0.0788, over 4275651.43 frames. ], batch size: 159, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:12:51,277 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:13:13,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1353192.0, ans=0.125
2023-06-23 00:13:51,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1353372.0, ans=0.125
2023-06-23 00:13:53,211 INFO [train.py:996] (3/4) Epoch 8, batch 12100, loss[loss=0.3149, simple_loss=0.3783, pruned_loss=0.1257, over 21404.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3199, pruned_loss=0.08288, over 4271742.08 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 32.0
2023-06-23 00:14:47,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1353492.0, ans=0.0
2023-06-23 00:15:04,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 5.144e+02 7.244e+02 1.095e+03 2.232e+03, threshold=1.449e+03, percent-clipped=22.0
2023-06-23 00:15:45,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.80 vs. limit=6.0
2023-06-23 00:15:45,908 INFO [train.py:996] (3/4) Epoch 8, batch 12150, loss[loss=0.272, simple_loss=0.3738, pruned_loss=0.0851, over 21642.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3242, pruned_loss=0.08291, over 4271421.65 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:16:16,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1353732.0, ans=0.125
2023-06-23 00:16:33,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0
2023-06-23 00:16:39,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353852.0, ans=0.1
2023-06-23 00:17:03,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1353912.0, ans=0.125
2023-06-23 00:17:12,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0
2023-06-23 00:17:13,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0
2023-06-23 00:17:22,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353912.0, ans=0.1
2023-06-23 00:17:25,409 INFO [train.py:996] (3/4) Epoch 8, batch 12200, loss[loss=0.2101, simple_loss=0.2729, pruned_loss=0.07362, over 21654.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3196, pruned_loss=0.08229, over 4272164.05 frames. ], batch size: 333, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:17:48,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5
2023-06-23 00:17:57,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1354032.0, ans=0.2
2023-06-23 00:17:57,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1354032.0, ans=0.0
2023-06-23 00:18:21,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1354152.0, ans=0.125
2023-06-23 00:18:27,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.597e+02 6.328e+02 9.392e+02 1.574e+03, threshold=1.266e+03, percent-clipped=2.0
2023-06-23 00:18:35,559 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:18:57,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0
2023-06-23 00:18:59,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=8.0
2023-06-23 00:19:03,022 INFO [train.py:996] (3/4) Epoch 8, batch 12250, loss[loss=0.1764, simple_loss=0.2557, pruned_loss=0.04855, over 21584.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3109, pruned_loss=0.07893, over 4269211.38 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:19:08,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1354272.0, ans=0.0
2023-06-23 00:19:32,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1354332.0, ans=0.125
2023-06-23 00:20:32,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1354512.0, ans=0.0
2023-06-23 00:20:41,596 INFO [train.py:996] (3/4) Epoch 8, batch 12300, loss[loss=0.205, simple_loss=0.2971, pruned_loss=0.05648, over 21881.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3021, pruned_loss=0.07271, over 4276348.35 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:21:03,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1354632.0, ans=0.125
2023-06-23 00:21:11,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1354632.0, ans=0.125
2023-06-23 00:21:12,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1354632.0, ans=0.0
2023-06-23 00:21:14,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0
2023-06-23 00:21:34,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1354752.0, ans=0.125
2023-06-23 00:21:41,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.472e+02 4.122e+02 6.339e+02 8.293e+02 1.636e+03, threshold=1.268e+03, percent-clipped=3.0
2023-06-23 00:22:22,487 INFO [train.py:996] (3/4) Epoch 8, batch 12350, loss[loss=0.2422, simple_loss=0.3356, pruned_loss=0.07443, over 21735.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3081, pruned_loss=0.07455, over 4280436.91 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:22:25,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0
2023-06-23 00:22:39,457 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:22:40,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5
2023-06-23 00:23:05,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354992.0, ans=0.1
2023-06-23 00:23:40,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1355112.0, ans=0.05
2023-06-23 00:23:51,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1355112.0, ans=0.125
2023-06-23 00:24:01,244 INFO [train.py:996] (3/4) Epoch 8, batch 12400, loss[loss=0.2413, simple_loss=0.307, pruned_loss=0.08779, over 21814.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3108, pruned_loss=0.07837, over 4285920.67 frames. ], batch size: 391, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:24:01,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1355172.0, ans=0.0
2023-06-23 00:25:05,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1355352.0, ans=0.125
2023-06-23 00:25:07,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.565e+02 7.015e+02 1.038e+03 2.241e+03, threshold=1.403e+03, percent-clipped=10.0
2023-06-23 00:25:45,035 INFO [train.py:996] (3/4) Epoch 8, batch 12450, loss[loss=0.256, simple_loss=0.3357, pruned_loss=0.08811, over 21533.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3122, pruned_loss=0.08039, over 4283381.34 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:26:08,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0
2023-06-23 00:27:21,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0
2023-06-23 00:27:22,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0
2023-06-23 00:27:27,300 INFO [train.py:996] (3/4) Epoch 8, batch 12500, loss[loss=0.2581, simple_loss=0.3597, pruned_loss=0.07829, over 21891.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3245, pruned_loss=0.08456, over 4288084.17 frames. ], batch size: 372, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:27:44,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1355832.0, ans=0.2
2023-06-23 00:28:44,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.947e+02 7.092e+02 9.787e+02 2.648e+03, threshold=1.418e+03, percent-clipped=11.0
2023-06-23 00:28:46,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.85 vs. limit=10.0
2023-06-23 00:29:12,275 INFO [train.py:996] (3/4) Epoch 8, batch 12550, loss[loss=0.2392, simple_loss=0.3231, pruned_loss=0.07763, over 21645.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3284, pruned_loss=0.08757, over 4280824.18 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 16.0
2023-06-23 00:29:39,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1356132.0, ans=0.125
2023-06-23 00:29:49,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1356132.0, ans=0.1
2023-06-23 00:29:57,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1356132.0, ans=0.0
2023-06-23 00:30:58,473 INFO [train.py:996] (3/4) Epoch 8, batch 12600, loss[loss=0.2368, simple_loss=0.3637, pruned_loss=0.05499, over 20791.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3272, pruned_loss=0.08511, over 4272122.93 frames. ], batch size: 608, lr: 3.75e-03, grad_scale: 16.0
2023-06-23 00:31:06,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1356372.0, ans=0.125
2023-06-23 00:31:32,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1356432.0, ans=10.0
2023-06-23 00:32:06,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.323e+02 5.935e+02 8.611e+02 2.067e+03, threshold=1.187e+03, percent-clipped=5.0
2023-06-23 00:32:36,775 INFO [train.py:996] (3/4) Epoch 8, batch 12650, loss[loss=0.2536, simple_loss=0.32, pruned_loss=0.09364, over 21807.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3198, pruned_loss=0.08145, over 4281473.33 frames. ], batch size: 112, lr: 3.75e-03, grad_scale: 16.0
2023-06-23 00:32:59,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1356672.0, ans=0.04949747468305833
2023-06-23 00:33:04,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0
2023-06-23 00:33:17,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1356792.0, ans=0.035
2023-06-23 00:33:38,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1356852.0, ans=0.04949747468305833
2023-06-23 00:34:21,572 INFO [train.py:996] (3/4) Epoch 8, batch 12700, loss[loss=0.2958, simple_loss=0.3601, pruned_loss=0.1158, over 21382.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.318, pruned_loss=0.08293, over 4283708.74 frames. ], batch size: 507, lr: 3.75e-03, grad_scale: 16.0
2023-06-23 00:35:16,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1357092.0, ans=0.05
2023-06-23 00:35:25,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.987e+02 4.487e+02 5.893e+02 8.124e+02 1.594e+03, threshold=1.179e+03, percent-clipped=3.0
2023-06-23 00:35:59,892 INFO [train.py:996] (3/4) Epoch 8, batch 12750, loss[loss=0.2439, simple_loss=0.3243, pruned_loss=0.08175, over 21772.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3198, pruned_loss=0.08248, over 4287787.34 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 16.0
2023-06-23 00:36:14,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0
2023-06-23 00:36:19,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1357332.0, ans=0.0
2023-06-23 00:36:27,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0
2023-06-23 00:36:45,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1357392.0, ans=0.125
2023-06-23 00:36:49,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0
2023-06-23 00:37:19,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0
2023-06-23 00:37:30,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1357512.0, ans=0.125
2023-06-23 00:37:42,783 INFO [train.py:996] (3/4) Epoch 8, batch 12800, loss[loss=0.2434, simple_loss=0.3228, pruned_loss=0.08195, over 21386.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3205, pruned_loss=0.084, over 4289209.42 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:37:56,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1357572.0, ans=0.125
2023-06-23 00:38:05,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1357632.0, ans=0.0
2023-06-23 00:38:42,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.784e+02 6.128e+02 7.998e+02 1.838e+03, threshold=1.226e+03, percent-clipped=10.0
2023-06-23 00:39:18,522 INFO [train.py:996] (3/4) Epoch 8, batch 12850, loss[loss=0.2241, simple_loss=0.3337, pruned_loss=0.05727, over 19970.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3227, pruned_loss=0.08567, over 4287399.39 frames. ], batch size: 703, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:39:32,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0
2023-06-23 00:40:45,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1358112.0, ans=0.125
2023-06-23 00:40:59,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1358112.0, ans=0.025
2023-06-23 00:41:01,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1358112.0, ans=0.0
2023-06-23 00:41:03,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5
2023-06-23 00:41:03,879 INFO [train.py:996] (3/4) Epoch 8, batch 12900, loss[loss=0.2116, simple_loss=0.2825, pruned_loss=0.07039, over 21195.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3224, pruned_loss=0.0825, over 4279884.22 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:41:20,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1358232.0, ans=0.0
2023-06-23 00:41:38,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0
2023-06-23 00:42:12,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.070e+02 5.539e+02 8.932e+02 2.008e+03, threshold=1.108e+03, percent-clipped=7.0
2023-06-23 00:42:16,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1358352.0, ans=0.1
2023-06-23 00:42:20,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1358352.0, ans=0.125
2023-06-23 00:42:41,236 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:42:43,886 INFO [train.py:996] (3/4) Epoch 8, batch 12950, loss[loss=0.247, simple_loss=0.3228, pruned_loss=0.08554, over 21700.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3209, pruned_loss=0.0808, over 4276586.03 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:42:49,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358472.0, ans=0.1
2023-06-23 00:42:53,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5
2023-06-23 00:42:57,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0
2023-06-23 00:43:20,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5
2023-06-23 00:43:54,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1358652.0, ans=0.2
2023-06-23 00:44:11,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1358712.0, ans=0.125
2023-06-23 00:44:20,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1358712.0, ans=0.0
2023-06-23 00:44:24,111 INFO [train.py:996] (3/4) Epoch 8, batch 13000, loss[loss=0.2572, simple_loss=0.3347, pruned_loss=0.08985, over 21699.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3221, pruned_loss=0.08151, over 4280633.10 frames. ], batch size: 415, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:45:14,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1358892.0, ans=0.125
2023-06-23 00:45:24,792 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:45:30,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 4.812e+02 8.002e+02 1.036e+03 2.306e+03, threshold=1.600e+03, percent-clipped=23.0
2023-06-23 00:45:38,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1359012.0, ans=0.95
2023-06-23 00:46:00,831 INFO [train.py:996] (3/4) Epoch 8, batch 13050, loss[loss=0.2331, simple_loss=0.3016, pruned_loss=0.08229, over 21666.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.316, pruned_loss=0.07925, over 4284951.15 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:46:04,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1359072.0, ans=0.125
2023-06-23 00:46:06,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1359072.0, ans=0.1
2023-06-23 00:46:08,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1359072.0, ans=0.125
2023-06-23 00:46:12,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1359072.0, ans=0.0
2023-06-23 00:46:34,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0
2023-06-23 00:46:39,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1359192.0, ans=0.125
2023-06-23 00:46:58,728 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 00:47:29,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1359312.0, ans=0.035
2023-06-23 00:47:31,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1359312.0, ans=0.04949747468305833
2023-06-23 00:47:39,002 INFO [train.py:996] (3/4) Epoch 8, batch 13100, loss[loss=0.2394, simple_loss=0.3193, pruned_loss=0.07978, over 21763.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3144, pruned_loss=0.07929, over 4281631.78 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:48:53,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 4.152e+02 4.803e+02 6.425e+02 1.389e+03, threshold=9.605e+02, percent-clipped=0.0
2023-06-23 00:49:10,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1359612.0, ans=0.125
2023-06-23 00:49:23,639 INFO [train.py:996] (3/4) Epoch 8, batch 13150, loss[loss=0.2552, simple_loss=0.3277, pruned_loss=0.0913, over 21377.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3157, pruned_loss=0.081, over 4282758.08 frames. ], batch size: 548, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:49:24,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359672.0, ans=0.125
2023-06-23 00:50:22,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.98 vs. limit=10.0
2023-06-23 00:51:08,205 INFO [train.py:996] (3/4) Epoch 8, batch 13200, loss[loss=0.2672, simple_loss=0.3351, pruned_loss=0.09968, over 21212.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3152, pruned_loss=0.08183, over 4288451.81 frames. ], batch size: 143, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:51:13,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359972.0, ans=0.1
2023-06-23 00:51:35,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1360032.0, ans=0.125
2023-06-23 00:51:38,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1360032.0, ans=0.125
2023-06-23 00:51:46,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1360092.0, ans=0.125
2023-06-23 00:51:58,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1360092.0, ans=0.0
2023-06-23 00:52:03,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5
2023-06-23 00:52:13,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.899e+02 4.724e+02 6.289e+02 8.620e+02 1.453e+03, threshold=1.258e+03, percent-clipped=16.0
2023-06-23 00:52:18,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0
2023-06-23 00:52:45,205 INFO [train.py:996] (3/4) Epoch 8, batch 13250, loss[loss=0.2163, simple_loss=0.2942, pruned_loss=0.06917, over 21412.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.315, pruned_loss=0.08331, over 4292931.39 frames. ], batch size: 194, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:52:47,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1360272.0, ans=0.1
2023-06-23 00:53:05,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1360332.0, ans=0.125
2023-06-23 00:54:01,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5
2023-06-23 00:54:26,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1360512.0, ans=0.125
2023-06-23 00:54:31,040 INFO [train.py:996] (3/4) Epoch 8, batch 13300, loss[loss=0.2739, simple_loss=0.3406, pruned_loss=0.1036, over 21316.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3183, pruned_loss=0.0829, over 4291222.46 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0
2023-06-23 00:54:38,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1360572.0, ans=0.125
2023-06-23 00:54:41,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1360572.0, ans=0.0
2023-06-23 00:55:14,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1360692.0, ans=0.0
2023-06-23 00:55:38,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1360752.0, ans=0.1
2023-06-23 00:55:41,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.712e+02 5.675e+02 7.796e+02 1.493e+03, threshold=1.135e+03, percent-clipped=5.0
2023-06-23 00:56:03,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5
2023-06-23 00:56:11,970 INFO [train.py:996] (3/4) Epoch 8, batch 13350, loss[loss=0.2402, simple_loss=0.3248, pruned_loss=0.07775, over 20674.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3226, pruned_loss=0.08557, over 4287374.42 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 32.0
2023-06-23 00:56:33,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1360932.0, ans=0.1
2023-06-23 00:57:46,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1361112.0, ans=0.125
2023-06-23 00:57:57,187 INFO [train.py:996] (3/4) Epoch 8, batch 13400, loss[loss=0.3362, simple_loss=0.3806, pruned_loss=0.1459, over 21458.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3243, pruned_loss=0.08712, over 4293735.17 frames. ], batch size: 507, lr: 3.74e-03, grad_scale: 32.0
2023-06-23 00:58:12,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361232.0, ans=0.1
2023-06-23 00:59:05,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.523e+02 5.899e+02 7.481e+02 1.405e+03, threshold=1.180e+03, percent-clipped=3.0
2023-06-23 00:59:12,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5
2023-06-23 00:59:14,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1361412.0, ans=0.125
2023-06-23 00:59:33,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1361412.0, ans=0.125
2023-06-23 00:59:36,199 INFO [train.py:996] (3/4) Epoch 8, batch 13450, loss[loss=0.3257, simple_loss=0.3719, pruned_loss=0.1397, over 21451.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3255, pruned_loss=0.08929, over 4291490.73 frames. ], batch size: 509, lr: 3.74e-03, grad_scale: 32.0
2023-06-23 00:59:46,115 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 01:00:16,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361592.0, ans=0.1
2023-06-23 01:01:14,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1361772.0, ans=0.0
2023-06-23 01:01:15,841 INFO [train.py:996] (3/4) Epoch 8, batch 13500, loss[loss=0.2266, simple_loss=0.299, pruned_loss=0.07712, over 21706.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3175, pruned_loss=0.08612, over 4285477.24 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:01:27,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1361772.0, ans=0.125
2023-06-23 01:01:54,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1361832.0, ans=0.125
2023-06-23 01:02:35,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.391e+02 6.972e+02 1.115e+03 2.286e+03, threshold=1.394e+03, percent-clipped=24.0
2023-06-23 01:02:48,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1362012.0, ans=0.2
2023-06-23 01:02:57,276 INFO [train.py:996] (3/4) Epoch 8, batch 13550, loss[loss=0.2503, simple_loss=0.3248, pruned_loss=0.08786, over 21793.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3222, pruned_loss=0.08504, over 4288850.04 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 8.0
2023-06-23 01:03:18,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0
2023-06-23 01:03:40,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1362192.0, ans=0.125
2023-06-23 01:04:04,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1362252.0, ans=0.0
2023-06-23 01:04:12,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1362252.0, ans=0.125
2023-06-23 01:04:17,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362312.0, ans=0.1
2023-06-23 01:04:31,397 INFO [train.py:996] (3/4) Epoch 8, batch 13600, loss[loss=0.2292, simple_loss=0.2962, pruned_loss=0.08108, over 21525.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3234, pruned_loss=0.08601, over 4290727.38 frames. ], batch size: 211, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:05:02,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5
2023-06-23 01:05:14,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1362432.0, ans=0.2
2023-06-23 01:05:44,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1362552.0, ans=0.0
2023-06-23 01:05:47,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.391e+02 6.199e+02 8.558e+02 2.268e+03, threshold=1.240e+03, percent-clipped=7.0
2023-06-23 01:06:09,014 INFO [train.py:996] (3/4) Epoch 8, batch 13650, loss[loss=0.2052, simple_loss=0.2727, pruned_loss=0.06885, over 21848.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3157, pruned_loss=0.08188, over 4284578.94 frames. ], batch size: 118, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:07:05,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1362852.0, ans=0.035
2023-06-23 01:07:07,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1362852.0, ans=0.04949747468305833
2023-06-23 01:07:43,690 INFO [train.py:996] (3/4) Epoch 8, batch 13700, loss[loss=0.2677, simple_loss=0.3449, pruned_loss=0.09528, over 21654.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3102, pruned_loss=0.08131, over 4273577.88 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:08:01,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1362972.0, ans=0.0
2023-06-23 01:09:00,901 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.296e+02 7.505e+02 1.141e+03 2.334e+03, threshold=1.501e+03, percent-clipped=22.0
2023-06-23 01:09:01,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1363152.0, ans=0.0
2023-06-23 01:09:31,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363272.0, ans=0.125
2023-06-23 01:09:32,541 INFO [train.py:996] (3/4) Epoch 8, batch 13750, loss[loss=0.3027, simple_loss=0.3727, pruned_loss=0.1163, over 21499.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3085, pruned_loss=0.08097, over 4263914.38 frames. ], batch size: 508, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:10:08,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1363332.0, ans=0.125
2023-06-23 01:10:22,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1363392.0, ans=0.0
2023-06-23 01:10:27,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1363392.0, ans=0.2
2023-06-23 01:11:20,754 INFO [train.py:996] (3/4) Epoch 8, batch 13800, loss[loss=0.3408, simple_loss=0.4324, pruned_loss=0.1246, over 21492.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3126, pruned_loss=0.08036, over 4257627.55 frames. ], batch size: 507, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:11:35,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1363632.0, ans=0.2
2023-06-23 01:11:35,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1363632.0, ans=0.125
2023-06-23 01:11:52,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363632.0, ans=0.1
2023-06-23 01:11:55,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0
2023-06-23 01:12:08,707 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 01:12:37,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.245e+02 4.845e+02 7.419e+02 1.036e+03 2.562e+03, threshold=1.484e+03, percent-clipped=7.0
2023-06-23 01:13:00,658 INFO [train.py:996] (3/4) Epoch 8, batch 13850, loss[loss=0.2793, simple_loss=0.3665, pruned_loss=0.09605, over 21272.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3208, pruned_loss=0.08173, over 4261357.44 frames. ], batch size: 548, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:13:41,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1363992.0, ans=0.07
2023-06-23 01:13:42,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363992.0, ans=0.1
2023-06-23 01:14:34,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1364112.0, ans=0.125
2023-06-23 01:14:39,104 INFO [train.py:996] (3/4) Epoch 8, batch 13900, loss[loss=0.2666, simple_loss=0.3193, pruned_loss=0.1069, over 21129.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3259, pruned_loss=0.08605, over 4266915.98 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0
2023-06-23 01:15:03,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1364232.0, ans=0.0
2023-06-23 01:15:39,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1364352.0, ans=0.2
2023-06-23 01:15:52,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5
2023-06-23 01:15:55,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.266e+02 5.471e+02 7.768e+02 2.129e+03, threshold=1.094e+03, percent-clipped=1.0
2023-06-23 01:16:17,071 INFO [train.py:996] (3/4) Epoch 8, batch 13950, loss[loss=0.1896, simple_loss=0.2641, pruned_loss=0.05761, over 20829.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3262, pruned_loss=0.08791, over 4273162.23 frames.
], batch size: 608, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:16:20,949 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:16:27,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1364472.0, ans=0.125 2023-06-23 01:16:36,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1364532.0, ans=0.0 2023-06-23 01:16:50,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-23 01:17:04,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1364592.0, ans=0.0 2023-06-23 01:17:21,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-23 01:17:29,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1364652.0, ans=0.125 2023-06-23 01:17:53,961 INFO [train.py:996] (3/4) Epoch 8, batch 14000, loss[loss=0.2203, simple_loss=0.324, pruned_loss=0.05836, over 21706.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3231, pruned_loss=0.08561, over 4282131.90 frames. ], batch size: 389, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:18:46,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1364892.0, ans=0.07 2023-06-23 01:19:04,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1364952.0, ans=0.2 2023-06-23 01:19:08,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 4.329e+02 5.834e+02 8.040e+02 1.947e+03, threshold=1.167e+03, percent-clipped=14.0 2023-06-23 01:19:09,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1364952.0, ans=0.5 2023-06-23 01:19:30,088 INFO [train.py:996] (3/4) Epoch 8, batch 14050, loss[loss=0.251, simple_loss=0.3324, pruned_loss=0.08476, over 21554.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3184, pruned_loss=0.08197, over 4279963.88 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:19:58,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1365132.0, ans=22.5 2023-06-23 01:21:12,118 INFO [train.py:996] (3/4) Epoch 8, batch 14100, loss[loss=0.2262, simple_loss=0.3474, pruned_loss=0.05246, over 20755.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.313, pruned_loss=0.08209, over 4275722.26 frames. 
], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:21:28,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1365432.0, ans=0.0 2023-06-23 01:21:39,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1365432.0, ans=0.125 2023-06-23 01:21:42,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1365432.0, ans=0.125 2023-06-23 01:21:49,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1365432.0, ans=12.0 2023-06-23 01:21:51,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365492.0, ans=0.1 2023-06-23 01:22:24,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 4.771e+02 6.447e+02 8.696e+02 1.773e+03, threshold=1.289e+03, percent-clipped=8.0 2023-06-23 01:22:43,438 INFO [train.py:996] (3/4) Epoch 8, batch 14150, loss[loss=0.2095, simple_loss=0.2938, pruned_loss=0.06265, over 21183.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3153, pruned_loss=0.08222, over 4263126.02 frames. ], batch size: 176, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:22:59,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1365672.0, ans=0.0 2023-06-23 01:23:37,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365792.0, ans=0.1 2023-06-23 01:23:41,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1365792.0, ans=0.125 2023-06-23 01:23:58,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1365852.0, ans=0.0 2023-06-23 01:24:19,878 INFO [train.py:996] (3/4) Epoch 8, batch 14200, loss[loss=0.2055, simple_loss=0.2755, pruned_loss=0.06777, over 21582.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3154, pruned_loss=0.08113, over 4258558.32 frames. 
], batch size: 230, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:24:21,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1365972.0, ans=0.2 2023-06-23 01:24:23,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1365972.0, ans=0.05 2023-06-23 01:25:19,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1366152.0, ans=0.95 2023-06-23 01:25:32,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.024e+02 4.323e+02 5.337e+02 8.028e+02 2.442e+03, threshold=1.067e+03, percent-clipped=5.0 2023-06-23 01:25:40,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366212.0, ans=0.1 2023-06-23 01:25:42,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1366212.0, ans=0.0 2023-06-23 01:25:42,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1366212.0, ans=0.0 2023-06-23 01:25:43,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366212.0, ans=0.1 2023-06-23 01:25:47,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-23 01:25:51,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1366212.0, ans=0.125 2023-06-23 01:25:57,607 INFO [train.py:996] (3/4) Epoch 8, batch 14250, loss[loss=0.2352, simple_loss=0.3141, pruned_loss=0.07818, over 21420.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3097, pruned_loss=0.08063, over 4268563.53 frames. ], batch size: 508, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:26:01,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-23 01:26:06,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-23 01:27:22,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366512.0, ans=0.1 2023-06-23 01:27:35,836 INFO [train.py:996] (3/4) Epoch 8, batch 14300, loss[loss=0.218, simple_loss=0.2892, pruned_loss=0.07341, over 21186.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.31, pruned_loss=0.07958, over 4266646.83 frames. ], batch size: 176, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:27:37,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1366572.0, ans=0.025 2023-06-23 01:27:48,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.65 vs. 
limit=15.0 2023-06-23 01:28:00,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1366632.0, ans=0.0 2023-06-23 01:28:10,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-23 01:28:20,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1366692.0, ans=0.125 2023-06-23 01:28:34,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1366692.0, ans=0.2 2023-06-23 01:28:54,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.415e+02 6.422e+02 1.030e+03 2.040e+03, threshold=1.284e+03, percent-clipped=23.0 2023-06-23 01:29:10,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366812.0, ans=0.1 2023-06-23 01:29:13,115 INFO [train.py:996] (3/4) Epoch 8, batch 14350, loss[loss=0.2459, simple_loss=0.3581, pruned_loss=0.06683, over 19706.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3136, pruned_loss=0.07929, over 4262381.27 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:29:32,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1366932.0, ans=0.05 2023-06-23 01:29:45,610 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:30:18,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1367052.0, ans=0.09899494936611666 2023-06-23 01:30:34,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1367112.0, ans=0.0 2023-06-23 01:30:35,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1367112.0, ans=0.0 2023-06-23 01:30:47,760 INFO [train.py:996] (3/4) Epoch 8, batch 14400, loss[loss=0.2371, simple_loss=0.2919, pruned_loss=0.09113, over 21831.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3119, pruned_loss=0.07973, over 4258809.56 frames. ], batch size: 98, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:31:09,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. 
limit=22.5 2023-06-23 01:31:16,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1367232.0, ans=0.125 2023-06-23 01:31:44,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1367352.0, ans=0.125 2023-06-23 01:31:47,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1367352.0, ans=0.125 2023-06-23 01:31:56,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.154e+02 4.970e+02 6.969e+02 1.897e+03, threshold=9.939e+02, percent-clipped=6.0 2023-06-23 01:32:18,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1367472.0, ans=0.125 2023-06-23 01:32:19,342 INFO [train.py:996] (3/4) Epoch 8, batch 14450, loss[loss=0.2116, simple_loss=0.2736, pruned_loss=0.07476, over 21490.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3069, pruned_loss=0.08001, over 4256210.10 frames. ], batch size: 212, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:33:32,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1367652.0, ans=0.5 2023-06-23 01:33:39,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1367652.0, ans=0.125 2023-06-23 01:33:40,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1367652.0, ans=0.015 2023-06-23 01:34:04,029 INFO [train.py:996] (3/4) Epoch 8, batch 14500, loss[loss=0.2329, simple_loss=0.2905, pruned_loss=0.08771, over 21662.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3035, pruned_loss=0.08032, over 4255075.07 frames. ], batch size: 416, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:34:11,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-23 01:34:22,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1367832.0, ans=10.0 2023-06-23 01:34:49,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1367892.0, ans=0.0 2023-06-23 01:34:52,543 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:34:55,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1367892.0, ans=0.125 2023-06-23 01:35:12,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1367952.0, ans=0.95 2023-06-23 01:35:21,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.784e+02 6.137e+02 8.722e+02 1.642e+03, threshold=1.227e+03, percent-clipped=18.0 2023-06-23 01:35:45,094 INFO [train.py:996] (3/4) Epoch 8, batch 14550, loss[loss=0.2519, simple_loss=0.3068, pruned_loss=0.09846, over 20051.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3073, pruned_loss=0.08194, over 4260221.49 frames. 
], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:35:57,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1368072.0, ans=0.125 2023-06-23 01:36:20,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1368132.0, ans=0.0 2023-06-23 01:36:52,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1368252.0, ans=0.125 2023-06-23 01:36:58,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-23 01:37:23,762 INFO [train.py:996] (3/4) Epoch 8, batch 14600, loss[loss=0.2503, simple_loss=0.3375, pruned_loss=0.0815, over 21870.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3164, pruned_loss=0.0865, over 4268103.87 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:37:38,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1368432.0, ans=0.125 2023-06-23 01:38:38,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.393e+02 5.466e+02 7.760e+02 1.223e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-23 01:39:02,813 INFO [train.py:996] (3/4) Epoch 8, batch 14650, loss[loss=0.1816, simple_loss=0.2707, pruned_loss=0.04628, over 21612.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3176, pruned_loss=0.0852, over 4275282.00 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:39:08,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-23 01:40:33,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-06-23 01:40:41,934 INFO [train.py:996] (3/4) Epoch 8, batch 14700, loss[loss=0.1869, simple_loss=0.2602, pruned_loss=0.0568, over 21419.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3117, pruned_loss=0.07863, over 4266861.11 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:41:06,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1369032.0, ans=0.125 2023-06-23 01:41:47,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1369152.0, ans=0.125 2023-06-23 01:41:55,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1369152.0, ans=0.95 2023-06-23 01:42:00,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 5.470e+02 7.461e+02 1.083e+03 1.858e+03, threshold=1.492e+03, percent-clipped=24.0 2023-06-23 01:42:07,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369212.0, ans=0.1 2023-06-23 01:42:18,573 INFO [train.py:996] (3/4) Epoch 8, batch 14750, loss[loss=0.308, simple_loss=0.3689, pruned_loss=0.1236, over 21207.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3157, pruned_loss=0.08127, over 4262838.41 frames. 
], batch size: 143, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:42:31,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369272.0, ans=0.1 2023-06-23 01:42:35,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1369272.0, ans=0.0 2023-06-23 01:43:34,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1369452.0, ans=0.0 2023-06-23 01:43:35,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1369452.0, ans=0.0 2023-06-23 01:43:50,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1369512.0, ans=0.07 2023-06-23 01:44:00,030 INFO [train.py:996] (3/4) Epoch 8, batch 14800, loss[loss=0.2348, simple_loss=0.3165, pruned_loss=0.07651, over 19999.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3305, pruned_loss=0.08857, over 4267454.63 frames. ], batch size: 702, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:44:26,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-23 01:44:41,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1369692.0, ans=0.125 2023-06-23 01:45:02,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1369752.0, ans=0.0 2023-06-23 01:45:05,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1369752.0, ans=0.125 2023-06-23 01:45:18,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.734e+02 8.027e+02 1.112e+03 2.200e+03, threshold=1.605e+03, percent-clipped=5.0 2023-06-23 01:45:32,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1369812.0, ans=0.07 2023-06-23 01:45:41,497 INFO [train.py:996] (3/4) Epoch 8, batch 14850, loss[loss=0.2241, simple_loss=0.292, pruned_loss=0.07804, over 21527.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3247, pruned_loss=0.08865, over 4265247.68 frames. ], batch size: 230, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:45:42,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-23 01:47:23,697 INFO [train.py:996] (3/4) Epoch 8, batch 14900, loss[loss=0.2717, simple_loss=0.3268, pruned_loss=0.1083, over 21484.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3273, pruned_loss=0.09037, over 4264885.88 frames. 
], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:47:49,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370232.0, ans=0.1 2023-06-23 01:48:14,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1370292.0, ans=0.125 2023-06-23 01:48:41,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.642e+02 5.851e+02 8.319e+02 1.860e+03, threshold=1.170e+03, percent-clipped=1.0 2023-06-23 01:48:44,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-23 01:49:04,625 INFO [train.py:996] (3/4) Epoch 8, batch 14950, loss[loss=0.2349, simple_loss=0.294, pruned_loss=0.08786, over 20064.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3268, pruned_loss=0.08913, over 4268093.48 frames. ], batch size: 702, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:50:40,418 INFO [train.py:996] (3/4) Epoch 8, batch 15000, loss[loss=0.2411, simple_loss=0.3105, pruned_loss=0.08588, over 21280.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3295, pruned_loss=0.09066, over 4268017.45 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:50:40,419 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 01:51:00,729 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2539, simple_loss=0.3505, pruned_loss=0.07863, over 1796401.00 frames. 2023-06-23 01:51:00,730 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-23 01:51:44,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1370892.0, ans=0.125 2023-06-23 01:51:54,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-23 01:52:21,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.233e+02 4.412e+02 5.546e+02 7.158e+02 1.443e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-23 01:52:22,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-23 01:52:30,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1371012.0, ans=0.125 2023-06-23 01:52:43,829 INFO [train.py:996] (3/4) Epoch 8, batch 15050, loss[loss=0.2352, simple_loss=0.3153, pruned_loss=0.07755, over 21615.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3308, pruned_loss=0.09053, over 4258110.08 frames. ], batch size: 230, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:53:11,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371132.0, ans=0.1 2023-06-23 01:53:24,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371192.0, ans=0.1 2023-06-23 01:54:16,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1371312.0, ans=0.125 2023-06-23 01:54:29,747 INFO [train.py:996] (3/4) Epoch 8, batch 15100, loss[loss=0.3105, simple_loss=0.3773, pruned_loss=0.1219, over 21568.00 frames. 
], tot_loss[loss=0.2577, simple_loss=0.3333, pruned_loss=0.09102, over 4257764.83 frames. ], batch size: 414, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:54:42,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1371372.0, ans=0.125 2023-06-23 01:54:44,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371372.0, ans=0.125 2023-06-23 01:55:24,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371492.0, ans=0.125 2023-06-23 01:55:50,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.664e+02 5.163e+02 6.983e+02 1.038e+03 2.377e+03, threshold=1.397e+03, percent-clipped=16.0 2023-06-23 01:56:06,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1371612.0, ans=0.0 2023-06-23 01:56:13,543 INFO [train.py:996] (3/4) Epoch 8, batch 15150, loss[loss=0.205, simple_loss=0.2728, pruned_loss=0.06866, over 21994.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3281, pruned_loss=0.09051, over 4260748.01 frames. ], batch size: 103, lr: 3.73e-03, grad_scale: 4.0 2023-06-23 01:56:25,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1371672.0, ans=0.0 2023-06-23 01:56:55,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1371792.0, ans=0.125 2023-06-23 01:57:09,156 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:57:26,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1371912.0, ans=0.2 2023-06-23 01:57:48,557 INFO [train.py:996] (3/4) Epoch 8, batch 15200, loss[loss=0.1883, simple_loss=0.2761, pruned_loss=0.05026, over 21722.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3205, pruned_loss=0.08698, over 4255180.60 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:59:10,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 4.454e+02 6.348e+02 1.099e+03 2.249e+03, threshold=1.270e+03, percent-clipped=13.0 2023-06-23 01:59:23,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1372212.0, ans=0.0 2023-06-23 01:59:23,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1372212.0, ans=0.2 2023-06-23 01:59:29,112 INFO [train.py:996] (3/4) Epoch 8, batch 15250, loss[loss=0.2204, simple_loss=0.2785, pruned_loss=0.08112, over 21371.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3147, pruned_loss=0.08648, over 4264957.56 frames. 
], batch size: 131, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:59:43,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372332.0, ans=0.1 2023-06-23 01:59:43,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1372332.0, ans=0.2 2023-06-23 01:59:45,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1372332.0, ans=0.125 2023-06-23 02:00:13,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372392.0, ans=0.1 2023-06-23 02:01:09,064 INFO [train.py:996] (3/4) Epoch 8, batch 15300, loss[loss=0.263, simple_loss=0.3286, pruned_loss=0.09869, over 21282.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3183, pruned_loss=0.08862, over 4258629.69 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:01:14,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1372572.0, ans=0.0 2023-06-23 02:01:38,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1372632.0, ans=0.2 2023-06-23 02:01:46,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1372692.0, ans=0.2 2023-06-23 02:01:48,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372692.0, ans=0.1 2023-06-23 02:02:34,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=15.0 2023-06-23 02:02:34,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 4.847e+02 6.325e+02 8.051e+02 1.474e+03, threshold=1.265e+03, percent-clipped=2.0 2023-06-23 02:02:48,584 INFO [train.py:996] (3/4) Epoch 8, batch 15350, loss[loss=0.2279, simple_loss=0.3255, pruned_loss=0.0652, over 21634.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3225, pruned_loss=0.09037, over 4267566.63 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:02:56,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1372872.0, ans=0.0 2023-06-23 02:03:00,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372872.0, ans=0.1 2023-06-23 02:04:27,052 INFO [train.py:996] (3/4) Epoch 8, batch 15400, loss[loss=0.2368, simple_loss=0.3139, pruned_loss=0.07979, over 21903.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3238, pruned_loss=0.08905, over 4272764.04 frames. 
], batch size: 107, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:04:35,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373172.0, ans=0.1 2023-06-23 02:05:07,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1373292.0, ans=0.2 2023-06-23 02:05:42,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.574e+02 6.918e+02 9.277e+02 1.952e+03, threshold=1.384e+03, percent-clipped=9.0 2023-06-23 02:06:06,506 INFO [train.py:996] (3/4) Epoch 8, batch 15450, loss[loss=0.2823, simple_loss=0.3711, pruned_loss=0.09676, over 21585.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3218, pruned_loss=0.08853, over 4261262.14 frames. ], batch size: 471, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:06:13,332 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:06:16,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1373472.0, ans=0.0 2023-06-23 02:06:47,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1373592.0, ans=0.0 2023-06-23 02:06:53,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1373592.0, ans=0.125 2023-06-23 02:07:36,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373712.0, ans=0.1 2023-06-23 02:07:41,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1373712.0, ans=0.0 2023-06-23 02:07:47,616 INFO [train.py:996] (3/4) Epoch 8, batch 15500, loss[loss=0.2565, simple_loss=0.3242, pruned_loss=0.09442, over 21687.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3241, pruned_loss=0.08915, over 4264451.75 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:07:51,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1373772.0, ans=0.125 2023-06-23 02:07:56,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373772.0, ans=0.1 2023-06-23 02:07:56,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373772.0, ans=0.1 2023-06-23 02:08:07,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. 
limit=15.0 2023-06-23 02:08:11,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1373832.0, ans=0.125 2023-06-23 02:08:45,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1373952.0, ans=0.0 2023-06-23 02:09:14,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.796e+02 4.547e+02 5.557e+02 7.236e+02 1.680e+03, threshold=1.111e+03, percent-clipped=1.0 2023-06-23 02:09:18,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1374012.0, ans=0.2 2023-06-23 02:09:28,936 INFO [train.py:996] (3/4) Epoch 8, batch 15550, loss[loss=0.2362, simple_loss=0.3123, pruned_loss=0.08007, over 21703.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3208, pruned_loss=0.08598, over 4259853.98 frames. ], batch size: 332, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:09:52,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1374132.0, ans=0.0 2023-06-23 02:10:09,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-23 02:10:43,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-23 02:10:52,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1374312.0, ans=0.2 2023-06-23 02:10:54,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-23 02:11:07,680 INFO [train.py:996] (3/4) Epoch 8, batch 15600, loss[loss=0.2557, simple_loss=0.309, pruned_loss=0.1012, over 21804.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3141, pruned_loss=0.08409, over 4259338.73 frames. ], batch size: 98, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:12:32,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.200e+02 4.456e+02 5.982e+02 8.275e+02 1.817e+03, threshold=1.196e+03, percent-clipped=9.0 2023-06-23 02:12:46,660 INFO [train.py:996] (3/4) Epoch 8, batch 15650, loss[loss=0.2246, simple_loss=0.2843, pruned_loss=0.08245, over 21374.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3126, pruned_loss=0.08334, over 4253012.56 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:12:56,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1374672.0, ans=0.125 2023-06-23 02:12:58,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-06-23 02:13:14,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1374732.0, ans=0.125 2023-06-23 02:13:32,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1374792.0, ans=0.0 2023-06-23 02:13:56,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1374852.0, ans=0.0 2023-06-23 02:14:31,052 INFO [train.py:996] (3/4) Epoch 8, batch 15700, loss[loss=0.2157, simple_loss=0.2937, pruned_loss=0.06886, over 21494.00 frames. ], tot_loss[loss=0.237, simple_loss=0.309, pruned_loss=0.08251, over 4258687.87 frames. ], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:14:36,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1374972.0, ans=0.125 2023-06-23 02:14:36,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1374972.0, ans=0.0 2023-06-23 02:15:00,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1375032.0, ans=0.125 2023-06-23 02:15:09,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-23 02:15:41,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1375152.0, ans=0.0 2023-06-23 02:15:50,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.000e+02 4.439e+02 5.550e+02 6.958e+02 1.356e+03, threshold=1.110e+03, percent-clipped=1.0 2023-06-23 02:15:54,373 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:16:04,696 INFO [train.py:996] (3/4) Epoch 8, batch 15750, loss[loss=0.2175, simple_loss=0.2788, pruned_loss=0.0781, over 21843.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3047, pruned_loss=0.08204, over 4258157.54 frames. ], batch size: 98, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:16:20,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-23 02:16:22,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1375272.0, ans=0.5 2023-06-23 02:16:45,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1375392.0, ans=0.125 2023-06-23 02:16:50,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1375392.0, ans=0.125 2023-06-23 02:17:49,024 INFO [train.py:996] (3/4) Epoch 8, batch 15800, loss[loss=0.2018, simple_loss=0.2676, pruned_loss=0.06796, over 21521.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2998, pruned_loss=0.08118, over 4246099.23 frames. 
], batch size: 195, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:17:49,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1375572.0, ans=0.0 2023-06-23 02:18:07,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375632.0, ans=0.1 2023-06-23 02:19:04,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.739e+02 6.417e+02 1.005e+03 2.218e+03, threshold=1.283e+03, percent-clipped=19.0 2023-06-23 02:19:24,002 INFO [train.py:996] (3/4) Epoch 8, batch 15850, loss[loss=0.2204, simple_loss=0.2794, pruned_loss=0.08073, over 21738.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3032, pruned_loss=0.08362, over 4257510.86 frames. ], batch size: 112, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:20:47,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-23 02:20:49,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-23 02:20:59,213 INFO [train.py:996] (3/4) Epoch 8, batch 15900, loss[loss=0.2606, simple_loss=0.3322, pruned_loss=0.09453, over 21841.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3026, pruned_loss=0.08367, over 4259367.25 frames. ], batch size: 118, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:21:09,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1376172.0, ans=0.125 2023-06-23 02:22:24,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 4.482e+02 6.671e+02 9.137e+02 1.402e+03, threshold=1.334e+03, percent-clipped=2.0 2023-06-23 02:22:27,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1376412.0, ans=0.025 2023-06-23 02:22:38,518 INFO [train.py:996] (3/4) Epoch 8, batch 15950, loss[loss=0.2397, simple_loss=0.337, pruned_loss=0.07113, over 21775.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3031, pruned_loss=0.08147, over 4253255.81 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:22:45,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1376472.0, ans=0.0 2023-06-23 02:22:57,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-23 02:23:18,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1376532.0, ans=0.0 2023-06-23 02:23:21,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1376592.0, ans=0.0 2023-06-23 02:23:26,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1376592.0, ans=0.2 2023-06-23 02:23:26,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-06-23 02:23:42,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1376652.0, ans=0.2 2023-06-23 02:24:07,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1376712.0, ans=0.125 2023-06-23 02:24:10,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1376712.0, ans=0.125 2023-06-23 02:24:12,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1376772.0, ans=0.025 2023-06-23 02:24:13,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-23 02:24:13,810 INFO [train.py:996] (3/4) Epoch 8, batch 16000, loss[loss=0.2496, simple_loss=0.3368, pruned_loss=0.08118, over 21821.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3025, pruned_loss=0.07827, over 4256489.94 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:24:30,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1376772.0, ans=0.0 2023-06-23 02:24:35,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1376832.0, ans=0.125 2023-06-23 02:25:39,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.869e+02 4.296e+02 5.678e+02 9.703e+02 1.741e+03, threshold=1.136e+03, percent-clipped=11.0 2023-06-23 02:25:44,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1377012.0, ans=0.0 2023-06-23 02:25:53,859 INFO [train.py:996] (3/4) Epoch 8, batch 16050, loss[loss=0.2092, simple_loss=0.3002, pruned_loss=0.05909, over 21663.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3053, pruned_loss=0.07659, over 4260320.01 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:26:03,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1377072.0, ans=0.125 2023-06-23 02:26:11,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-23 02:26:41,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1377192.0, ans=0.0 2023-06-23 02:27:32,631 INFO [train.py:996] (3/4) Epoch 8, batch 16100, loss[loss=0.2659, simple_loss=0.3242, pruned_loss=0.1038, over 21335.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3093, pruned_loss=0.07907, over 4265862.83 frames. 
], batch size: 159, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:27:37,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1377372.0, ans=0.0 2023-06-23 02:27:44,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1377372.0, ans=0.125 2023-06-23 02:28:08,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1377432.0, ans=0.125 2023-06-23 02:28:18,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-23 02:28:58,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.010e+02 6.146e+02 8.242e+02 2.299e+03, threshold=1.229e+03, percent-clipped=9.0 2023-06-23 02:29:04,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-23 02:29:12,554 INFO [train.py:996] (3/4) Epoch 8, batch 16150, loss[loss=0.2388, simple_loss=0.2949, pruned_loss=0.09136, over 21333.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3094, pruned_loss=0.08111, over 4278343.11 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:29:19,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.156e-02 2023-06-23 02:30:00,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1377792.0, ans=0.09899494936611666 2023-06-23 02:30:30,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1377852.0, ans=0.125 2023-06-23 02:30:53,178 INFO [train.py:996] (3/4) Epoch 8, batch 16200, loss[loss=0.2859, simple_loss=0.362, pruned_loss=0.1049, over 21702.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3151, pruned_loss=0.08363, over 4278532.03 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:31:01,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1377972.0, ans=0.035 2023-06-23 02:31:01,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1377972.0, ans=0.5 2023-06-23 02:31:13,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1378032.0, ans=0.1 2023-06-23 02:31:26,147 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:31:49,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1378092.0, ans=0.125 2023-06-23 02:32:03,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1378152.0, ans=0.0 2023-06-23 02:32:14,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.981e+02 6.694e+02 1.065e+03 1.723e+03, threshold=1.339e+03, percent-clipped=15.0 2023-06-23 02:32:27,730 INFO [train.py:996] (3/4) Epoch 8, batch 16250, loss[loss=0.2184, simple_loss=0.2846, pruned_loss=0.0761, over 21848.00 frames. 
], tot_loss[loss=0.2398, simple_loss=0.3139, pruned_loss=0.08279, over 4277889.18 frames. ], batch size: 118, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:32:33,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1378272.0, ans=0.125 2023-06-23 02:34:06,535 INFO [train.py:996] (3/4) Epoch 8, batch 16300, loss[loss=0.1873, simple_loss=0.2528, pruned_loss=0.06092, over 21866.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.309, pruned_loss=0.07837, over 4271399.60 frames. ], batch size: 107, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:35:14,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1378752.0, ans=0.1 2023-06-23 02:35:35,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.087e+02 4.179e+02 5.648e+02 7.276e+02 1.488e+03, threshold=1.130e+03, percent-clipped=3.0 2023-06-23 02:35:53,651 INFO [train.py:996] (3/4) Epoch 8, batch 16350, loss[loss=0.2323, simple_loss=0.3029, pruned_loss=0.08089, over 21781.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3097, pruned_loss=0.07938, over 4270482.53 frames. ], batch size: 102, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:36:04,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-23 02:36:19,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-23 02:36:31,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1378932.0, ans=0.0 2023-06-23 02:37:09,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1379112.0, ans=0.025 2023-06-23 02:37:33,152 INFO [train.py:996] (3/4) Epoch 8, batch 16400, loss[loss=0.221, simple_loss=0.291, pruned_loss=0.07555, over 21929.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3155, pruned_loss=0.08217, over 4272740.56 frames. ], batch size: 107, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:37:43,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. 
limit=12.0 2023-06-23 02:38:14,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379292.0, ans=0.1 2023-06-23 02:38:47,544 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:38:55,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.176e+02 5.676e+02 8.447e+02 1.118e+03 2.154e+03, threshold=1.689e+03, percent-clipped=24.0 2023-06-23 02:38:55,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1379412.0, ans=0.125 2023-06-23 02:38:59,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379412.0, ans=0.1 2023-06-23 02:39:09,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1379472.0, ans=0.125 2023-06-23 02:39:10,975 INFO [train.py:996] (3/4) Epoch 8, batch 16450, loss[loss=0.2554, simple_loss=0.3232, pruned_loss=0.09386, over 21783.00 frames. ], tot_loss[loss=0.242, simple_loss=0.316, pruned_loss=0.08399, over 4276132.72 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:39:16,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379472.0, ans=0.1 2023-06-23 02:39:22,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1379472.0, ans=0.0 2023-06-23 02:40:21,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1379652.0, ans=0.125 2023-06-23 02:40:50,697 INFO [train.py:996] (3/4) Epoch 8, batch 16500, loss[loss=0.2186, simple_loss=0.2812, pruned_loss=0.07804, over 21667.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3145, pruned_loss=0.08404, over 4276477.66 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:41:28,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1379832.0, ans=0.125 2023-06-23 02:41:33,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.52 vs. limit=22.5 2023-06-23 02:41:39,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. limit=15.0 2023-06-23 02:41:41,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379892.0, ans=0.1 2023-06-23 02:41:46,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1379892.0, ans=0.125 2023-06-23 02:41:56,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-23 02:42:20,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 5.237e+02 7.941e+02 1.285e+03 2.739e+03, threshold=1.588e+03, percent-clipped=14.0 2023-06-23 02:42:36,163 INFO [train.py:996] (3/4) Epoch 8, batch 16550, loss[loss=0.2559, simple_loss=0.3288, pruned_loss=0.09148, over 20018.00 frames. 
], tot_loss[loss=0.2369, simple_loss=0.311, pruned_loss=0.08147, over 4269028.33 frames. ], batch size: 702, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:42:46,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1380072.0, ans=0.2 2023-06-23 02:42:58,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-23 02:43:07,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1380132.0, ans=0.2 2023-06-23 02:43:45,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1380252.0, ans=0.2 2023-06-23 02:43:50,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1380252.0, ans=0.125 2023-06-23 02:44:21,918 INFO [train.py:996] (3/4) Epoch 8, batch 16600, loss[loss=0.2678, simple_loss=0.3626, pruned_loss=0.08655, over 21300.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3182, pruned_loss=0.08408, over 4267681.00 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:44:38,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-23 02:45:52,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.266e+02 4.965e+02 7.305e+02 1.134e+03 2.257e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 02:45:59,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-23 02:46:02,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1380672.0, ans=0.0 2023-06-23 02:46:02,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380672.0, ans=0.125 2023-06-23 02:46:03,720 INFO [train.py:996] (3/4) Epoch 8, batch 16650, loss[loss=0.3178, simple_loss=0.3816, pruned_loss=0.127, over 21761.00 frames. ], tot_loss[loss=0.25, simple_loss=0.327, pruned_loss=0.08652, over 4271699.85 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:46:20,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1380732.0, ans=0.2 2023-06-23 02:46:47,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-23 02:47:02,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1380792.0, ans=0.125 2023-06-23 02:47:45,721 INFO [train.py:996] (3/4) Epoch 8, batch 16700, loss[loss=0.2229, simple_loss=0.3113, pruned_loss=0.06723, over 21727.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.327, pruned_loss=0.08676, over 4267970.70 frames. 
], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:48:38,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381092.0, ans=0.1 2023-06-23 02:49:01,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 02:49:21,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.736e+02 6.147e+02 8.645e+02 1.656e+03, threshold=1.229e+03, percent-clipped=2.0 2023-06-23 02:49:38,536 INFO [train.py:996] (3/4) Epoch 8, batch 16750, loss[loss=0.282, simple_loss=0.3704, pruned_loss=0.09678, over 21711.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3301, pruned_loss=0.08862, over 4267805.73 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:49:55,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-23 02:50:13,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-23 02:50:24,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1381392.0, ans=0.2 2023-06-23 02:50:49,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1381452.0, ans=0.0 2023-06-23 02:50:54,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381452.0, ans=0.1 2023-06-23 02:51:13,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1381512.0, ans=0.0 2023-06-23 02:51:25,378 INFO [train.py:996] (3/4) Epoch 8, batch 16800, loss[loss=0.2818, simple_loss=0.3789, pruned_loss=0.0923, over 20701.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3334, pruned_loss=0.0883, over 4261077.55 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:51:36,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1381572.0, ans=0.125 2023-06-23 02:51:39,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-23 02:51:53,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-23 02:52:17,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.17 vs. limit=22.5 2023-06-23 02:52:47,892 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 4.585e+02 6.307e+02 8.952e+02 1.873e+03, threshold=1.261e+03, percent-clipped=4.0 2023-06-23 02:53:03,691 INFO [train.py:996] (3/4) Epoch 8, batch 16850, loss[loss=0.228, simple_loss=0.2976, pruned_loss=0.07925, over 21944.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3293, pruned_loss=0.08834, over 4269545.64 frames. 
], batch size: 351, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:53:24,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1381932.0, ans=0.125 2023-06-23 02:53:43,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-23 02:54:20,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382112.0, ans=0.1 2023-06-23 02:54:46,751 INFO [train.py:996] (3/4) Epoch 8, batch 16900, loss[loss=0.2527, simple_loss=0.3108, pruned_loss=0.09727, over 21392.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3255, pruned_loss=0.08716, over 4269604.66 frames. ], batch size: 131, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:54:47,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382172.0, ans=0.1 2023-06-23 02:54:59,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1382172.0, ans=0.125 2023-06-23 02:55:06,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1382232.0, ans=0.0 2023-06-23 02:55:20,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1382292.0, ans=0.125 2023-06-23 02:55:25,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1382292.0, ans=0.125 2023-06-23 02:55:32,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1382292.0, ans=0.0 2023-06-23 02:56:07,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-23 02:56:07,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.361e+02 4.825e+02 6.497e+02 9.276e+02 2.744e+03, threshold=1.299e+03, percent-clipped=9.0 2023-06-23 02:56:10,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-23 02:56:26,516 INFO [train.py:996] (3/4) Epoch 8, batch 16950, loss[loss=0.2977, simple_loss=0.3351, pruned_loss=0.1301, over 21765.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3192, pruned_loss=0.08593, over 4273607.50 frames. ], batch size: 508, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:56:41,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382532.0, ans=0.1 2023-06-23 02:56:48,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1382532.0, ans=0.0 2023-06-23 02:56:53,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1382532.0, ans=0.2 2023-06-23 02:57:06,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1382592.0, ans=0.125 2023-06-23 02:58:05,774 INFO [train.py:996] (3/4) Epoch 8, batch 17000, loss[loss=0.285, simple_loss=0.3382, pruned_loss=0.1159, over 21799.00 frames. 
], tot_loss[loss=0.2453, simple_loss=0.3169, pruned_loss=0.08686, over 4281837.02 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:58:37,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1382892.0, ans=0.2 2023-06-23 02:59:36,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.116e+02 8.558e+02 1.129e+03 2.527e+03, threshold=1.712e+03, percent-clipped=16.0 2023-06-23 02:59:46,390 INFO [train.py:996] (3/4) Epoch 8, batch 17050, loss[loss=0.271, simple_loss=0.3355, pruned_loss=0.1032, over 21899.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3244, pruned_loss=0.0885, over 4280904.89 frames. ], batch size: 107, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:59:58,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1383072.0, ans=0.125 2023-06-23 03:00:13,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1383132.0, ans=0.125 2023-06-23 03:01:16,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1383312.0, ans=0.125 2023-06-23 03:01:27,109 INFO [train.py:996] (3/4) Epoch 8, batch 17100, loss[loss=0.2374, simple_loss=0.3065, pruned_loss=0.08413, over 21863.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3237, pruned_loss=0.08897, over 4289578.43 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:01:27,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1383372.0, ans=0.125 2023-06-23 03:01:46,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1383432.0, ans=0.1 2023-06-23 03:02:07,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1383492.0, ans=0.0 2023-06-23 03:02:32,455 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:02:45,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1383612.0, ans=0.125 2023-06-23 03:02:45,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1383612.0, ans=0.0 2023-06-23 03:02:52,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.295e+02 4.559e+02 5.897e+02 8.703e+02 1.483e+03, threshold=1.179e+03, percent-clipped=0.0 2023-06-23 03:03:01,681 INFO [train.py:996] (3/4) Epoch 8, batch 17150, loss[loss=0.1852, simple_loss=0.2563, pruned_loss=0.05706, over 21255.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3188, pruned_loss=0.08795, over 4290242.89 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:03:27,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.47 vs. limit=22.5 2023-06-23 03:04:31,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1383912.0, ans=10.0 2023-06-23 03:04:42,743 INFO [train.py:996] (3/4) Epoch 8, batch 17200, loss[loss=0.2888, simple_loss=0.3661, pruned_loss=0.1058, over 21797.00 frames. 
], tot_loss[loss=0.2478, simple_loss=0.3193, pruned_loss=0.08817, over 4287918.32 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:04:48,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-23 03:05:13,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-23 03:05:24,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-23 03:06:02,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1384152.0, ans=0.125 2023-06-23 03:06:13,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.493e+02 5.793e+02 8.418e+02 1.650e+03, threshold=1.159e+03, percent-clipped=7.0 2023-06-23 03:06:20,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1384212.0, ans=0.125 2023-06-23 03:06:23,177 INFO [train.py:996] (3/4) Epoch 8, batch 17250, loss[loss=0.299, simple_loss=0.3732, pruned_loss=0.1124, over 21250.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3227, pruned_loss=0.0907, over 4289077.02 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:06:39,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1384272.0, ans=0.0 2023-06-23 03:06:41,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1384272.0, ans=0.125 2023-06-23 03:07:19,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1384392.0, ans=0.125 2023-06-23 03:07:30,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1384452.0, ans=0.1 2023-06-23 03:07:55,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-23 03:08:09,813 INFO [train.py:996] (3/4) Epoch 8, batch 17300, loss[loss=0.2682, simple_loss=0.3394, pruned_loss=0.09853, over 21318.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3297, pruned_loss=0.09359, over 4284049.80 frames. ], batch size: 131, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:08:31,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1384632.0, ans=0.125 2023-06-23 03:08:49,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-06-23 03:09:18,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1384752.0, ans=0.125 2023-06-23 03:09:19,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1384752.0, ans=12.0 2023-06-23 03:09:36,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1384812.0, ans=0.125 2023-06-23 03:09:41,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.842e+02 6.354e+02 8.974e+02 2.324e+03, threshold=1.271e+03, percent-clipped=7.0 2023-06-23 03:09:56,672 INFO [train.py:996] (3/4) Epoch 8, batch 17350, loss[loss=0.273, simple_loss=0.3615, pruned_loss=0.09228, over 21646.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3311, pruned_loss=0.09364, over 4281128.74 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:11:09,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1385052.0, ans=0.125 2023-06-23 03:11:33,388 INFO [train.py:996] (3/4) Epoch 8, batch 17400, loss[loss=0.2153, simple_loss=0.2832, pruned_loss=0.07374, over 21429.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.327, pruned_loss=0.08925, over 4274095.15 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:12:01,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1385232.0, ans=0.125 2023-06-23 03:12:06,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1385232.0, ans=0.125 2023-06-23 03:12:28,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1385292.0, ans=0.015 2023-06-23 03:12:54,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1385412.0, ans=0.0 2023-06-23 03:13:07,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.616e+02 6.487e+02 8.880e+02 2.609e+03, threshold=1.297e+03, percent-clipped=10.0 2023-06-23 03:13:19,785 INFO [train.py:996] (3/4) Epoch 8, batch 17450, loss[loss=0.2121, simple_loss=0.3087, pruned_loss=0.05777, over 21670.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3207, pruned_loss=0.08511, over 4271375.31 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:13:51,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1385592.0, ans=0.125 2023-06-23 03:13:53,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1385592.0, ans=0.0 2023-06-23 03:14:03,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. 
limit=15.0 2023-06-23 03:14:07,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1385592.0, ans=0.1 2023-06-23 03:14:40,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1385712.0, ans=0.0 2023-06-23 03:15:00,648 INFO [train.py:996] (3/4) Epoch 8, batch 17500, loss[loss=0.2172, simple_loss=0.2839, pruned_loss=0.07524, over 21752.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3172, pruned_loss=0.08217, over 4276538.32 frames. ], batch size: 230, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:16:00,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1385952.0, ans=0.125 2023-06-23 03:16:12,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1385952.0, ans=0.2 2023-06-23 03:16:31,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.152e+02 4.278e+02 5.524e+02 8.928e+02 1.678e+03, threshold=1.105e+03, percent-clipped=3.0 2023-06-23 03:16:40,126 INFO [train.py:996] (3/4) Epoch 8, batch 17550, loss[loss=0.2121, simple_loss=0.2618, pruned_loss=0.08119, over 20361.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3172, pruned_loss=0.08164, over 4271617.53 frames. ], batch size: 703, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:17:24,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.57 vs. limit=15.0 2023-06-23 03:17:41,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-23 03:18:11,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1386312.0, ans=0.04949747468305833 2023-06-23 03:18:18,867 INFO [train.py:996] (3/4) Epoch 8, batch 17600, loss[loss=0.2585, simple_loss=0.3445, pruned_loss=0.08623, over 21823.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3192, pruned_loss=0.08214, over 4271500.18 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:18:28,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-23 03:18:55,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1386432.0, ans=0.125 2023-06-23 03:19:27,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-23 03:19:48,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 4.735e+02 6.263e+02 8.398e+02 1.704e+03, threshold=1.253e+03, percent-clipped=10.0 2023-06-23 03:19:55,687 INFO [train.py:996] (3/4) Epoch 8, batch 17650, loss[loss=0.2015, simple_loss=0.2717, pruned_loss=0.06563, over 21756.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3174, pruned_loss=0.0822, over 4264803.66 frames. 
], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:20:02,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1386672.0, ans=10.0 2023-06-23 03:20:59,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-23 03:21:22,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1386912.0, ans=0.0 2023-06-23 03:21:26,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1386912.0, ans=0.05 2023-06-23 03:21:27,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-23 03:21:36,246 INFO [train.py:996] (3/4) Epoch 8, batch 17700, loss[loss=0.2818, simple_loss=0.3619, pruned_loss=0.1008, over 21432.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3102, pruned_loss=0.07841, over 4267173.45 frames. ], batch size: 471, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:22:09,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1387032.0, ans=0.125 2023-06-23 03:23:05,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1387212.0, ans=0.0 2023-06-23 03:23:11,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 4.293e+02 5.499e+02 1.006e+03 2.228e+03, threshold=1.100e+03, percent-clipped=12.0 2023-06-23 03:23:16,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1387272.0, ans=0.125 2023-06-23 03:23:17,888 INFO [train.py:996] (3/4) Epoch 8, batch 17750, loss[loss=0.2926, simple_loss=0.3629, pruned_loss=0.1111, over 21791.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3184, pruned_loss=0.08169, over 4269791.91 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:24:28,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1387452.0, ans=0.2 2023-06-23 03:24:58,723 INFO [train.py:996] (3/4) Epoch 8, batch 17800, loss[loss=0.2253, simple_loss=0.2885, pruned_loss=0.08107, over 21566.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3181, pruned_loss=0.08113, over 4269881.44 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:25:00,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1387572.0, ans=0.125 2023-06-23 03:25:04,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1387572.0, ans=0.125 2023-06-23 03:26:05,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.84 vs. 
limit=6.0 2023-06-23 03:26:15,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1387752.0, ans=0.2 2023-06-23 03:26:32,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.467e+02 6.036e+02 8.319e+02 2.220e+03, threshold=1.207e+03, percent-clipped=14.0 2023-06-23 03:26:39,341 INFO [train.py:996] (3/4) Epoch 8, batch 17850, loss[loss=0.2385, simple_loss=0.3121, pruned_loss=0.08248, over 21780.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3204, pruned_loss=0.08293, over 4273058.03 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:27:07,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1387932.0, ans=0.1 2023-06-23 03:27:17,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.63 vs. limit=5.0 2023-06-23 03:27:20,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1387992.0, ans=0.125 2023-06-23 03:28:16,038 INFO [train.py:996] (3/4) Epoch 8, batch 17900, loss[loss=0.2672, simple_loss=0.3638, pruned_loss=0.08527, over 21647.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3259, pruned_loss=0.08544, over 4278155.07 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:29:33,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388352.0, ans=0.1 2023-06-23 03:29:46,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388412.0, ans=0.1 2023-06-23 03:29:48,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1388412.0, ans=10.0 2023-06-23 03:30:00,910 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 4.650e+02 5.974e+02 7.368e+02 1.876e+03, threshold=1.195e+03, percent-clipped=6.0 2023-06-23 03:30:10,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1388472.0, ans=0.0 2023-06-23 03:30:11,643 INFO [train.py:996] (3/4) Epoch 8, batch 17950, loss[loss=0.1991, simple_loss=0.2808, pruned_loss=0.05866, over 21820.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.325, pruned_loss=0.0822, over 4270832.21 frames. ], batch size: 118, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:30:18,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1388472.0, ans=0.125 2023-06-23 03:30:32,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1388532.0, ans=0.0 2023-06-23 03:30:35,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1388532.0, ans=0.035 2023-06-23 03:31:47,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1388712.0, ans=0.0 2023-06-23 03:31:48,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1388772.0, ans=0.0 2023-06-23 03:31:49,993 INFO [train.py:996] (3/4) Epoch 8, batch 18000, loss[loss=0.2336, simple_loss=0.3017, pruned_loss=0.08275, over 21500.00 frames. 
], tot_loss[loss=0.2405, simple_loss=0.3192, pruned_loss=0.08089, over 4273914.29 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:31:49,993 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 03:32:06,873 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2644, simple_loss=0.3593, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-23 03:32:06,874 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB 2023-06-23 03:32:13,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-23 03:32:17,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1388772.0, ans=0.1 2023-06-23 03:33:43,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.172e+02 4.301e+02 6.081e+02 8.972e+02 1.795e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-23 03:33:44,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1389012.0, ans=0.125 2023-06-23 03:33:46,945 INFO [train.py:996] (3/4) Epoch 8, batch 18050, loss[loss=0.2724, simple_loss=0.3312, pruned_loss=0.1068, over 21716.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3129, pruned_loss=0.0797, over 4274762.90 frames. ], batch size: 231, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:34:37,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1389192.0, ans=0.0 2023-06-23 03:35:28,041 INFO [train.py:996] (3/4) Epoch 8, batch 18100, loss[loss=0.2287, simple_loss=0.3004, pruned_loss=0.07854, over 20053.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3163, pruned_loss=0.08192, over 4270732.86 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:35:45,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=12.0 2023-06-23 03:36:32,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1389492.0, ans=0.125 2023-06-23 03:36:53,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1389612.0, ans=0.125 2023-06-23 03:36:54,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1389612.0, ans=0.0 2023-06-23 03:36:57,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1389612.0, ans=0.125 2023-06-23 03:36:57,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1389612.0, ans=0.1 2023-06-23 03:37:04,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.338e+02 4.664e+02 6.579e+02 9.782e+02 2.052e+03, threshold=1.316e+03, percent-clipped=11.0 2023-06-23 03:37:06,647 INFO [train.py:996] (3/4) Epoch 8, batch 18150, loss[loss=0.2655, simple_loss=0.3671, pruned_loss=0.08195, over 19881.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3192, pruned_loss=0.08261, over 4275147.18 frames. 
], batch size: 702, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:37:43,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1389792.0, ans=0.125 2023-06-23 03:37:51,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1389792.0, ans=0.0 2023-06-23 03:37:52,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1389792.0, ans=0.07 2023-06-23 03:37:52,878 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:38:43,069 INFO [train.py:996] (3/4) Epoch 8, batch 18200, loss[loss=0.229, simple_loss=0.2942, pruned_loss=0.08194, over 21579.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3136, pruned_loss=0.08193, over 4281754.85 frames. ], batch size: 415, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:40:14,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390212.0, ans=0.1 2023-06-23 03:40:17,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.155e+02 4.842e+02 6.713e+02 9.646e+02 2.158e+03, threshold=1.343e+03, percent-clipped=10.0 2023-06-23 03:40:19,039 INFO [train.py:996] (3/4) Epoch 8, batch 18250, loss[loss=0.2092, simple_loss=0.2771, pruned_loss=0.07061, over 21638.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3058, pruned_loss=0.0792, over 4285418.97 frames. ], batch size: 263, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:40:22,558 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:40:46,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-23 03:40:50,165 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:41:22,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.46 vs. limit=22.5 2023-06-23 03:41:34,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1390512.0, ans=0.025 2023-06-23 03:41:42,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1390512.0, ans=0.125 2023-06-23 03:41:56,253 INFO [train.py:996] (3/4) Epoch 8, batch 18300, loss[loss=0.1797, simple_loss=0.251, pruned_loss=0.05427, over 21823.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3065, pruned_loss=0.07989, over 4284637.28 frames. ], batch size: 102, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:42:05,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390572.0, ans=0.1 2023-06-23 03:43:01,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1390752.0, ans=0.0 2023-06-23 03:43:32,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.212e+02 4.888e+02 7.173e+02 1.170e+03 2.600e+03, threshold=1.435e+03, percent-clipped=18.0 2023-06-23 03:43:33,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. 
limit=12.0 2023-06-23 03:43:34,031 INFO [train.py:996] (3/4) Epoch 8, batch 18350, loss[loss=0.2533, simple_loss=0.3749, pruned_loss=0.06589, over 19872.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3094, pruned_loss=0.07957, over 4248925.96 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:44:01,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-23 03:44:25,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-23 03:44:50,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1391112.0, ans=0.125 2023-06-23 03:45:12,225 INFO [train.py:996] (3/4) Epoch 8, batch 18400, loss[loss=0.181, simple_loss=0.2504, pruned_loss=0.05575, over 16187.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3068, pruned_loss=0.07802, over 4236090.11 frames. ], batch size: 60, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:45:36,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1391232.0, ans=0.2 2023-06-23 03:45:51,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1391292.0, ans=0.0 2023-06-23 03:46:05,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1391352.0, ans=0.125 2023-06-23 03:46:07,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-23 03:46:30,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1391412.0, ans=0.0 2023-06-23 03:46:36,959 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:46:45,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1391412.0, ans=0.2 2023-06-23 03:46:46,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.306e+02 6.034e+02 8.770e+02 2.014e+03, threshold=1.207e+03, percent-clipped=5.0 2023-06-23 03:46:48,006 INFO [train.py:996] (3/4) Epoch 8, batch 18450, loss[loss=0.2183, simple_loss=0.2877, pruned_loss=0.07442, over 21849.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3044, pruned_loss=0.07466, over 4246737.99 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:46:48,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1391472.0, ans=0.125 2023-06-23 03:46:59,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1391472.0, ans=10.0 2023-06-23 03:47:04,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1391532.0, ans=0.07 2023-06-23 03:47:34,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-06-23 03:47:55,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1391652.0, ans=10.0 2023-06-23 03:47:57,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1391652.0, ans=0.0 2023-06-23 03:48:21,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=15.0 2023-06-23 03:48:25,034 INFO [train.py:996] (3/4) Epoch 8, batch 18500, loss[loss=0.1946, simple_loss=0.2607, pruned_loss=0.06421, over 21427.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2999, pruned_loss=0.07399, over 4257250.00 frames. ], batch size: 212, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:48:25,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-23 03:49:10,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1391892.0, ans=0.125 2023-06-23 03:49:56,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1392012.0, ans=0.125 2023-06-23 03:50:02,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.977e+02 4.135e+02 5.661e+02 7.712e+02 1.457e+03, threshold=1.132e+03, percent-clipped=3.0 2023-06-23 03:50:04,104 INFO [train.py:996] (3/4) Epoch 8, batch 18550, loss[loss=0.2616, simple_loss=0.3208, pruned_loss=0.1012, over 21933.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2973, pruned_loss=0.07363, over 4259866.00 frames. ], batch size: 103, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:50:30,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1392132.0, ans=0.125 2023-06-23 03:50:35,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-23 03:51:15,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1392252.0, ans=0.125 2023-06-23 03:51:17,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392252.0, ans=0.1 2023-06-23 03:51:31,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-23 03:51:36,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-23 03:51:39,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1392312.0, ans=0.0 2023-06-23 03:51:43,197 INFO [train.py:996] (3/4) Epoch 8, batch 18600, loss[loss=0.3027, simple_loss=0.3747, pruned_loss=0.1154, over 21520.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2959, pruned_loss=0.07464, over 4252702.96 frames. 
], batch size: 473, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:52:36,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392492.0, ans=0.125 2023-06-23 03:52:47,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1392552.0, ans=0.125 2023-06-23 03:53:17,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.135e+02 5.174e+02 7.950e+02 1.061e+03 1.906e+03, threshold=1.590e+03, percent-clipped=19.0 2023-06-23 03:53:19,646 INFO [train.py:996] (3/4) Epoch 8, batch 18650, loss[loss=0.2294, simple_loss=0.2933, pruned_loss=0.08278, over 21780.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2944, pruned_loss=0.07446, over 4248753.46 frames. ], batch size: 102, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:53:48,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1392732.0, ans=0.125 2023-06-23 03:54:02,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1392792.0, ans=0.2 2023-06-23 03:54:12,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1392792.0, ans=0.1 2023-06-23 03:54:14,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-23 03:54:56,233 INFO [train.py:996] (3/4) Epoch 8, batch 18700, loss[loss=0.273, simple_loss=0.3187, pruned_loss=0.1136, over 21528.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2922, pruned_loss=0.07582, over 4256605.35 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:54:58,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392972.0, ans=0.1 2023-06-23 03:54:59,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1392972.0, ans=0.035 2023-06-23 03:55:57,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.72 vs. limit=15.0 2023-06-23 03:56:32,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.347e+02 4.104e+02 5.076e+02 6.613e+02 1.727e+03, threshold=1.015e+03, percent-clipped=1.0 2023-06-23 03:56:33,804 INFO [train.py:996] (3/4) Epoch 8, batch 18750, loss[loss=0.2478, simple_loss=0.3217, pruned_loss=0.08693, over 21797.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2972, pruned_loss=0.0795, over 4250873.78 frames. ], batch size: 124, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:56:44,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-23 03:57:01,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-23 03:57:06,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.30 vs. 
limit=22.5
2023-06-23 03:57:09,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1393392.0, ans=0.125
2023-06-23 03:57:14,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.77 vs. limit=15.0
2023-06-23 03:57:33,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=22.5
2023-06-23 03:57:53,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393512.0, ans=0.1
2023-06-23 03:58:11,910 INFO [train.py:996] (3/4) Epoch 8, batch 18800, loss[loss=0.2563, simple_loss=0.3415, pruned_loss=0.08554, over 21694.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3034, pruned_loss=0.08077, over 4239660.94 frames. ], batch size: 441, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 03:58:21,571 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 03:58:32,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1393632.0, ans=0.07
2023-06-23 03:58:57,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1393692.0, ans=0.125
2023-06-23 03:59:26,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1393752.0, ans=0.0
2023-06-23 03:59:29,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1393812.0, ans=15.0
2023-06-23 03:59:42,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1393812.0, ans=0.0
2023-06-23 03:59:48,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.636e+02 4.595e+02 6.296e+02 8.883e+02 2.093e+03, threshold=1.259e+03, percent-clipped=21.0
2023-06-23 03:59:50,061 INFO [train.py:996] (3/4) Epoch 8, batch 18850, loss[loss=0.2133, simple_loss=0.2735, pruned_loss=0.07659, over 21149.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2978, pruned_loss=0.07569, over 4247908.38 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:01:09,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1394112.0, ans=0.125
2023-06-23 04:01:21,360 INFO [train.py:996] (3/4) Epoch 8, batch 18900, loss[loss=0.2166, simple_loss=0.2752, pruned_loss=0.07901, over 21438.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2944, pruned_loss=0.07559, over 4248961.48 frames. ], batch size: 476, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:01:32,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1394172.0, ans=0.1
2023-06-23 04:02:05,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-06-23 04:02:32,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0
2023-06-23 04:02:58,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0
2023-06-23 04:02:58,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.232e+02 4.494e+02 5.345e+02 6.718e+02 1.434e+03, threshold=1.069e+03, percent-clipped=2.0
2023-06-23 04:03:00,404 INFO [train.py:996] (3/4) Epoch 8, batch 18950, loss[loss=0.2327, simple_loss=0.3019, pruned_loss=0.0818, over 21344.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2982, pruned_loss=0.0785, over 4262751.92 frames. ], batch size: 159, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:03:13,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1394472.0, ans=0.0
2023-06-23 04:03:19,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1394532.0, ans=0.0
2023-06-23 04:03:21,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1394532.0, ans=0.2
2023-06-23 04:04:32,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1394712.0, ans=0.125
2023-06-23 04:04:37,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1394712.0, ans=0.125
2023-06-23 04:04:39,986 INFO [train.py:996] (3/4) Epoch 8, batch 19000, loss[loss=0.2509, simple_loss=0.3296, pruned_loss=0.08611, over 21921.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3074, pruned_loss=0.07959, over 4270011.31 frames. ], batch size: 372, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:05:35,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1394952.0, ans=0.2
2023-06-23 04:05:50,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1394952.0, ans=10.0
2023-06-23 04:06:07,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1395012.0, ans=0.125
2023-06-23 04:06:11,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.109e+02 8.049e+02 1.091e+03 2.389e+03, threshold=1.610e+03, percent-clipped=25.0
2023-06-23 04:06:13,334 INFO [train.py:996] (3/4) Epoch 8, batch 19050, loss[loss=0.2657, simple_loss=0.3242, pruned_loss=0.1036, over 21866.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3128, pruned_loss=0.08386, over 4277579.56 frames. ], batch size: 371, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:06:32,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1395132.0, ans=0.035
2023-06-23 04:07:13,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1395252.0, ans=0.1
2023-06-23 04:07:23,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1395252.0, ans=0.125
2023-06-23 04:07:44,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0
2023-06-23 04:07:53,266 INFO [train.py:996] (3/4) Epoch 8, batch 19100, loss[loss=0.2135, simple_loss=0.2741, pruned_loss=0.07642, over 21789.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3102, pruned_loss=0.08351, over 4282953.51 frames. ], batch size: 118, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:07:53,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1395372.0, ans=0.0
2023-06-23 04:08:04,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395372.0, ans=0.1
2023-06-23 04:08:48,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1395492.0, ans=0.0
2023-06-23 04:09:20,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1395612.0, ans=0.05
2023-06-23 04:09:33,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 4.590e+02 5.879e+02 8.375e+02 2.097e+03, threshold=1.176e+03, percent-clipped=3.0
2023-06-23 04:09:34,833 INFO [train.py:996] (3/4) Epoch 8, batch 19150, loss[loss=0.3258, simple_loss=0.415, pruned_loss=0.1183, over 21497.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3098, pruned_loss=0.08323, over 4280993.86 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:09:56,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1395732.0, ans=0.2
2023-06-23 04:10:16,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.82 vs. limit=15.0
2023-06-23 04:10:33,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1395792.0, ans=0.125
2023-06-23 04:11:19,401 INFO [train.py:996] (3/4) Epoch 8, batch 19200, loss[loss=0.2567, simple_loss=0.3576, pruned_loss=0.07792, over 21654.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.32, pruned_loss=0.08407, over 4278423.49 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 32.0
2023-06-23 04:11:19,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1395972.0, ans=0.0
2023-06-23 04:12:12,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0
2023-06-23 04:12:45,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1396212.0, ans=6.0
2023-06-23 04:12:50,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 4.861e+02 7.058e+02 9.743e+02 2.046e+03, threshold=1.412e+03, percent-clipped=16.0
2023-06-23 04:12:50,672 INFO [train.py:996] (3/4) Epoch 8, batch 19250, loss[loss=0.2404, simple_loss=0.3329, pruned_loss=0.07399, over 21503.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3196, pruned_loss=0.07919, over 4271978.45 frames. ], batch size: 507, lr: 3.70e-03, grad_scale: 16.0
2023-06-23 04:13:06,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1396272.0, ans=0.0
2023-06-23 04:14:10,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1396452.0, ans=0.125
2023-06-23 04:14:14,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1396512.0, ans=0.0
2023-06-23 04:14:29,806 INFO [train.py:996] (3/4) Epoch 8, batch 19300, loss[loss=0.1918, simple_loss=0.2768, pruned_loss=0.05335, over 21784.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3168, pruned_loss=0.07858, over 4280602.52 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0
2023-06-23 04:15:03,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0
2023-06-23 04:15:10,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396632.0, ans=0.1
2023-06-23 04:15:23,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=12.0
2023-06-23 04:15:42,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1396752.0, ans=0.125
2023-06-23 04:15:45,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1396752.0, ans=0.0
2023-06-23 04:15:48,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1396752.0, ans=0.2
2023-06-23 04:16:13,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396872.0, ans=0.1
2023-06-23 04:16:14,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.848e+02 5.029e+02 6.832e+02 8.768e+02 1.869e+03, threshold=1.366e+03, percent-clipped=8.0
2023-06-23 04:16:14,275 INFO [train.py:996] (3/4) Epoch 8, batch 19350, loss[loss=0.2352, simple_loss=0.2988, pruned_loss=0.08577, over 21166.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3123, pruned_loss=0.07501, over 4278889.98 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 16.0
2023-06-23 04:16:42,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1396932.0, ans=0.0
2023-06-23 04:16:53,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1396932.0, ans=0.125
2023-06-23 04:17:05,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1396992.0, ans=0.0
2023-06-23 04:17:16,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396992.0, ans=0.1
2023-06-23 04:17:16,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1396992.0, ans=0.2
2023-06-23 04:17:26,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1397052.0, ans=0.125
2023-06-23 04:17:34,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1397112.0, ans=0.125
2023-06-23 04:17:37,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1397112.0, ans=0.125
2023-06-23 04:17:54,521 INFO [train.py:996] (3/4) Epoch 8, batch 19400, loss[loss=0.2473, simple_loss=0.3221, pruned_loss=0.08619, over 21819.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3089, pruned_loss=0.07381, over 4285230.12 frames. ], batch size: 333, lr: 3.70e-03, grad_scale: 16.0
2023-06-23 04:18:04,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1397172.0, ans=0.1
2023-06-23 04:18:05,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1397172.0, ans=0.125
2023-06-23 04:18:26,911 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 04:19:38,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.184e+02 4.448e+02 5.769e+02 7.543e+02 1.139e+03, threshold=1.154e+03, percent-clipped=0.0
2023-06-23 04:19:38,392 INFO [train.py:996] (3/4) Epoch 8, batch 19450, loss[loss=0.2046, simple_loss=0.2891, pruned_loss=0.06002, over 20087.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3073, pruned_loss=0.07641, over 4290319.89 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 16.0
2023-06-23 04:19:40,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1397472.0, ans=0.125
2023-06-23 04:19:55,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1397472.0, ans=0.125
2023-06-23 04:20:45,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1397652.0, ans=0.0
2023-06-23 04:21:02,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1397712.0, ans=0.04949747468305833
2023-06-23 04:21:16,466 INFO [train.py:996] (3/4) Epoch 8, batch 19500, loss[loss=0.2061, simple_loss=0.2586, pruned_loss=0.07679, over 20843.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3041, pruned_loss=0.07817, over 4287310.51 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:21:29,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1397772.0, ans=0.0
2023-06-23 04:22:01,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1397892.0, ans=0.125
2023-06-23 04:22:01,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1397892.0, ans=0.125
2023-06-23 04:22:22,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1397952.0, ans=0.5
2023-06-23 04:22:32,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398012.0, ans=0.1
2023-06-23 04:22:35,545 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 04:22:54,689 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.009e+02 6.847e+02 1.109e+03 2.464e+03, threshold=1.369e+03, percent-clipped=22.0
2023-06-23 04:22:54,711 INFO [train.py:996] (3/4) Epoch 8, batch 19550, loss[loss=0.2272, simple_loss=0.3029, pruned_loss=0.07575, over 21141.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3009, pruned_loss=0.07709, over 4286168.66 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:23:12,029 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 04:23:44,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1398192.0, ans=0.125
2023-06-23 04:23:49,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1398252.0, ans=0.125
2023-06-23 04:23:59,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0
2023-06-23 04:24:29,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1398372.0, ans=0.0
2023-06-23 04:24:30,118 INFO [train.py:996] (3/4) Epoch 8, batch 19600, loss[loss=0.2174, simple_loss=0.313, pruned_loss=0.06093, over 19811.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3015, pruned_loss=0.0774, over 4278021.90 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:24:48,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0
2023-06-23 04:25:16,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1398492.0, ans=0.125
2023-06-23 04:25:17,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1398492.0, ans=0.2
2023-06-23 04:25:59,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1398612.0, ans=0.0
2023-06-23 04:26:08,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 4.617e+02 5.615e+02 7.940e+02 2.383e+03, threshold=1.123e+03, percent-clipped=6.0
2023-06-23 04:26:08,782 INFO [train.py:996] (3/4) Epoch 8, batch 19650, loss[loss=0.2649, simple_loss=0.3261, pruned_loss=0.1018, over 20708.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3063, pruned_loss=0.08073, over 4276772.09 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:27:53,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=15.0
2023-06-23 04:27:55,184 INFO [train.py:996] (3/4) Epoch 8, batch 19700, loss[loss=0.2024, simple_loss=0.3204, pruned_loss=0.0422, over 20790.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3116, pruned_loss=0.08145, over 4273618.82 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:28:15,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1399032.0, ans=0.125
2023-06-23 04:28:29,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0
2023-06-23 04:28:49,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1399092.0, ans=0.125
2023-06-23 04:29:06,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1399152.0, ans=0.0
2023-06-23 04:29:34,732 INFO [train.py:996] (3/4) Epoch 8, batch 19750, loss[loss=0.2468, simple_loss=0.3387, pruned_loss=0.07747, over 21643.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3194, pruned_loss=0.08212, over 4258791.61 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:29:36,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 5.158e+02 7.198e+02 1.115e+03 3.431e+03, threshold=1.440e+03, percent-clipped=24.0
2023-06-23 04:29:43,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1399272.0, ans=0.0
2023-06-23 04:30:14,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1399392.0, ans=0.95
2023-06-23 04:30:31,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399392.0, ans=0.1
2023-06-23 04:30:32,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=22.5
2023-06-23 04:30:53,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1399452.0, ans=0.125
2023-06-23 04:31:00,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1399512.0, ans=0.125
2023-06-23 04:31:12,839 INFO [train.py:996] (3/4) Epoch 8, batch 19800, loss[loss=0.2636, simple_loss=0.3228, pruned_loss=0.1022, over 21891.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3185, pruned_loss=0.08267, over 4266372.41 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:31:29,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399572.0, ans=0.1
2023-06-23 04:31:37,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1399632.0, ans=0.0
2023-06-23 04:32:14,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1399752.0, ans=0.0
2023-06-23 04:32:51,964 INFO [train.py:996] (3/4) Epoch 8, batch 19850, loss[loss=0.2542, simple_loss=0.3296, pruned_loss=0.08942, over 21466.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3123, pruned_loss=0.0784, over 4267056.74 frames. ], batch size: 507, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:32:53,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.926e+02 6.065e+02 9.192e+02 2.099e+03, threshold=1.213e+03, percent-clipped=4.0
2023-06-23 04:32:53,828 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 04:32:58,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1399872.0, ans=0.125
2023-06-23 04:33:59,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1400052.0, ans=0.125
2023-06-23 04:34:19,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1400112.0, ans=0.07
2023-06-23 04:34:28,522 INFO [train.py:996] (3/4) Epoch 8, batch 19900, loss[loss=0.2319, simple_loss=0.2988, pruned_loss=0.08252, over 21753.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3126, pruned_loss=0.07608, over 4271004.04 frames. ], batch size: 351, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:34:35,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1400172.0, ans=0.125
2023-06-23 04:34:48,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1400232.0, ans=0.125
2023-06-23 04:35:24,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0
2023-06-23 04:35:41,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1400352.0, ans=0.125
2023-06-23 04:35:55,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0
2023-06-23 04:36:08,888 INFO [train.py:996] (3/4) Epoch 8, batch 19950, loss[loss=0.2081, simple_loss=0.2718, pruned_loss=0.07218, over 21662.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3065, pruned_loss=0.0764, over 4273298.65 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:36:10,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.068e+02 6.013e+02 8.874e+02 2.224e+03, threshold=1.203e+03, percent-clipped=12.0
2023-06-23 04:36:15,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1400472.0, ans=0.125
2023-06-23 04:36:24,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0
2023-06-23 04:36:38,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1400532.0, ans=0.05
2023-06-23 04:36:39,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400532.0, ans=0.1
2023-06-23 04:36:51,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-06-23 04:37:06,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1400652.0, ans=0.125
2023-06-23 04:37:42,768 INFO [train.py:996] (3/4) Epoch 8, batch 20000, loss[loss=0.2492, simple_loss=0.3234, pruned_loss=0.08751, over 21794.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3081, pruned_loss=0.0772, over 4267818.86 frames. ], batch size: 112, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:37:43,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1400772.0, ans=0.125
2023-06-23 04:38:30,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1400892.0, ans=0.04949747468305833
2023-06-23 04:38:31,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=22.5
2023-06-23 04:38:34,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1400892.0, ans=15.0
2023-06-23 04:38:43,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1400952.0, ans=0.0
2023-06-23 04:39:05,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1401012.0, ans=0.125
2023-06-23 04:39:15,698 INFO [train.py:996] (3/4) Epoch 8, batch 20050, loss[loss=0.2725, simple_loss=0.3279, pruned_loss=0.1086, over 21830.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.309, pruned_loss=0.07939, over 4278002.80 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:39:18,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.577e+02 6.319e+02 8.281e+02 1.487e+03, threshold=1.264e+03, percent-clipped=6.0
2023-06-23 04:40:43,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1401312.0, ans=0.125
2023-06-23 04:40:49,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1401312.0, ans=10.0
2023-06-23 04:40:54,564 INFO [train.py:996] (3/4) Epoch 8, batch 20100, loss[loss=0.209, simple_loss=0.2712, pruned_loss=0.07344, over 17216.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3101, pruned_loss=0.08076, over 4283949.84 frames. ], batch size: 61, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:42:21,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1401612.0, ans=0.2
2023-06-23 04:42:47,129 INFO [train.py:996] (3/4) Epoch 8, batch 20150, loss[loss=0.2054, simple_loss=0.2587, pruned_loss=0.07605, over 20040.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3199, pruned_loss=0.08399, over 4281784.69 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:42:50,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 4.538e+02 5.704e+02 8.156e+02 2.453e+03, threshold=1.141e+03, percent-clipped=8.0
2023-06-23 04:43:40,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1401792.0, ans=0.125
2023-06-23 04:43:48,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401852.0, ans=0.1
2023-06-23 04:44:03,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1401912.0, ans=0.125
2023-06-23 04:44:24,615 INFO [train.py:996] (3/4) Epoch 8, batch 20200, loss[loss=0.2564, simple_loss=0.352, pruned_loss=0.08039, over 21833.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3267, pruned_loss=0.08772, over 4283035.20 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:44:33,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5
2023-06-23 04:44:46,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1402032.0, ans=0.125
2023-06-23 04:44:51,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5
2023-06-23 04:45:42,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1402212.0, ans=0.2
2023-06-23 04:45:58,108 INFO [train.py:996] (3/4) Epoch 8, batch 20250, loss[loss=0.2766, simple_loss=0.3545, pruned_loss=0.09938, over 21588.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3283, pruned_loss=0.08674, over 4284283.32 frames. ], batch size: 471, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:46:01,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.077e+02 7.177e+02 9.506e+02 2.179e+03, threshold=1.435e+03, percent-clipped=12.0
2023-06-23 04:46:13,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1402332.0, ans=0.2
2023-06-23 04:47:37,007 INFO [train.py:996] (3/4) Epoch 8, batch 20300, loss[loss=0.2559, simple_loss=0.3352, pruned_loss=0.08827, over 21703.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3261, pruned_loss=0.08358, over 4276623.24 frames. ], batch size: 298, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:47:52,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1402632.0, ans=0.125
2023-06-23 04:48:45,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1402752.0, ans=0.1
2023-06-23 04:48:54,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1402812.0, ans=0.2
2023-06-23 04:49:03,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1402812.0, ans=15.0
2023-06-23 04:49:10,166 INFO [train.py:996] (3/4) Epoch 8, batch 20350, loss[loss=0.2547, simple_loss=0.3223, pruned_loss=0.09353, over 21873.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3253, pruned_loss=0.08351, over 4264140.94 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 16.0
2023-06-23 04:49:11,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0
2023-06-23 04:49:13,371 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 4.961e+02 7.570e+02 1.006e+03 1.715e+03, threshold=1.514e+03, percent-clipped=7.0
2023-06-23 04:49:19,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1402872.0, ans=0.0
2023-06-23 04:49:27,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.92 vs. limit=22.5
2023-06-23 04:49:56,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1402992.0, ans=0.125
2023-06-23 04:50:16,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1403052.0, ans=0.09899494936611666
2023-06-23 04:50:23,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1403052.0, ans=0.04949747468305833
2023-06-23 04:50:44,162 INFO [train.py:996] (3/4) Epoch 8, batch 20400, loss[loss=0.3027, simple_loss=0.3772, pruned_loss=0.1141, over 21711.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3274, pruned_loss=0.08608, over 4246049.83 frames. ], batch size: 414, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:50:44,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1403172.0, ans=0.2
2023-06-23 04:51:49,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0
2023-06-23 04:52:17,034 INFO [train.py:996] (3/4) Epoch 8, batch 20450, loss[loss=0.2451, simple_loss=0.2998, pruned_loss=0.09516, over 19961.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3281, pruned_loss=0.08812, over 4229994.37 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:52:20,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.597e+02 5.065e+02 6.621e+02 9.433e+02 1.870e+03, threshold=1.324e+03, percent-clipped=2.0
2023-06-23 04:53:03,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1403592.0, ans=0.0
2023-06-23 04:53:07,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1403652.0, ans=0.125
2023-06-23 04:53:26,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.17 vs. limit=15.0
2023-06-23 04:53:28,998 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 04:53:54,327 INFO [train.py:996] (3/4) Epoch 8, batch 20500, loss[loss=0.2315, simple_loss=0.2942, pruned_loss=0.08442, over 21300.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3229, pruned_loss=0.08831, over 4232897.61 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:53:55,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=22.5
2023-06-23 04:54:21,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1403832.0, ans=0.2
2023-06-23 04:54:48,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1403892.0, ans=0.125
2023-06-23 04:55:01,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1403952.0, ans=0.125
2023-06-23 04:55:28,227 INFO [train.py:996] (3/4) Epoch 8, batch 20550, loss[loss=0.2122, simple_loss=0.2833, pruned_loss=0.07055, over 21216.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3162, pruned_loss=0.08684, over 4239556.30 frames. ], batch size: 143, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:55:31,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.304e+02 5.833e+02 8.675e+02 1.439e+03, threshold=1.167e+03, percent-clipped=3.0
2023-06-23 04:55:35,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0
2023-06-23 04:56:42,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1404252.0, ans=0.0
2023-06-23 04:56:53,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1404312.0, ans=0.0
2023-06-23 04:57:07,602 INFO [train.py:996] (3/4) Epoch 8, batch 20600, loss[loss=0.2426, simple_loss=0.3056, pruned_loss=0.08981, over 21461.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3212, pruned_loss=0.08627, over 4244021.67 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:57:19,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1404372.0, ans=0.0
2023-06-23 04:58:18,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1404552.0, ans=10.0
2023-06-23 04:58:46,043 INFO [train.py:996] (3/4) Epoch 8, batch 20650, loss[loss=0.235, simple_loss=0.2918, pruned_loss=0.08907, over 21732.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3171, pruned_loss=0.0867, over 4255782.96 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 04:58:49,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.528e+02 4.605e+02 7.657e+02 1.188e+03 2.326e+03, threshold=1.531e+03, percent-clipped=25.0
2023-06-23 04:59:54,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1404852.0, ans=0.125
2023-06-23 05:00:04,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0
2023-06-23 05:00:15,180 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 05:00:26,461 INFO [train.py:996] (3/4) Epoch 8, batch 20700, loss[loss=0.2189, simple_loss=0.2913, pruned_loss=0.07324, over 21766.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3074, pruned_loss=0.08232, over 4259429.50 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 05:01:18,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1405092.0, ans=0.2
2023-06-23 05:02:07,883 INFO [train.py:996] (3/4) Epoch 8, batch 20750, loss[loss=0.1773, simple_loss=0.2328, pruned_loss=0.06089, over 18286.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3103, pruned_loss=0.0816, over 4258090.85 frames. ], batch size: 70, lr: 3.69e-03, grad_scale: 32.0
2023-06-23 05:02:08,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405272.0, ans=0.1
2023-06-23 05:02:11,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.120e+02 5.727e+02 9.009e+02 2.135e+03, threshold=1.145e+03, percent-clipped=5.0
2023-06-23 05:02:12,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1405272.0, ans=0.125
2023-06-23 05:02:36,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1405332.0, ans=0.125
2023-06-23 05:02:42,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1405332.0, ans=0.125
2023-06-23 05:03:06,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405392.0, ans=0.1
2023-06-23 05:03:09,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1405392.0, ans=0.125
2023-06-23 05:03:11,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1405392.0, ans=0.0
2023-06-23 05:03:47,501 INFO [train.py:996] (3/4) Epoch 8, batch 20800, loss[loss=0.234, simple_loss=0.2958, pruned_loss=0.08604, over 21526.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3135, pruned_loss=0.08242, over 4262092.77 frames. ], batch size: 414, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:04:01,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1405572.0, ans=0.2
2023-06-23 05:04:38,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1405692.0, ans=0.0
2023-06-23 05:05:20,434 INFO [train.py:996] (3/4) Epoch 8, batch 20850, loss[loss=0.254, simple_loss=0.3179, pruned_loss=0.09507, over 21714.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3075, pruned_loss=0.08085, over 4261955.59 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:05:22,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405872.0, ans=0.1
2023-06-23 05:05:28,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.968e+02 4.772e+02 9.225e+02 1.220e+03 2.670e+03, threshold=1.845e+03, percent-clipped=33.0
2023-06-23 05:05:30,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.16 vs. limit=15.0
2023-06-23 05:06:06,055 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 05:06:20,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1405992.0, ans=0.125
2023-06-23 05:06:27,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0
2023-06-23 05:06:38,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5
2023-06-23 05:06:48,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1406112.0, ans=0.0
2023-06-23 05:06:56,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0
2023-06-23 05:06:57,347 INFO [train.py:996] (3/4) Epoch 8, batch 20900, loss[loss=0.2532, simple_loss=0.3246, pruned_loss=0.09094, over 21845.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3079, pruned_loss=0.08199, over 4271520.78 frames. ], batch size: 351, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:08:02,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406352.0, ans=0.1
2023-06-23 05:08:10,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1406352.0, ans=0.125
2023-06-23 05:08:19,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1406412.0, ans=0.125
2023-06-23 05:08:19,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1406412.0, ans=0.125
2023-06-23 05:08:33,365 INFO [train.py:996] (3/4) Epoch 8, batch 20950, loss[loss=0.1794, simple_loss=0.2623, pruned_loss=0.04825, over 21437.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3023, pruned_loss=0.0782, over 4264475.58 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:08:36,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.522e+02 5.963e+02 9.256e+02 1.585e+03, threshold=1.193e+03, percent-clipped=0.0
2023-06-23 05:08:41,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1406472.0, ans=0.125
2023-06-23 05:09:24,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0
2023-06-23 05:09:33,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1406592.0, ans=0.015
2023-06-23 05:09:40,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1406652.0, ans=0.0
2023-06-23 05:09:43,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406652.0, ans=0.1
2023-06-23 05:10:11,176 INFO [train.py:996] (3/4) Epoch 8, batch 21000, loss[loss=0.2573, simple_loss=0.3782, pruned_loss=0.06824, over 19776.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3015, pruned_loss=0.07807, over 4252782.84 frames. ], batch size: 702, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:10:11,176 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 05:10:27,251 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2634, simple_loss=0.3611, pruned_loss=0.08288, over 1796401.00 frames.
2023-06-23 05:10:27,252 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24463MB
2023-06-23 05:10:27,669 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 05:10:32,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1406772.0, ans=0.0
2023-06-23 05:10:33,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1406772.0, ans=0.0
2023-06-23 05:11:01,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1406832.0, ans=0.0
2023-06-23 05:11:26,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1406892.0, ans=0.0
2023-06-23 05:11:29,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406892.0, ans=0.1
2023-06-23 05:12:04,098 INFO [train.py:996] (3/4) Epoch 8, batch 21050, loss[loss=0.262, simple_loss=0.3016, pruned_loss=0.1112, over 21425.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3005, pruned_loss=0.07882, over 4254271.33 frames. ], batch size: 509, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:12:07,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.662e+02 4.992e+02 6.776e+02 1.028e+03 2.055e+03, threshold=1.355e+03, percent-clipped=16.0
2023-06-23 05:12:41,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1407132.0, ans=0.0
2023-06-23 05:13:11,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1407252.0, ans=0.0
2023-06-23 05:13:39,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=22.5
2023-06-23 05:13:42,088 INFO [train.py:996] (3/4) Epoch 8, batch 21100, loss[loss=0.1759, simple_loss=0.2849, pruned_loss=0.03347, over 19818.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2973, pruned_loss=0.07826, over 4230154.59 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:14:07,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1407432.0, ans=0.125
2023-06-23 05:14:27,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407492.0, ans=0.1
2023-06-23 05:14:58,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1407612.0, ans=0.125
2023-06-23 05:15:14,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1407672.0, ans=0.125
2023-06-23 05:15:15,042 INFO [train.py:996] (3/4) Epoch 8, batch 21150, loss[loss=0.2375, simple_loss=0.2902, pruned_loss=0.09245, over 21681.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2938, pruned_loss=0.07853, over 4223447.91 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:15:17,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.999e+02 4.608e+02 5.910e+02 9.200e+02 1.578e+03, threshold=1.182e+03, percent-clipped=4.0
2023-06-23 05:15:18,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1407672.0, ans=0.125
2023-06-23 05:15:51,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1407732.0, ans=0.125
2023-06-23 05:16:40,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1407912.0, ans=0.125
2023-06-23 05:16:54,275 INFO [train.py:996] (3/4) Epoch 8, batch 21200, loss[loss=0.1664, simple_loss=0.2409, pruned_loss=0.04596, over 20779.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2908, pruned_loss=0.07833, over 4238743.63 frames. ], batch size: 608, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:17:47,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1408092.0, ans=0.125
2023-06-23 05:17:54,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1408092.0, ans=0.125
2023-06-23 05:18:07,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0
2023-06-23 05:18:31,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1408272.0, ans=0.5
2023-06-23 05:18:32,274 INFO [train.py:996] (3/4) Epoch 8, batch 21250, loss[loss=0.2627, simple_loss=0.3281, pruned_loss=0.09859, over 21632.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2894, pruned_loss=0.07869, over 4240458.48 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:18:41,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.434e+02 5.437e+02 7.242e+02 2.137e+03, threshold=1.087e+03, percent-clipped=7.0
2023-06-23 05:18:43,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1408272.0, ans=0.125
2023-06-23 05:19:23,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=15.0
2023-06-23 05:19:36,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1408392.0, ans=0.2
2023-06-23 05:19:41,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1408452.0, ans=0.1
2023-06-23 05:19:44,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0
2023-06-23 05:19:49,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1408452.0, ans=0.125
2023-06-23 05:19:57,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1408512.0, ans=0.125
2023-06-23 05:20:11,416 INFO [train.py:996] (3/4) Epoch 8, batch 21300, loss[loss=0.2436, simple_loss=0.3095, pruned_loss=0.08881, over 21372.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2955, pruned_loss=0.08024, over 4244574.27 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:20:11,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1408572.0, ans=0.1
2023-06-23 05:21:23,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1408752.0, ans=0.0
2023-06-23 05:21:54,420 INFO [train.py:996] (3/4) Epoch 8, batch 21350, loss[loss=0.213, simple_loss=0.3117, pruned_loss=0.0571, over 21645.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3001, pruned_loss=0.08055, over 4259507.91 frames. ], batch size: 389, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:22:10,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.040e+02 5.053e+02 6.684e+02 9.217e+02 2.330e+03, threshold=1.337e+03, percent-clipped=18.0
2023-06-23 05:22:32,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0
2023-06-23 05:22:51,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0
2023-06-23 05:23:06,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.50 vs. limit=22.5
2023-06-23 05:23:13,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1409112.0, ans=0.0
2023-06-23 05:23:18,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1409112.0, ans=0.0
2023-06-23 05:23:34,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409112.0, ans=0.1
2023-06-23 05:23:38,546 INFO [train.py:996] (3/4) Epoch 8, batch 21400, loss[loss=0.2159, simple_loss=0.307, pruned_loss=0.06238, over 21830.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3041, pruned_loss=0.08045, over 4268365.36 frames. ], batch size: 371, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:23:47,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1409172.0, ans=0.1
2023-06-23 05:24:05,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1409232.0, ans=0.2
2023-06-23 05:25:22,967 INFO [train.py:996] (3/4) Epoch 8, batch 21450, loss[loss=0.2695, simple_loss=0.3308, pruned_loss=0.1041, over 21599.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3073, pruned_loss=0.08251, over 4266403.97 frames. ], batch size: 548, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:25:28,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.984e+02 4.393e+02 5.335e+02 6.741e+02 1.398e+03, threshold=1.067e+03, percent-clipped=1.0
2023-06-23 05:25:50,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0
2023-06-23 05:27:01,220 INFO [train.py:996] (3/4) Epoch 8, batch 21500, loss[loss=0.217, simple_loss=0.2806, pruned_loss=0.07671, over 21678.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3057, pruned_loss=0.08272, over 4260103.03 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:27:01,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1409772.0, ans=0.125
2023-06-23 05:27:57,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1409952.0, ans=0.125
2023-06-23 05:28:38,641 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 05:28:39,685 INFO [train.py:996] (3/4) Epoch 8, batch 21550, loss[loss=0.1627, simple_loss=0.238, pruned_loss=0.0437, over 21649.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2983, pruned_loss=0.07957, over 4271410.04 frames. ], batch size: 298, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:28:46,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 4.565e+02 6.143e+02 8.904e+02 1.889e+03, threshold=1.229e+03, percent-clipped=13.0
2023-06-23 05:29:12,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1410132.0, ans=0.025
2023-06-23 05:29:54,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1410312.0, ans=0.1
2023-06-23 05:30:17,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1410312.0, ans=0.035
2023-06-23 05:30:26,673 INFO [train.py:996] (3/4) Epoch 8, batch 21600, loss[loss=0.1839, simple_loss=0.27, pruned_loss=0.04883, over 21377.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2944, pruned_loss=0.07869, over 4261985.87 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:30:57,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1410492.0, ans=0.0
2023-06-23 05:31:40,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1410612.0, ans=0.2
2023-06-23 05:31:53,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1410612.0, ans=0.2
2023-06-23 05:32:05,247 INFO [train.py:996] (3/4) Epoch 8, batch 21650, loss[loss=0.2023, simple_loss=0.2861, pruned_loss=0.05924, over 21198.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2975, pruned_loss=0.07632, over 4260263.70 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:32:05,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1410672.0, ans=0.02
2023-06-23 05:32:10,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.286e+02 5.401e+02 7.635e+02 1.107e+03 2.032e+03, threshold=1.527e+03, percent-clipped=20.0
2023-06-23 05:32:53,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1410852.0, ans=0.125
2023-06-23 05:32:54,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1410852.0, ans=0.2
2023-06-23 05:33:22,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1410912.0, ans=0.125
2023-06-23 05:33:33,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1410912.0, ans=0.125
2023-06-23 05:33:36,427 INFO [train.py:996] (3/4) Epoch 8, batch 21700, loss[loss=0.2276, simple_loss=0.2932, pruned_loss=0.08098, over 21551.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2962, pruned_loss=0.07456, over 4248577.21 frames. ], batch size: 414, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:33:55,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411032.0, ans=0.1
2023-06-23 05:34:24,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411092.0, ans=0.125
2023-06-23 05:35:02,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411212.0, ans=0.125
2023-06-23 05:35:15,271 INFO [train.py:996] (3/4) Epoch 8, batch 21750, loss[loss=0.2078, simple_loss=0.2643, pruned_loss=0.07569, over 21477.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.294, pruned_loss=0.07412, over 4251462.99 frames. ], batch size: 195, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:35:27,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.568e+02 6.230e+02 8.144e+02 2.277e+03, threshold=1.246e+03, percent-clipped=1.0
2023-06-23 05:35:35,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411332.0, ans=0.1
2023-06-23 05:35:37,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1411332.0, ans=0.125
2023-06-23 05:36:08,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0
2023-06-23 05:36:12,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1411452.0, ans=0.0
2023-06-23 05:36:26,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0
2023-06-23 05:37:00,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1411572.0, ans=0.125
2023-06-23 05:37:01,100 INFO [train.py:996] (3/4) Epoch 8, batch 21800, loss[loss=0.2462, simple_loss=0.3377, pruned_loss=0.07731, over 21766.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2927, pruned_loss=0.07555, over 4253056.97 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:37:13,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.53 vs. limit=15.0
2023-06-23 05:37:40,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1411692.0, ans=0.05
2023-06-23 05:37:45,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0
2023-06-23 05:37:58,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0
2023-06-23 05:38:39,302 INFO [train.py:996] (3/4) Epoch 8, batch 21850, loss[loss=0.2969, simple_loss=0.352, pruned_loss=0.1209, over 21791.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2998, pruned_loss=0.07629, over 4246596.43 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:38:47,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.593e+02 6.628e+02 8.915e+02 2.617e+03, threshold=1.326e+03, percent-clipped=11.0
2023-06-23 05:39:07,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5
2023-06-23 05:39:22,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0
2023-06-23 05:40:11,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412112.0, ans=0.1
2023-06-23 05:40:20,477 INFO [train.py:996] (3/4) Epoch 8, batch 21900, loss[loss=0.2102, simple_loss=0.2708, pruned_loss=0.07475, over 21343.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3012, pruned_loss=0.07769, over 4262933.97 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:40:20,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1412172.0, ans=0.125
2023-06-23 05:40:23,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1412172.0, ans=0.0
2023-06-23 05:40:57,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412292.0, ans=0.1
2023-06-23 05:40:57,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412292.0, ans=0.1
2023-06-23 05:42:00,028 INFO [train.py:996] (3/4) Epoch 8, batch 21950, loss[loss=0.1929, simple_loss=0.2562, pruned_loss=0.06475, over 21777.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2955, pruned_loss=0.07639, over 4254012.85 frames. ], batch size: 107, lr: 3.68e-03, grad_scale: 16.0
2023-06-23 05:42:01,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1412472.0, ans=0.125
2023-06-23 05:42:07,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.901e+02 4.723e+02 6.314e+02 7.880e+02 1.650e+03, threshold=1.263e+03, percent-clipped=2.0
2023-06-23 05:42:11,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412472.0, ans=0.1
2023-06-23 05:42:29,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1412532.0, ans=0.125
2023-06-23 05:42:33,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1412592.0, ans=0.125
2023-06-23 05:43:01,372 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 05:43:32,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1412712.0, ans=0.125
2023-06-23 05:43:38,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1412772.0, ans=0.125
2023-06-23 05:43:38,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412772.0, ans=0.1
2023-06-23 05:43:38,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1412772.0, ans=0.0
2023-06-23 05:43:40,020 INFO [train.py:996] (3/4) Epoch 8, batch 22000, loss[loss=0.2283, simple_loss=0.3172, pruned_loss=0.06973, over 21228.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2912, pruned_loss=0.0745, over 4252198.55 frames. ], batch size: 549, lr: 3.68e-03, grad_scale: 32.0
2023-06-23 05:44:13,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0
2023-06-23 05:44:29,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0
2023-06-23 05:44:30,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1412892.0, ans=0.0
2023-06-23 05:44:53,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1412952.0, ans=0.125
2023-06-23 05:45:03,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1413012.0, ans=0.0
2023-06-23 05:45:21,155 INFO [train.py:996] (3/4) Epoch 8, batch 22050, loss[loss=0.2811, simple_loss=0.3591, pruned_loss=0.1016, over 21911.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2963, pruned_loss=0.07578, over 4250909.18 frames. ], batch size: 372, lr: 3.67e-03, grad_scale: 8.0
2023-06-23 05:45:33,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.843e+02 7.365e+02 1.302e+03 3.775e+03, threshold=1.473e+03, percent-clipped=26.0
2023-06-23 05:45:38,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5
2023-06-23 05:45:50,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1413132.0, ans=0.125
2023-06-23 05:46:29,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0
2023-06-23 05:47:02,684 INFO [train.py:996] (3/4) Epoch 8, batch 22100, loss[loss=0.2447, simple_loss=0.3212, pruned_loss=0.08417, over 21711.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3059, pruned_loss=0.08028, over 4251963.40 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 8.0
2023-06-23 05:47:58,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1413552.0, ans=0.0
2023-06-23 05:48:30,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1413612.0, ans=0.0
2023-06-23 05:48:38,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1413612.0, ans=0.125
2023-06-23 05:48:41,573 INFO [train.py:996] (3/4) Epoch 8, batch 22150, loss[loss=0.2429, simple_loss=0.3051, pruned_loss=0.09035, over 21200.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3089, pruned_loss=0.08192, over 4261114.63 frames. ], batch size: 143, lr: 3.67e-03, grad_scale: 8.0
2023-06-23 05:48:44,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5
2023-06-23 05:48:51,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1413672.0, ans=0.0
2023-06-23 05:48:52,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.829e+02 6.848e+02 1.021e+03 2.130e+03, threshold=1.370e+03, percent-clipped=6.0
2023-06-23 05:48:54,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1413672.0, ans=0.035
2023-06-23 05:48:56,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1413732.0, ans=0.125
2023-06-23 05:49:13,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1413792.0, ans=0.125
2023-06-23 05:49:40,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1413852.0, ans=0.0
2023-06-23 05:49:48,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1413852.0, ans=0.125
2023-06-23 05:50:11,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1413912.0, ans=0.0
2023-06-23 05:50:15,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0
2023-06-23 05:50:21,009 INFO [train.py:996] (3/4) Epoch 8, batch 22200, loss[loss=0.3283, simple_loss=0.4034, pruned_loss=0.1266, over 21568.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3107, pruned_loss=0.08342, over 4270590.43 frames.
], batch size: 471, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:50:34,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1413972.0, ans=0.125 2023-06-23 05:50:44,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414032.0, ans=0.125 2023-06-23 05:51:02,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-23 05:51:02,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0 2023-06-23 05:51:33,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-23 05:51:54,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1414212.0, ans=0.0 2023-06-23 05:52:00,973 INFO [train.py:996] (3/4) Epoch 8, batch 22250, loss[loss=0.2844, simple_loss=0.3583, pruned_loss=0.1053, over 21339.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3175, pruned_loss=0.08567, over 4275630.31 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:52:12,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.033e+02 6.376e+02 9.699e+02 1.847e+03, threshold=1.275e+03, percent-clipped=11.0 2023-06-23 05:52:14,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1414272.0, ans=0.0 2023-06-23 05:52:16,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414332.0, ans=0.1 2023-06-23 05:52:20,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-23 05:52:50,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1414392.0, ans=0.125 2023-06-23 05:53:40,284 INFO [train.py:996] (3/4) Epoch 8, batch 22300, loss[loss=0.2425, simple_loss=0.3052, pruned_loss=0.08994, over 21688.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3189, pruned_loss=0.08688, over 4271934.09 frames. ], batch size: 263, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:53:52,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1414572.0, ans=0.0 2023-06-23 05:53:56,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-23 05:54:03,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-23 05:54:36,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.51 vs. limit=22.5 2023-06-23 05:55:13,962 INFO [train.py:996] (3/4) Epoch 8, batch 22350, loss[loss=0.2243, simple_loss=0.2889, pruned_loss=0.07984, over 21033.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3187, pruned_loss=0.08836, over 4279387.29 frames. 
], batch size: 607, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:55:23,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-23 05:55:25,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 4.765e+02 6.117e+02 7.891e+02 1.509e+03, threshold=1.223e+03, percent-clipped=2.0 2023-06-23 05:55:50,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1414992.0, ans=0.125 2023-06-23 05:55:55,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-23 05:56:23,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-23 05:56:48,466 INFO [train.py:996] (3/4) Epoch 8, batch 22400, loss[loss=0.2168, simple_loss=0.2882, pruned_loss=0.07268, over 21430.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3149, pruned_loss=0.08605, over 4283036.81 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:56:52,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1415172.0, ans=0.0 2023-06-23 05:57:37,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1415292.0, ans=0.125 2023-06-23 05:58:08,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1415352.0, ans=0.0 2023-06-23 05:58:08,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1415352.0, ans=0.0 2023-06-23 05:58:26,816 INFO [train.py:996] (3/4) Epoch 8, batch 22450, loss[loss=0.2174, simple_loss=0.2775, pruned_loss=0.07862, over 16231.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3076, pruned_loss=0.08398, over 4281277.00 frames. ], batch size: 66, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:58:27,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-23 05:58:37,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 3.949e+02 5.140e+02 7.263e+02 1.360e+03, threshold=1.028e+03, percent-clipped=2.0 2023-06-23 06:00:06,791 INFO [train.py:996] (3/4) Epoch 8, batch 22500, loss[loss=0.2603, simple_loss=0.3442, pruned_loss=0.08817, over 21652.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3039, pruned_loss=0.08381, over 4274718.73 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:00:28,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1415832.0, ans=0.125 2023-06-23 06:01:47,294 INFO [train.py:996] (3/4) Epoch 8, batch 22550, loss[loss=0.2498, simple_loss=0.3237, pruned_loss=0.08794, over 21795.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3069, pruned_loss=0.0831, over 4282442.02 frames. 
], batch size: 298, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:02:04,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 5.264e+02 6.977e+02 1.047e+03 2.151e+03, threshold=1.395e+03, percent-clipped=25.0 2023-06-23 06:02:34,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1416192.0, ans=0.125 2023-06-23 06:03:26,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1416312.0, ans=0.125 2023-06-23 06:03:29,259 INFO [train.py:996] (3/4) Epoch 8, batch 22600, loss[loss=0.2621, simple_loss=0.36, pruned_loss=0.08209, over 21231.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3098, pruned_loss=0.08308, over 4279195.36 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:03:39,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-23 06:03:39,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 06:03:58,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1416432.0, ans=0.125 2023-06-23 06:04:43,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-23 06:05:05,341 INFO [train.py:996] (3/4) Epoch 8, batch 22650, loss[loss=0.2084, simple_loss=0.265, pruned_loss=0.07591, over 21464.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3066, pruned_loss=0.08252, over 4258086.42 frames. ], batch size: 195, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:05:21,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.131e+02 9.012e+02 1.354e+03 2.560e+03, threshold=1.802e+03, percent-clipped=24.0 2023-06-23 06:05:46,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-23 06:06:00,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-23 06:06:37,785 INFO [train.py:996] (3/4) Epoch 8, batch 22700, loss[loss=0.2149, simple_loss=0.2673, pruned_loss=0.08127, over 21219.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3016, pruned_loss=0.08311, over 4268881.69 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:07:23,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-23 06:07:41,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 06:07:41,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-23 06:07:53,606 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:08:16,114 INFO [train.py:996] (3/4) Epoch 8, batch 22750, loss[loss=0.1814, simple_loss=0.2336, pruned_loss=0.06461, over 20689.00 frames. 
], tot_loss[loss=0.2363, simple_loss=0.3025, pruned_loss=0.08503, over 4254522.47 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:08:21,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-23 06:08:31,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.804e+02 6.420e+02 9.928e+02 2.099e+03, threshold=1.284e+03, percent-clipped=4.0 2023-06-23 06:09:54,196 INFO [train.py:996] (3/4) Epoch 8, batch 22800, loss[loss=0.239, simple_loss=0.3138, pruned_loss=0.0821, over 21877.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3049, pruned_loss=0.08598, over 4263710.18 frames. ], batch size: 107, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:10:07,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417572.0, ans=0.1 2023-06-23 06:10:48,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-23 06:10:56,305 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:11:32,437 INFO [train.py:996] (3/4) Epoch 8, batch 22850, loss[loss=0.2215, simple_loss=0.28, pruned_loss=0.08149, over 21848.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3017, pruned_loss=0.08522, over 4260490.25 frames. ], batch size: 118, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:11:35,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1417872.0, ans=0.125 2023-06-23 06:11:49,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.341e+02 7.317e+02 9.622e+02 1.873e+03, threshold=1.463e+03, percent-clipped=13.0 2023-06-23 06:12:06,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-23 06:12:13,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-23 06:12:22,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-23 06:12:31,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1418052.0, ans=0.125 2023-06-23 06:12:35,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1418052.0, ans=22.5 2023-06-23 06:13:07,137 INFO [train.py:996] (3/4) Epoch 8, batch 22900, loss[loss=0.1607, simple_loss=0.2135, pruned_loss=0.054, over 16154.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.304, pruned_loss=0.08436, over 4254816.16 frames. 
], batch size: 60, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:13:14,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1418172.0, ans=0.125 2023-06-23 06:13:40,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1418232.0, ans=0.09899494936611666 2023-06-23 06:14:03,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1418292.0, ans=0.0 2023-06-23 06:14:10,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-23 06:14:26,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1418352.0, ans=0.0 2023-06-23 06:14:34,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1418412.0, ans=0.2 2023-06-23 06:14:43,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1418412.0, ans=0.09899494936611666 2023-06-23 06:14:46,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-23 06:14:56,811 INFO [train.py:996] (3/4) Epoch 8, batch 22950, loss[loss=0.2201, simple_loss=0.3247, pruned_loss=0.05775, over 21450.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3157, pruned_loss=0.08253, over 4260247.45 frames. ], batch size: 211, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:14:58,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1418472.0, ans=0.0 2023-06-23 06:15:03,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1418472.0, ans=0.0 2023-06-23 06:15:10,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.029e+02 4.953e+02 7.269e+02 1.039e+03 2.026e+03, threshold=1.454e+03, percent-clipped=12.0 2023-06-23 06:15:10,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1418472.0, ans=0.09899494936611666 2023-06-23 06:15:12,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1418532.0, ans=0.125 2023-06-23 06:15:37,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1418592.0, ans=0.125 2023-06-23 06:15:38,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.29 vs. limit=10.0 2023-06-23 06:16:00,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1418652.0, ans=0.0 2023-06-23 06:16:36,800 INFO [train.py:996] (3/4) Epoch 8, batch 23000, loss[loss=0.2727, simple_loss=0.3291, pruned_loss=0.1081, over 21538.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3157, pruned_loss=0.08024, over 4261395.96 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:18:12,361 INFO [train.py:996] (3/4) Epoch 8, batch 23050, loss[loss=0.2525, simple_loss=0.3217, pruned_loss=0.09166, over 21450.00 frames. 
], tot_loss[loss=0.2412, simple_loss=0.3175, pruned_loss=0.08242, over 4270851.45 frames. ], batch size: 211, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:18:25,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.192e+02 4.592e+02 5.368e+02 6.927e+02 1.540e+03, threshold=1.074e+03, percent-clipped=1.0 2023-06-23 06:19:38,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-23 06:19:42,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419312.0, ans=0.1 2023-06-23 06:19:47,022 INFO [train.py:996] (3/4) Epoch 8, batch 23100, loss[loss=0.1968, simple_loss=0.2542, pruned_loss=0.06964, over 21598.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3127, pruned_loss=0.08278, over 4272164.74 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:20:30,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1419492.0, ans=0.125 2023-06-23 06:20:40,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1419552.0, ans=0.125 2023-06-23 06:21:05,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1419612.0, ans=0.05 2023-06-23 06:21:09,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-23 06:21:21,795 INFO [train.py:996] (3/4) Epoch 8, batch 23150, loss[loss=0.2054, simple_loss=0.2739, pruned_loss=0.06843, over 21830.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3071, pruned_loss=0.08189, over 4280119.90 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:21:22,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1419672.0, ans=0.125 2023-06-23 06:21:33,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419672.0, ans=0.1 2023-06-23 06:21:34,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.444e+02 4.721e+02 6.329e+02 9.421e+02 1.968e+03, threshold=1.266e+03, percent-clipped=20.0 2023-06-23 06:21:41,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-23 06:21:42,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1419732.0, ans=0.125 2023-06-23 06:21:58,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1419792.0, ans=0.07 2023-06-23 06:22:14,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-06-23 06:22:15,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1419792.0, ans=0.125 2023-06-23 06:22:20,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1419852.0, ans=0.0 2023-06-23 06:22:37,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1419852.0, ans=0.0 2023-06-23 06:22:37,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1419852.0, ans=0.125 2023-06-23 06:22:59,177 INFO [train.py:996] (3/4) Epoch 8, batch 23200, loss[loss=0.228, simple_loss=0.2983, pruned_loss=0.07889, over 21383.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3056, pruned_loss=0.08255, over 4286540.97 frames. ], batch size: 159, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:23:23,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420032.0, ans=0.1 2023-06-23 06:24:37,810 INFO [train.py:996] (3/4) Epoch 8, batch 23250, loss[loss=0.2419, simple_loss=0.3076, pruned_loss=0.08812, over 21901.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3071, pruned_loss=0.08397, over 4286055.43 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:24:39,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1420272.0, ans=0.2 2023-06-23 06:24:50,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.546e+02 4.969e+02 6.559e+02 1.052e+03 2.390e+03, threshold=1.312e+03, percent-clipped=18.0 2023-06-23 06:25:15,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1420392.0, ans=0.2 2023-06-23 06:25:24,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1420392.0, ans=0.0 2023-06-23 06:25:32,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420392.0, ans=0.1 2023-06-23 06:25:51,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-23 06:26:18,060 INFO [train.py:996] (3/4) Epoch 8, batch 23300, loss[loss=0.2447, simple_loss=0.3378, pruned_loss=0.07582, over 21811.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3149, pruned_loss=0.08669, over 4282383.69 frames. ], batch size: 282, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:26:20,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1420572.0, ans=0.125 2023-06-23 06:26:40,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 06:26:57,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=22.5 2023-06-23 06:27:06,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1420692.0, ans=0.0 2023-06-23 06:27:27,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1420752.0, ans=0.125 2023-06-23 06:27:28,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-23 06:27:47,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1420812.0, ans=0.0 2023-06-23 06:27:51,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-23 06:27:58,437 INFO [train.py:996] (3/4) Epoch 8, batch 23350, loss[loss=0.2479, simple_loss=0.3384, pruned_loss=0.07869, over 20712.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3187, pruned_loss=0.08501, over 4274402.23 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:28:18,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.315e+02 4.912e+02 6.155e+02 8.820e+02 1.771e+03, threshold=1.231e+03, percent-clipped=5.0 2023-06-23 06:29:25,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-23 06:29:29,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1421112.0, ans=0.0 2023-06-23 06:29:37,109 INFO [train.py:996] (3/4) Epoch 8, batch 23400, loss[loss=0.1772, simple_loss=0.2665, pruned_loss=0.04395, over 21093.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.311, pruned_loss=0.08072, over 4279501.41 frames. ], batch size: 608, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:29:54,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1421172.0, ans=0.125 2023-06-23 06:30:35,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1421292.0, ans=0.125 2023-06-23 06:31:08,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1421412.0, ans=0.09899494936611666 2023-06-23 06:31:20,205 INFO [train.py:996] (3/4) Epoch 8, batch 23450, loss[loss=0.2664, simple_loss=0.3493, pruned_loss=0.09176, over 21858.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3108, pruned_loss=0.08248, over 4284996.98 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:25,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1421472.0, ans=0.125 2023-06-23 06:31:38,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.296e+02 5.237e+02 7.563e+02 1.579e+03, threshold=1.047e+03, percent-clipped=8.0 2023-06-23 06:31:45,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-23 06:32:37,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. 
limit=6.0 2023-06-23 06:32:58,358 INFO [train.py:996] (3/4) Epoch 8, batch 23500, loss[loss=0.2308, simple_loss=0.2991, pruned_loss=0.08128, over 21884.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3112, pruned_loss=0.08324, over 4280652.19 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:33:20,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-23 06:34:24,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1422012.0, ans=0.2 2023-06-23 06:34:35,782 INFO [train.py:996] (3/4) Epoch 8, batch 23550, loss[loss=0.2048, simple_loss=0.2655, pruned_loss=0.07203, over 21579.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3073, pruned_loss=0.08374, over 4290381.27 frames. ], batch size: 213, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:34:53,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2023-06-23 06:34:54,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.322e+02 4.998e+02 7.038e+02 9.548e+02 2.153e+03, threshold=1.408e+03, percent-clipped=14.0 2023-06-23 06:35:14,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1422132.0, ans=0.125 2023-06-23 06:36:00,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1422312.0, ans=0.04949747468305833 2023-06-23 06:36:18,152 INFO [train.py:996] (3/4) Epoch 8, batch 23600, loss[loss=0.2558, simple_loss=0.3358, pruned_loss=0.08787, over 21453.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3067, pruned_loss=0.0839, over 4284530.11 frames. ], batch size: 131, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:37:09,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-23 06:37:10,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1422492.0, ans=0.2 2023-06-23 06:37:15,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1422552.0, ans=0.09899494936611666 2023-06-23 06:37:28,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1422552.0, ans=0.125 2023-06-23 06:37:37,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-23 06:37:58,171 INFO [train.py:996] (3/4) Epoch 8, batch 23650, loss[loss=0.3028, simple_loss=0.3724, pruned_loss=0.1166, over 21563.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.307, pruned_loss=0.08228, over 4276835.52 frames. 
], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:38:18,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1422672.0, ans=0.125 2023-06-23 06:38:22,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 4.602e+02 5.917e+02 8.221e+02 1.589e+03, threshold=1.183e+03, percent-clipped=3.0 2023-06-23 06:39:17,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1422852.0, ans=0.125 2023-06-23 06:39:48,534 INFO [train.py:996] (3/4) Epoch 8, batch 23700, loss[loss=0.1994, simple_loss=0.2812, pruned_loss=0.05875, over 21745.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3093, pruned_loss=0.08205, over 4274992.72 frames. ], batch size: 282, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:40:10,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1423032.0, ans=0.2 2023-06-23 06:40:31,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1423092.0, ans=0.5 2023-06-23 06:40:44,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1423092.0, ans=0.0 2023-06-23 06:41:22,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1423212.0, ans=0.125 2023-06-23 06:41:28,734 INFO [train.py:996] (3/4) Epoch 8, batch 23750, loss[loss=0.2282, simple_loss=0.2988, pruned_loss=0.07876, over 21040.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3109, pruned_loss=0.08214, over 4272992.59 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:41:42,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 4.173e+02 5.450e+02 7.281e+02 1.269e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-23 06:41:44,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1423332.0, ans=0.0 2023-06-23 06:42:30,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1423452.0, ans=0.125 2023-06-23 06:42:51,705 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:43:07,497 INFO [train.py:996] (3/4) Epoch 8, batch 23800, loss[loss=0.2404, simple_loss=0.3236, pruned_loss=0.07856, over 20643.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3106, pruned_loss=0.08032, over 4268768.03 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:44:34,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423812.0, ans=0.1 2023-06-23 06:44:47,934 INFO [train.py:996] (3/4) Epoch 8, batch 23850, loss[loss=0.2234, simple_loss=0.2893, pruned_loss=0.07874, over 20034.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.319, pruned_loss=0.08265, over 4265424.24 frames. 
], batch size: 702, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:45:07,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.341e+02 5.290e+02 6.961e+02 9.016e+02 2.497e+03, threshold=1.392e+03, percent-clipped=15.0 2023-06-23 06:45:12,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1423932.0, ans=0.0 2023-06-23 06:45:56,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1424052.0, ans=0.0 2023-06-23 06:46:22,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1424112.0, ans=0.2 2023-06-23 06:46:33,155 INFO [train.py:996] (3/4) Epoch 8, batch 23900, loss[loss=0.2742, simple_loss=0.3496, pruned_loss=0.09938, over 21751.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.326, pruned_loss=0.08561, over 4273192.17 frames. ], batch size: 351, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:46:33,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1424172.0, ans=0.125 2023-06-23 06:47:38,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-23 06:47:41,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1424352.0, ans=0.04949747468305833 2023-06-23 06:48:06,323 INFO [train.py:996] (3/4) Epoch 8, batch 23950, loss[loss=0.2586, simple_loss=0.3096, pruned_loss=0.1038, over 15008.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.321, pruned_loss=0.08542, over 4270610.89 frames. ], batch size: 60, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:48:25,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.604e+02 5.747e+02 7.946e+02 1.092e+03 1.988e+03, threshold=1.589e+03, percent-clipped=11.0 2023-06-23 06:48:29,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.27 vs. limit=10.0 2023-06-23 06:48:48,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424532.0, ans=0.1 2023-06-23 06:48:48,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424532.0, ans=0.1 2023-06-23 06:49:23,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-23 06:49:45,374 INFO [train.py:996] (3/4) Epoch 8, batch 24000, loss[loss=0.2577, simple_loss=0.33, pruned_loss=0.09272, over 21396.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3215, pruned_loss=0.08767, over 4273208.93 frames. ], batch size: 549, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:49:45,375 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 06:50:04,123 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2639, simple_loss=0.3603, pruned_loss=0.08376, over 1796401.00 frames. 
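A note on reading the train.py:996 records above and below: each one reports two losses, loss[..., over N frames.] for the current batch and tot_loss[..., over M frames.] for a frames-weighted aggregate, which is why tot_loss moves smoothly from batch to batch while the per-batch loss jumps around. The fractional aggregate frame counts (e.g. "over 4248577.21 frames") suggest the aggregate is decayed rather than a plain epoch sum. The following is a minimal sketch of that bookkeeping under that assumption; the class name RunningLoss and the decay constant are illustrative, not train.py's actual implementation:

    class RunningLoss:
        """Frames-weighted running loss with exponential forgetting."""

        def __init__(self, decay: float = 0.999):  # decay value is an assumption
            self.decay = decay
            self.loss_sum = 0.0   # decayed sum of (per-frame loss * frames)
            self.frames = 0.0     # decayed frame count (hence fractional)

        def update(self, batch_loss: float, batch_frames: float) -> None:
            # batch_loss is already a per-frame average over batch_frames.
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def tot_loss(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tracker = RunningLoss()
    tracker.update(0.2462, 21766.0)  # e.g. batch 21800's loss over 21766 frames

Weighting by frames rather than by batch makes batches of very different durations (batch sizes here range from 60 to 702 cuts) contribute in proportion to how much audio they actually cover.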
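The optim.py:471 records can be decoded the same way. Each prints five quantile points (min, 25%, median, 75%, max) over a window of recent gradient norms, and in every record in this section the printed threshold is exactly Clipping_scale times the median: e.g. "quartiles 3.279e+02 4.739e+02 5.574e+02 8.138e+02 1.478e+03, threshold=1.115e+03" at 06:52:08 gives 2.0 x 5.574e+02 = 1.115e+03. percent-clipped is then the share of recent batches whose norm exceeded that threshold. A sketch that reproduces those statistics is below; only the threshold = clipping_scale * median relationship is verified against the log, while the window size and the reset behaviour of percent-clipped are guesses:

    from collections import deque

    import torch

    class GradNormStats:
        def __init__(self, clipping_scale: float = 2.0, window: int = 200):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent total grad norms
            self.num_seen = 0
            self.num_clipped = 0

        def update(self, grad_norm: float) -> float:
            """Record one batch's total grad norm; return the clip threshold."""
            self.norms.append(grad_norm)
            median = torch.tensor(list(self.norms)).quantile(0.5).item()
            threshold = self.clipping_scale * median
            self.num_seen += 1
            self.num_clipped += int(grad_norm > threshold)
            return threshold

        def log_line(self) -> str:
            t = torch.tensor(list(self.norms))
            qs = [t.quantile(q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            pct = 100.0 * self.num_clipped / max(self.num_seen, 1)
            self.num_seen = self.num_clipped = 0  # assume reset at each print
            return ("grad-norm quartiles "
                    + " ".join(f"{q:.3e}" for q in qs)
                    + f", threshold={self.clipping_scale * qs[2]:.3e}"
                    + f", percent-clipped={pct:.1f}")

Clipping to a multiple of the recent median, instead of to a fixed constant, keeps the threshold meaningful as gradient norms drift over training; the high percent-clipped values in some records here (20-26%) mark stretches where norms spiked well above their recent median.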
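The scaling.py:182 records ("ScheduledFloat: name=..., batch_count=..., ans=...") log module hyperparameters (dropout probabilities, skip rates, bypass scale bounds) together with the batch count at which they were evaluated, i.e. these are values scheduled on batch_count rather than constants; by this point in the run (batch_count around 1.41e6) they have settled at their final values. A common way to implement such a schedule is piecewise-linear interpolation over (batch_count, value) breakpoints, held constant past the last breakpoint, sketched below; the breakpoints are invented for illustration and only the 0.125 floor is taken from the "encoder_embed.conv.8.prob ... ans=0.125" readings in this log:

    import bisect
    from typing import Tuple

    class PiecewiseSchedule:
        """A value that is a piecewise-linear function of batch_count,
        held constant outside the given breakpoints."""

        def __init__(self, *points: Tuple[float, float]):
            # points: (batch_count, value) pairs, sorted by batch_count.
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def __call__(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # Invented breakpoints; the floor matches the late-training readings above.
    prob = PiecewiseSchedule((0.0, 0.5), (20000.0, 0.125))
    assert prob(1412472.0) == 0.125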
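Finally, the scaling.py:962 "Whitening" records compare a measured metric against a per-module limit (e.g. "metric=2.54 vs. limit=6.0" for whiten_keys). The metric behaves like a scale-invariant measure of how far the activations' covariance is from white: it is at least 1.0, equals 1.0 for an isotropic covariance, and grows as variance concentrates in fewer directions, with the module constraining activations only when the metric exceeds the limit. One standard statistic with exactly those properties is tr(C^2)/tr(C)^2 scaled by the number of channels; the sketch below computes it and should be read as a plausible reconstruction consistent with the logged numbers, not the literal scaling.py module:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
        """x: (num_frames, num_channels); channels split into num_groups groups.
        Returns a statistic >= 1.0 that is 1.0 iff each group's covariance
        is a multiple of the identity."""
        num_frames, num_channels = x.shape
        cpg = num_channels // num_groups          # channels per group
        x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)       # center each group
        cov = x.transpose(1, 2) @ x               # (groups, cpg, cpg), unnormalized
        mean_diag = cov.diagonal(dim1=1, dim2=2).mean()
        # mean diagonal entry of C @ C, i.e. (C**2).sum() over groups and rows:
        mean_diag_of_cov_sq = (cov ** 2).sum() / (num_groups * cpg)
        return mean_diag_of_cov_sq / (mean_diag ** 2 + 1e-20)

    white = torch.randn(10000, 256)
    print(whitening_metric(white))      # ~1.03: near 1.0 for white noise
    collapsed = white[:, :1].repeat(1, 256)
    print(whitening_metric(collapsed))  # ~256: all variance in one direction

On this reading, the moderate values throughout this section (metric 2-15 against limits of 6-22.5) indicate encoder activations staying comfortably inside their whitening limits at this stage of training.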
2023-06-23 06:50:04,123 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 06:50:18,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1424772.0, ans=10.0 2023-06-23 06:50:21,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1424772.0, ans=0.125 2023-06-23 06:50:28,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1424772.0, ans=0.0 2023-06-23 06:50:34,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1424832.0, ans=0.2 2023-06-23 06:50:46,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424892.0, ans=0.1 2023-06-23 06:50:53,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424892.0, ans=0.1 2023-06-23 06:51:25,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1425012.0, ans=0.0 2023-06-23 06:51:28,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-23 06:51:34,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425012.0, ans=0.1 2023-06-23 06:51:43,399 INFO [train.py:996] (3/4) Epoch 8, batch 24050, loss[loss=0.2189, simple_loss=0.3103, pruned_loss=0.06377, over 21895.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3235, pruned_loss=0.08816, over 4276574.17 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:51:47,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1425072.0, ans=0.125 2023-06-23 06:52:08,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.279e+02 4.739e+02 5.574e+02 8.138e+02 1.478e+03, threshold=1.115e+03, percent-clipped=0.0 2023-06-23 06:52:10,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1425132.0, ans=0.125 2023-06-23 06:52:54,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1425252.0, ans=0.125 2023-06-23 06:52:54,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-23 06:53:02,412 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.245e-03 2023-06-23 06:53:28,940 INFO [train.py:996] (3/4) Epoch 8, batch 24100, loss[loss=0.2706, simple_loss=0.3419, pruned_loss=0.0996, over 21635.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3235, pruned_loss=0.08678, over 4267760.85 frames. 
], batch size: 230, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:54:23,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425552.0, ans=0.1 2023-06-23 06:54:48,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1425612.0, ans=0.125 2023-06-23 06:54:53,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425612.0, ans=0.1 2023-06-23 06:55:07,234 INFO [train.py:996] (3/4) Epoch 8, batch 24150, loss[loss=0.2808, simple_loss=0.359, pruned_loss=0.1013, over 20626.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3255, pruned_loss=0.08902, over 4276583.94 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:55:15,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-23 06:55:22,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.625e+02 4.851e+02 6.515e+02 9.296e+02 1.728e+03, threshold=1.303e+03, percent-clipped=14.0 2023-06-23 06:55:26,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-23 06:55:29,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1425732.0, ans=0.125 2023-06-23 06:55:37,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1425732.0, ans=0.2 2023-06-23 06:56:43,546 INFO [train.py:996] (3/4) Epoch 8, batch 24200, loss[loss=0.2662, simple_loss=0.358, pruned_loss=0.08721, over 21614.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3281, pruned_loss=0.09019, over 4277772.44 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:57:24,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1426092.0, ans=0.07 2023-06-23 06:57:30,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-23 06:57:31,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1426092.0, ans=0.2 2023-06-23 06:58:14,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1426212.0, ans=0.0 2023-06-23 06:58:24,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1426272.0, ans=0.125 2023-06-23 06:58:25,332 INFO [train.py:996] (3/4) Epoch 8, batch 24250, loss[loss=0.206, simple_loss=0.2978, pruned_loss=0.05709, over 21457.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3247, pruned_loss=0.08363, over 4285825.26 frames. 
], batch size: 194, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:58:44,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.495e+02 7.277e+02 1.167e+03 2.451e+03, threshold=1.455e+03, percent-clipped=16.0 2023-06-23 06:58:46,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426332.0, ans=0.1 2023-06-23 06:58:53,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1426332.0, ans=0.0 2023-06-23 06:59:27,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-23 07:00:04,110 INFO [train.py:996] (3/4) Epoch 8, batch 24300, loss[loss=0.184, simple_loss=0.2647, pruned_loss=0.05164, over 21773.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3189, pruned_loss=0.07828, over 4284647.37 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:00:30,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1426632.0, ans=0.125 2023-06-23 07:01:47,229 INFO [train.py:996] (3/4) Epoch 8, batch 24350, loss[loss=0.2519, simple_loss=0.3183, pruned_loss=0.0927, over 21549.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3154, pruned_loss=0.07815, over 4286982.89 frames. ], batch size: 548, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:02:03,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.784e+02 6.670e+02 9.592e+02 1.817e+03, threshold=1.334e+03, percent-clipped=7.0 2023-06-23 07:02:25,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1426992.0, ans=0.125 2023-06-23 07:02:58,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1427052.0, ans=0.125 2023-06-23 07:03:27,452 INFO [train.py:996] (3/4) Epoch 8, batch 24400, loss[loss=0.2954, simple_loss=0.3575, pruned_loss=0.1167, over 21604.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3185, pruned_loss=0.08184, over 4291317.73 frames. ], batch size: 441, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:03:31,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1427172.0, ans=0.025 2023-06-23 07:03:43,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-23 07:03:50,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1427232.0, ans=0.2 2023-06-23 07:04:30,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1427352.0, ans=0.125 2023-06-23 07:04:42,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1427352.0, ans=0.125 2023-06-23 07:04:53,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. 
limit=22.5 2023-06-23 07:05:00,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1427412.0, ans=0.125 2023-06-23 07:05:07,060 INFO [train.py:996] (3/4) Epoch 8, batch 24450, loss[loss=0.3499, simple_loss=0.431, pruned_loss=0.1344, over 21445.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3206, pruned_loss=0.084, over 4280895.73 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:05:18,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1427472.0, ans=0.125 2023-06-23 07:05:23,008 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 5.464e+02 7.459e+02 1.124e+03 2.090e+03, threshold=1.492e+03, percent-clipped=14.0 2023-06-23 07:05:42,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-23 07:05:53,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1427592.0, ans=0.125 2023-06-23 07:06:41,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1427712.0, ans=0.0 2023-06-23 07:06:44,605 INFO [train.py:996] (3/4) Epoch 8, batch 24500, loss[loss=0.2887, simple_loss=0.341, pruned_loss=0.1182, over 21731.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3207, pruned_loss=0.08384, over 4289132.72 frames. ], batch size: 508, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:06:46,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1427772.0, ans=0.125 2023-06-23 07:06:48,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1427772.0, ans=0.125 2023-06-23 07:06:54,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1427772.0, ans=0.125 2023-06-23 07:08:24,393 INFO [train.py:996] (3/4) Epoch 8, batch 24550, loss[loss=0.2359, simple_loss=0.3069, pruned_loss=0.08244, over 21575.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3223, pruned_loss=0.08579, over 4286831.87 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:08:24,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428072.0, ans=0.1 2023-06-23 07:08:41,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1428072.0, ans=0.0 2023-06-23 07:08:49,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1428132.0, ans=0.0 2023-06-23 07:08:50,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.336e+02 4.677e+02 6.091e+02 7.782e+02 1.609e+03, threshold=1.218e+03, percent-clipped=3.0 2023-06-23 07:08:59,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1428132.0, ans=0.125 2023-06-23 07:09:20,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. 
limit=15.0 2023-06-23 07:09:52,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=22.5 2023-06-23 07:10:01,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1428372.0, ans=0.2 2023-06-23 07:10:02,236 INFO [train.py:996] (3/4) Epoch 8, batch 24600, loss[loss=0.2323, simple_loss=0.2868, pruned_loss=0.08887, over 21743.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3165, pruned_loss=0.0857, over 4289321.39 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:11:06,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1428492.0, ans=0.125 2023-06-23 07:11:14,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1428552.0, ans=0.125 2023-06-23 07:11:17,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-23 07:11:26,945 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:11:36,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1428612.0, ans=0.0 2023-06-23 07:11:40,891 INFO [train.py:996] (3/4) Epoch 8, batch 24650, loss[loss=0.2277, simple_loss=0.2798, pruned_loss=0.08779, over 21280.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3104, pruned_loss=0.08489, over 4277816.76 frames. ], batch size: 160, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:12:02,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1428672.0, ans=0.125 2023-06-23 07:12:02,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1428672.0, ans=0.2 2023-06-23 07:12:13,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.561e+02 8.132e+02 1.139e+03 1.963e+03, threshold=1.626e+03, percent-clipped=16.0 2023-06-23 07:12:23,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1428732.0, ans=0.125 2023-06-23 07:13:15,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1428912.0, ans=0.125 2023-06-23 07:13:19,835 INFO [train.py:996] (3/4) Epoch 8, batch 24700, loss[loss=0.2369, simple_loss=0.3035, pruned_loss=0.08519, over 21591.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3082, pruned_loss=0.08329, over 4276224.19 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:13:54,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. 
limit=22.5 2023-06-23 07:13:58,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429032.0, ans=0.1 2023-06-23 07:14:22,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429152.0, ans=0.1 2023-06-23 07:14:34,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1429152.0, ans=0.125 2023-06-23 07:14:36,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1429212.0, ans=0.07 2023-06-23 07:14:52,851 INFO [train.py:996] (3/4) Epoch 8, batch 24750, loss[loss=0.2146, simple_loss=0.2705, pruned_loss=0.07939, over 21440.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3017, pruned_loss=0.08008, over 4274704.07 frames. ], batch size: 212, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:14:58,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1429272.0, ans=0.2 2023-06-23 07:15:19,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.105e+02 4.833e+02 6.692e+02 9.106e+02 2.171e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-23 07:15:51,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1429392.0, ans=0.125 2023-06-23 07:16:28,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1429512.0, ans=0.125 2023-06-23 07:16:31,253 INFO [train.py:996] (3/4) Epoch 8, batch 24800, loss[loss=0.2077, simple_loss=0.2584, pruned_loss=0.07849, over 20767.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2971, pruned_loss=0.08046, over 4279077.95 frames. ], batch size: 609, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:16:44,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1429572.0, ans=0.0 2023-06-23 07:16:55,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-23 07:17:00,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 07:17:03,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1429632.0, ans=0.125 2023-06-23 07:17:25,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1429692.0, ans=0.0 2023-06-23 07:17:36,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1429752.0, ans=0.0 2023-06-23 07:17:47,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1429812.0, ans=0.125 2023-06-23 07:18:04,054 INFO [train.py:996] (3/4) Epoch 8, batch 24850, loss[loss=0.2284, simple_loss=0.3036, pruned_loss=0.07658, over 21820.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2974, pruned_loss=0.08141, over 4276917.22 frames. 
], batch size: 316, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:18:33,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.718e+02 6.141e+02 8.581e+02 1.389e+03, threshold=1.228e+03, percent-clipped=1.0 2023-06-23 07:19:20,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-23 07:19:44,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1430112.0, ans=0.125 2023-06-23 07:19:49,208 INFO [train.py:996] (3/4) Epoch 8, batch 24900, loss[loss=0.2371, simple_loss=0.3123, pruned_loss=0.08092, over 21492.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3019, pruned_loss=0.08304, over 4273574.11 frames. ], batch size: 194, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:20:50,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1430352.0, ans=0.125 2023-06-23 07:21:34,303 INFO [train.py:996] (3/4) Epoch 8, batch 24950, loss[loss=0.2832, simple_loss=0.3549, pruned_loss=0.1057, over 21522.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3105, pruned_loss=0.08723, over 4279951.77 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:21:56,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-23 07:22:03,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 4.703e+02 5.868e+02 8.505e+02 2.192e+03, threshold=1.174e+03, percent-clipped=6.0 2023-06-23 07:23:21,888 INFO [train.py:996] (3/4) Epoch 8, batch 25000, loss[loss=0.2337, simple_loss=0.306, pruned_loss=0.08071, over 21525.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.316, pruned_loss=0.08778, over 4281872.61 frames. ], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:24:04,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1430892.0, ans=0.125 2023-06-23 07:24:19,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1430952.0, ans=0.1 2023-06-23 07:24:53,447 INFO [train.py:996] (3/4) Epoch 8, batch 25050, loss[loss=0.2418, simple_loss=0.2803, pruned_loss=0.1016, over 21453.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3104, pruned_loss=0.08733, over 4268214.67 frames. 
], batch size: 510, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:24:53,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1431072.0, ans=0.0 2023-06-23 07:25:09,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1431072.0, ans=0.0 2023-06-23 07:25:17,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.487e+02 5.838e+02 7.912e+02 1.332e+03, threshold=1.168e+03, percent-clipped=3.0 2023-06-23 07:25:19,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431132.0, ans=0.1 2023-06-23 07:25:35,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1431192.0, ans=0.0 2023-06-23 07:25:38,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1431192.0, ans=0.125 2023-06-23 07:25:42,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 07:25:48,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1431252.0, ans=0.035 2023-06-23 07:26:10,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1431312.0, ans=0.0 2023-06-23 07:26:24,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1431312.0, ans=0.2 2023-06-23 07:26:33,612 INFO [train.py:996] (3/4) Epoch 8, batch 25100, loss[loss=0.2082, simple_loss=0.2897, pruned_loss=0.06337, over 21574.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3044, pruned_loss=0.08555, over 4276381.08 frames. ], batch size: 195, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:26:33,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1431372.0, ans=0.125 2023-06-23 07:26:35,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1431372.0, ans=0.2 2023-06-23 07:26:56,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1431432.0, ans=0.2 2023-06-23 07:27:26,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-23 07:27:57,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1431612.0, ans=0.125 2023-06-23 07:28:10,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1431672.0, ans=0.0 2023-06-23 07:28:11,755 INFO [train.py:996] (3/4) Epoch 8, batch 25150, loss[loss=0.2269, simple_loss=0.3133, pruned_loss=0.07026, over 21398.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3082, pruned_loss=0.08282, over 4267940.19 frames. 
], batch size: 211, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:28:23,227 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:28:34,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.013e+02 4.444e+02 6.518e+02 1.039e+03 2.142e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-23 07:28:49,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1431792.0, ans=0.2 2023-06-23 07:29:03,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1431852.0, ans=0.05 2023-06-23 07:29:13,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-23 07:29:48,547 INFO [train.py:996] (3/4) Epoch 8, batch 25200, loss[loss=0.2413, simple_loss=0.3326, pruned_loss=0.07498, over 21846.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3077, pruned_loss=0.08089, over 4267235.25 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:29:49,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-23 07:30:03,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1431972.0, ans=0.1 2023-06-23 07:30:24,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432092.0, ans=0.125 2023-06-23 07:30:30,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432092.0, ans=0.1 2023-06-23 07:30:36,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-23 07:30:57,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1432152.0, ans=0.125 2023-06-23 07:31:13,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1432212.0, ans=0.05 2023-06-23 07:31:26,048 INFO [train.py:996] (3/4) Epoch 8, batch 25250, loss[loss=0.2394, simple_loss=0.2946, pruned_loss=0.09211, over 21366.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3054, pruned_loss=0.08045, over 4255475.84 frames. 
], batch size: 144, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:31:32,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:31:46,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1432332.0, ans=0.125 2023-06-23 07:31:49,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.456e+02 5.347e+02 9.796e+02 2.256e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-23 07:32:13,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1432392.0, ans=0.125 2023-06-23 07:32:15,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1432392.0, ans=0.125 2023-06-23 07:32:16,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-23 07:32:51,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1432512.0, ans=0.125 2023-06-23 07:32:58,680 INFO [train.py:996] (3/4) Epoch 8, batch 25300, loss[loss=0.2601, simple_loss=0.3413, pruned_loss=0.08945, over 21346.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3036, pruned_loss=0.0797, over 4256044.14 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:33:20,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1432572.0, ans=0.0 2023-06-23 07:33:20,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-23 07:33:25,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-23 07:33:28,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1432632.0, ans=0.0 2023-06-23 07:33:35,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1432632.0, ans=0.0 2023-06-23 07:33:38,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1432632.0, ans=0.0 2023-06-23 07:34:06,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432752.0, ans=0.125 2023-06-23 07:34:07,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1432752.0, ans=0.0 2023-06-23 07:34:32,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432812.0, ans=0.1 2023-06-23 07:34:33,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1432812.0, ans=0.125 2023-06-23 07:34:37,958 INFO [train.py:996] (3/4) Epoch 8, batch 25350, loss[loss=0.1966, simple_loss=0.2893, pruned_loss=0.05199, over 21731.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3065, pruned_loss=0.07953, over 4248000.71 frames. 
], batch size: 351, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:35:02,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.301e+02 4.631e+02 6.587e+02 1.003e+03 1.652e+03, threshold=1.317e+03, percent-clipped=14.0 2023-06-23 07:35:04,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1432932.0, ans=0.125 2023-06-23 07:35:06,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432932.0, ans=0.1 2023-06-23 07:35:21,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-23 07:35:22,453 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:35:46,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1433052.0, ans=0.2 2023-06-23 07:35:48,582 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:36:14,594 INFO [train.py:996] (3/4) Epoch 8, batch 25400, loss[loss=0.2609, simple_loss=0.3256, pruned_loss=0.09816, over 21599.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3015, pruned_loss=0.07794, over 4251976.35 frames. ], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:36:18,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=22.5 2023-06-23 07:36:38,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1433232.0, ans=0.0 2023-06-23 07:36:41,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-23 07:37:01,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1433292.0, ans=0.125 2023-06-23 07:37:29,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1433412.0, ans=0.0 2023-06-23 07:37:45,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1433472.0, ans=0.0 2023-06-23 07:37:51,413 INFO [train.py:996] (3/4) Epoch 8, batch 25450, loss[loss=0.2179, simple_loss=0.3142, pruned_loss=0.06075, over 21324.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3015, pruned_loss=0.07966, over 4247276.91 frames. 
], batch size: 548, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:38:11,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433532.0, ans=0.1 2023-06-23 07:38:11,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1433532.0, ans=0.125 2023-06-23 07:38:11,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433532.0, ans=0.1 2023-06-23 07:38:17,195 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.125e+02 5.251e+02 6.939e+02 1.396e+03, threshold=1.050e+03, percent-clipped=1.0 2023-06-23 07:39:17,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1433712.0, ans=0.125 2023-06-23 07:39:32,038 INFO [train.py:996] (3/4) Epoch 8, batch 25500, loss[loss=0.2526, simple_loss=0.3297, pruned_loss=0.0878, over 21713.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3014, pruned_loss=0.07591, over 4242496.91 frames. ], batch size: 298, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:39:49,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1433772.0, ans=0.125 2023-06-23 07:40:33,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1433952.0, ans=0.1 2023-06-23 07:41:09,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1434072.0, ans=0.0 2023-06-23 07:41:11,147 INFO [train.py:996] (3/4) Epoch 8, batch 25550, loss[loss=0.2075, simple_loss=0.3101, pruned_loss=0.05249, over 21629.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3081, pruned_loss=0.07622, over 4237843.67 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:41:27,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1434072.0, ans=0.0 2023-06-23 07:41:38,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.786e+02 4.210e+02 5.304e+02 7.417e+02 2.336e+03, threshold=1.061e+03, percent-clipped=9.0 2023-06-23 07:41:50,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-23 07:41:53,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1434192.0, ans=0.125 2023-06-23 07:42:55,550 INFO [train.py:996] (3/4) Epoch 8, batch 25600, loss[loss=0.255, simple_loss=0.3254, pruned_loss=0.09225, over 21782.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3113, pruned_loss=0.07692, over 4235509.67 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:43:25,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1434492.0, ans=0.125 2023-06-23 07:43:37,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=15.0 2023-06-23 07:43:38,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1434492.0, ans=0.125 2023-06-23 07:43:40,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. limit=10.0 2023-06-23 07:44:31,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0 2023-06-23 07:44:33,947 INFO [train.py:996] (3/4) Epoch 8, batch 25650, loss[loss=0.2361, simple_loss=0.2968, pruned_loss=0.08768, over 21825.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3119, pruned_loss=0.07905, over 4237096.68 frames. ], batch size: 107, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:44:43,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1434672.0, ans=0.125 2023-06-23 07:44:45,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-23 07:44:51,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1434732.0, ans=0.0 2023-06-23 07:44:55,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.690e+02 5.926e+02 8.067e+02 1.090e+03 2.033e+03, threshold=1.613e+03, percent-clipped=28.0 2023-06-23 07:45:10,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1434792.0, ans=0.0 2023-06-23 07:45:17,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434792.0, ans=0.1 2023-06-23 07:45:43,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1434852.0, ans=0.0 2023-06-23 07:45:51,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1434912.0, ans=0.04949747468305833 2023-06-23 07:46:11,853 INFO [train.py:996] (3/4) Epoch 8, batch 25700, loss[loss=0.2124, simple_loss=0.2907, pruned_loss=0.06707, over 21787.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3088, pruned_loss=0.08042, over 4240602.00 frames. ], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:47:09,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435092.0, ans=0.1 2023-06-23 07:47:29,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-23 07:47:44,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1435212.0, ans=0.0 2023-06-23 07:47:46,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1435212.0, ans=0.0 2023-06-23 07:47:52,531 INFO [train.py:996] (3/4) Epoch 8, batch 25750, loss[loss=0.2934, simple_loss=0.3535, pruned_loss=0.1166, over 21420.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.315, pruned_loss=0.08471, over 4250275.10 frames. 
], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:48:15,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1435332.0, ans=0.0 2023-06-23 07:48:25,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 5.092e+02 6.488e+02 8.589e+02 2.442e+03, threshold=1.298e+03, percent-clipped=2.0 2023-06-23 07:48:29,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1435332.0, ans=0.2 2023-06-23 07:48:50,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1435392.0, ans=0.0 2023-06-23 07:48:51,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1435392.0, ans=0.0 2023-06-23 07:49:01,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-23 07:49:21,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435512.0, ans=0.1 2023-06-23 07:49:38,473 INFO [train.py:996] (3/4) Epoch 8, batch 25800, loss[loss=0.2581, simple_loss=0.3308, pruned_loss=0.09275, over 21582.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3265, pruned_loss=0.08922, over 4255249.90 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:50:02,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1435632.0, ans=0.0 2023-06-23 07:50:02,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435632.0, ans=0.1 2023-06-23 07:50:09,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-23 07:50:27,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1435692.0, ans=0.0 2023-06-23 07:50:47,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1435752.0, ans=0.0 2023-06-23 07:51:22,163 INFO [train.py:996] (3/4) Epoch 8, batch 25850, loss[loss=0.2364, simple_loss=0.3006, pruned_loss=0.08614, over 21452.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3279, pruned_loss=0.08912, over 4265680.71 frames. 
], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:51:42,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435932.0, ans=0.1 2023-06-23 07:51:45,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.179e+02 4.988e+02 6.409e+02 1.000e+03 3.081e+03, threshold=1.282e+03, percent-clipped=14.0 2023-06-23 07:52:04,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1435992.0, ans=0.0 2023-06-23 07:52:21,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436052.0, ans=0.1 2023-06-23 07:52:27,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1436052.0, ans=0.125 2023-06-23 07:52:36,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1436112.0, ans=0.125 2023-06-23 07:52:52,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-23 07:52:54,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-23 07:52:57,079 INFO [train.py:996] (3/4) Epoch 8, batch 25900, loss[loss=0.2422, simple_loss=0.3339, pruned_loss=0.07529, over 21571.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3291, pruned_loss=0.08971, over 4274957.22 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:52:57,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1436172.0, ans=0.125 2023-06-23 07:53:02,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1436172.0, ans=0.0 2023-06-23 07:53:10,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-23 07:53:21,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1436232.0, ans=0.125 2023-06-23 07:53:49,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1436352.0, ans=0.04949747468305833 2023-06-23 07:54:27,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1436412.0, ans=0.125 2023-06-23 07:54:36,056 INFO [train.py:996] (3/4) Epoch 8, batch 25950, loss[loss=0.2593, simple_loss=0.3358, pruned_loss=0.09145, over 21583.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3354, pruned_loss=0.0923, over 4274142.62 frames. 
], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:54:52,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1436472.0, ans=0.125 2023-06-23 07:54:54,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1436472.0, ans=0.2 2023-06-23 07:55:03,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 4.822e+02 6.504e+02 9.167e+02 2.432e+03, threshold=1.301e+03, percent-clipped=14.0 2023-06-23 07:56:06,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1436712.0, ans=0.1 2023-06-23 07:56:14,687 INFO [train.py:996] (3/4) Epoch 8, batch 26000, loss[loss=0.3248, simple_loss=0.3823, pruned_loss=0.1337, over 21731.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3339, pruned_loss=0.09001, over 4273017.52 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:57:36,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437012.0, ans=0.1 2023-06-23 07:57:57,840 INFO [train.py:996] (3/4) Epoch 8, batch 26050, loss[loss=0.2155, simple_loss=0.2873, pruned_loss=0.0718, over 21861.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.333, pruned_loss=0.09096, over 4271134.13 frames. ], batch size: 118, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:58:00,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-23 07:58:09,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-23 07:58:12,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1437132.0, ans=0.2 2023-06-23 07:58:12,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-23 07:58:19,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 4.589e+02 6.004e+02 7.871e+02 1.709e+03, threshold=1.201e+03, percent-clipped=5.0 2023-06-23 07:58:25,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1437132.0, ans=0.125 2023-06-23 07:58:31,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1437192.0, ans=0.05 2023-06-23 07:58:55,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1437252.0, ans=0.125 2023-06-23 07:59:36,457 INFO [train.py:996] (3/4) Epoch 8, batch 26100, loss[loss=0.2713, simple_loss=0.319, pruned_loss=0.1118, over 21823.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3285, pruned_loss=0.09072, over 4278592.18 frames. ], batch size: 508, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:59:58,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=22.5 2023-06-23 08:00:23,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1437492.0, ans=0.0 2023-06-23 08:00:23,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1437492.0, ans=0.125 2023-06-23 08:00:45,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1437552.0, ans=0.0 2023-06-23 08:01:16,852 INFO [train.py:996] (3/4) Epoch 8, batch 26150, loss[loss=0.2441, simple_loss=0.3123, pruned_loss=0.08791, over 21658.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3259, pruned_loss=0.08969, over 4279563.20 frames. ], batch size: 230, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:01:45,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 4.992e+02 6.219e+02 9.688e+02 1.983e+03, threshold=1.244e+03, percent-clipped=15.0 2023-06-23 08:02:35,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1437912.0, ans=0.125 2023-06-23 08:02:49,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1437912.0, ans=0.125 2023-06-23 08:02:55,503 INFO [train.py:996] (3/4) Epoch 8, batch 26200, loss[loss=0.2395, simple_loss=0.3246, pruned_loss=0.07724, over 21253.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3262, pruned_loss=0.08707, over 4282068.79 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:04:34,802 INFO [train.py:996] (3/4) Epoch 8, batch 26250, loss[loss=0.2663, simple_loss=0.3427, pruned_loss=0.09493, over 21589.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3307, pruned_loss=0.08636, over 4287554.61 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:05:07,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 4.875e+02 6.519e+02 1.074e+03 2.423e+03, threshold=1.304e+03, percent-clipped=19.0 2023-06-23 08:05:47,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1438452.0, ans=0.0 2023-06-23 08:05:58,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1438512.0, ans=0.1 2023-06-23 08:06:12,246 INFO [train.py:996] (3/4) Epoch 8, batch 26300, loss[loss=0.2605, simple_loss=0.3261, pruned_loss=0.09748, over 21335.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3276, pruned_loss=0.08695, over 4292706.20 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:06:52,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438632.0, ans=0.1 2023-06-23 08:07:05,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1438692.0, ans=0.2 2023-06-23 08:07:50,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1438812.0, ans=0.0 2023-06-23 08:08:01,838 INFO [train.py:996] (3/4) Epoch 8, batch 26350, loss[loss=0.2946, simple_loss=0.3587, pruned_loss=0.1153, over 21437.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3266, pruned_loss=0.08832, over 4293466.91 frames. 
], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:08:30,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 4.985e+02 6.232e+02 7.669e+02 1.189e+03, threshold=1.246e+03, percent-clipped=0.0 2023-06-23 08:09:12,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439052.0, ans=0.1 2023-06-23 08:09:15,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1439112.0, ans=0.125 2023-06-23 08:09:40,175 INFO [train.py:996] (3/4) Epoch 8, batch 26400, loss[loss=0.2235, simple_loss=0.2849, pruned_loss=0.08101, over 21779.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3205, pruned_loss=0.08817, over 4289719.02 frames. ], batch size: 317, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:09:45,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1439172.0, ans=0.125 2023-06-23 08:10:07,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1439232.0, ans=0.05 2023-06-23 08:10:44,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-23 08:11:20,557 INFO [train.py:996] (3/4) Epoch 8, batch 26450, loss[loss=0.2607, simple_loss=0.3568, pruned_loss=0.08227, over 21674.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3181, pruned_loss=0.0873, over 4283298.51 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:11:33,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-23 08:11:41,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1439532.0, ans=0.125 2023-06-23 08:11:50,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 6.327e+02 8.779e+02 1.313e+03 2.472e+03, threshold=1.756e+03, percent-clipped=25.0 2023-06-23 08:12:37,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1439652.0, ans=0.0 2023-06-23 08:13:01,267 INFO [train.py:996] (3/4) Epoch 8, batch 26500, loss[loss=0.2438, simple_loss=0.3245, pruned_loss=0.08158, over 21833.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.321, pruned_loss=0.08589, over 4278198.63 frames. ], batch size: 316, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:13:06,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439772.0, ans=0.1 2023-06-23 08:13:29,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1439832.0, ans=0.0 2023-06-23 08:14:39,342 INFO [train.py:996] (3/4) Epoch 8, batch 26550, loss[loss=0.17, simple_loss=0.2317, pruned_loss=0.05415, over 21289.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3177, pruned_loss=0.08297, over 4270660.20 frames. 
], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:15:12,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440132.0, ans=0.1 2023-06-23 08:15:20,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 5.263e+02 8.028e+02 1.102e+03 2.204e+03, threshold=1.606e+03, percent-clipped=5.0 2023-06-23 08:15:23,844 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:15:23,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1440132.0, ans=0.125 2023-06-23 08:15:48,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1440252.0, ans=0.025 2023-06-23 08:15:54,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. limit=10.0 2023-06-23 08:16:23,283 INFO [train.py:996] (3/4) Epoch 8, batch 26600, loss[loss=0.2117, simple_loss=0.283, pruned_loss=0.07015, over 21393.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3169, pruned_loss=0.08016, over 4269164.28 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:16:25,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1440372.0, ans=0.0 2023-06-23 08:16:54,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440432.0, ans=0.1 2023-06-23 08:17:11,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1440492.0, ans=0.2 2023-06-23 08:17:32,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2023-06-23 08:18:02,012 INFO [train.py:996] (3/4) Epoch 8, batch 26650, loss[loss=0.1738, simple_loss=0.2603, pruned_loss=0.04362, over 21695.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3097, pruned_loss=0.07879, over 4271001.36 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:18:36,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.294e+02 5.616e+02 7.721e+02 1.631e+03, threshold=1.123e+03, percent-clipped=1.0 2023-06-23 08:19:00,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1440792.0, ans=0.1 2023-06-23 08:19:20,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440912.0, ans=0.1 2023-06-23 08:19:39,922 INFO [train.py:996] (3/4) Epoch 8, batch 26700, loss[loss=0.2606, simple_loss=0.3228, pruned_loss=0.09919, over 21914.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3034, pruned_loss=0.07673, over 4262771.30 frames. 
], batch size: 316, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:19:58,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440972.0, ans=0.1 2023-06-23 08:20:06,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1441032.0, ans=0.0 2023-06-23 08:20:24,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1441032.0, ans=0.125 2023-06-23 08:20:40,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1441092.0, ans=0.0 2023-06-23 08:21:25,361 INFO [train.py:996] (3/4) Epoch 8, batch 26750, loss[loss=0.3083, simple_loss=0.3775, pruned_loss=0.1195, over 21410.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3035, pruned_loss=0.07569, over 4268999.87 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:21:56,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 4.314e+02 5.876e+02 8.992e+02 1.662e+03, threshold=1.175e+03, percent-clipped=13.0 2023-06-23 08:22:11,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1441392.0, ans=0.125 2023-06-23 08:23:11,021 INFO [train.py:996] (3/4) Epoch 8, batch 26800, loss[loss=0.2895, simple_loss=0.3673, pruned_loss=0.1058, over 21406.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3124, pruned_loss=0.08122, over 4278869.57 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:23:14,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1441572.0, ans=0.125 2023-06-23 08:23:37,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1441632.0, ans=0.0 2023-06-23 08:23:43,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1441692.0, ans=0.0 2023-06-23 08:23:58,631 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:24:02,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=15.0 2023-06-23 08:24:03,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1441752.0, ans=0.125 2023-06-23 08:24:25,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1441752.0, ans=10.0 2023-06-23 08:24:48,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1441872.0, ans=0.0 2023-06-23 08:24:49,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-23 08:24:49,789 INFO [train.py:996] (3/4) Epoch 8, batch 26850, loss[loss=0.2172, simple_loss=0.2737, pruned_loss=0.08036, over 20628.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3137, pruned_loss=0.08384, over 4274249.10 frames. 
], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:25:02,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1441872.0, ans=0.1 2023-06-23 08:25:05,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1441932.0, ans=0.125 2023-06-23 08:25:14,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 5.050e+02 6.196e+02 9.210e+02 1.737e+03, threshold=1.239e+03, percent-clipped=8.0 2023-06-23 08:26:18,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1442112.0, ans=0.125 2023-06-23 08:26:18,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1442112.0, ans=0.2 2023-06-23 08:26:22,981 INFO [train.py:996] (3/4) Epoch 8, batch 26900, loss[loss=0.2013, simple_loss=0.2677, pruned_loss=0.06748, over 21540.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3062, pruned_loss=0.08353, over 4272830.00 frames. ], batch size: 132, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:26:33,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1442172.0, ans=10.0 2023-06-23 08:27:02,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1442292.0, ans=0.04949747468305833 2023-06-23 08:27:41,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1442352.0, ans=0.125 2023-06-23 08:27:54,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1442412.0, ans=0.125 2023-06-23 08:28:02,543 INFO [train.py:996] (3/4) Epoch 8, batch 26950, loss[loss=0.2425, simple_loss=0.3151, pruned_loss=0.08495, over 21263.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3049, pruned_loss=0.08289, over 4274186.56 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:28:21,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1442532.0, ans=0.125 2023-06-23 08:28:33,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 4.817e+02 6.890e+02 1.132e+03 2.322e+03, threshold=1.378e+03, percent-clipped=18.0 2023-06-23 08:29:22,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1442652.0, ans=0.125 2023-06-23 08:29:29,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1442712.0, ans=0.125 2023-06-23 08:29:46,676 INFO [train.py:996] (3/4) Epoch 8, batch 27000, loss[loss=0.2297, simple_loss=0.3247, pruned_loss=0.06729, over 21599.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3055, pruned_loss=0.08063, over 4270573.08 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:29:46,677 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 08:30:02,863 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2419, simple_loss=0.3397, pruned_loss=0.07206, over 1796401.00 frames. 
2023-06-23 08:30:02,863 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 08:30:04,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1442772.0, ans=0.0 2023-06-23 08:31:01,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1442892.0, ans=0.125 2023-06-23 08:31:32,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1443012.0, ans=0.1 2023-06-23 08:31:42,022 INFO [train.py:996] (3/4) Epoch 8, batch 27050, loss[loss=0.2183, simple_loss=0.3249, pruned_loss=0.05582, over 21586.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3064, pruned_loss=0.07663, over 4268456.62 frames. ], batch size: 263, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:32:18,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.261e+02 5.762e+02 7.370e+02 1.710e+03, threshold=1.152e+03, percent-clipped=3.0 2023-06-23 08:32:38,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1443192.0, ans=0.125 2023-06-23 08:33:20,864 INFO [train.py:996] (3/4) Epoch 8, batch 27100, loss[loss=0.2082, simple_loss=0.3105, pruned_loss=0.05298, over 20952.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3093, pruned_loss=0.07787, over 4274632.40 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:33:59,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1443432.0, ans=0.125 2023-06-23 08:34:38,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1443552.0, ans=0.0 2023-06-23 08:34:49,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1443612.0, ans=0.1 2023-06-23 08:34:49,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1443612.0, ans=0.0 2023-06-23 08:35:01,717 INFO [train.py:996] (3/4) Epoch 8, batch 27150, loss[loss=0.2261, simple_loss=0.3166, pruned_loss=0.0678, over 20990.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3204, pruned_loss=0.08125, over 4269929.05 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:35:43,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.713e+02 7.787e+02 1.225e+03 2.393e+03, threshold=1.557e+03, percent-clipped=28.0 2023-06-23 08:35:43,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1443732.0, ans=0.0 2023-06-23 08:35:45,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1443732.0, ans=0.125 2023-06-23 08:36:09,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1443852.0, ans=0.0 2023-06-23 08:36:27,684 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:36:46,638 INFO [train.py:996] (3/4) Epoch 8, batch 27200, loss[loss=0.2713, simple_loss=0.3446, pruned_loss=0.09904, over 21527.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3281, pruned_loss=0.08461, over 4276354.33 frames. 
], batch size: 194, lr: 3.64e-03, grad_scale: 32.0
2023-06-23 08:36:53,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1443972.0, ans=0.1
2023-06-23 08:37:53,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1444152.0, ans=0.0
2023-06-23 08:37:59,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1444152.0, ans=0.125
2023-06-23 08:38:01,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1444152.0, ans=0.125
2023-06-23 08:38:36,507 INFO [train.py:996] (3/4) Epoch 8, batch 27250, loss[loss=0.2635, simple_loss=0.332, pruned_loss=0.09749, over 21382.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3315, pruned_loss=0.08847, over 4274971.60 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 32.0
2023-06-23 08:38:58,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1444332.0, ans=0.0
2023-06-23 08:39:08,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0
2023-06-23 08:39:09,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.510e+02 6.974e+02 9.879e+02 1.721e+03, threshold=1.395e+03, percent-clipped=1.0
2023-06-23 08:39:10,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0
2023-06-23 08:40:00,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1444512.0, ans=0.0
2023-06-23 08:40:11,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1444512.0, ans=0.2
2023-06-23 08:40:17,634 INFO [train.py:996] (3/4) Epoch 8, batch 27300, loss[loss=0.2526, simple_loss=0.3482, pruned_loss=0.07854, over 21722.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3327, pruned_loss=0.08905, over 4272906.47 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:41:17,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.00 vs. limit=22.5
2023-06-23 08:41:25,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1444752.0, ans=0.125
2023-06-23 08:41:40,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1444812.0, ans=0.2
2023-06-23 08:41:57,257 INFO [train.py:996] (3/4) Epoch 8, batch 27350, loss[loss=0.2998, simple_loss=0.3644, pruned_loss=0.1177, over 21527.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.336, pruned_loss=0.08988, over 4275866.21 frames. ], batch size: 507, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:42:05,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1444872.0, ans=0.5
2023-06-23 08:42:28,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.684e+02 5.886e+02 7.664e+02 1.698e+03, threshold=1.177e+03, percent-clipped=3.0
2023-06-23 08:42:31,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444992.0, ans=0.1
2023-06-23 08:42:54,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1444992.0, ans=0.125
2023-06-23 08:43:08,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1445052.0, ans=0.125
2023-06-23 08:43:31,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1445112.0, ans=0.0
2023-06-23 08:43:40,292 INFO [train.py:996] (3/4) Epoch 8, batch 27400, loss[loss=0.2275, simple_loss=0.2958, pruned_loss=0.0796, over 21532.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3306, pruned_loss=0.08912, over 4281013.10 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:43:41,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0
2023-06-23 08:44:39,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0
2023-06-23 08:44:51,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1445352.0, ans=0.125
2023-06-23 08:45:19,300 INFO [train.py:996] (3/4) Epoch 8, batch 27450, loss[loss=0.235, simple_loss=0.3169, pruned_loss=0.07655, over 21768.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3249, pruned_loss=0.08787, over 4276404.94 frames. ], batch size: 282, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:45:50,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.374e+02 5.145e+02 6.858e+02 8.934e+02 1.227e+03, threshold=1.372e+03, percent-clipped=2.0
2023-06-23 08:46:55,690 INFO [train.py:996] (3/4) Epoch 8, batch 27500, loss[loss=0.2546, simple_loss=0.3164, pruned_loss=0.09646, over 21511.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3248, pruned_loss=0.08888, over 4281548.55 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:47:56,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445952.0, ans=0.1
2023-06-23 08:48:03,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0
2023-06-23 08:48:12,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1446012.0, ans=0.125
2023-06-23 08:48:17,299 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 08:48:31,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1446012.0, ans=0.125
2023-06-23 08:48:34,500 INFO [train.py:996] (3/4) Epoch 8, batch 27550, loss[loss=0.182, simple_loss=0.2534, pruned_loss=0.05535, over 21491.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3174, pruned_loss=0.08468, over 4279346.62 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:49:06,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.176e+02 5.018e+02 7.145e+02 2.103e+03, threshold=1.004e+03, percent-clipped=5.0
2023-06-23 08:49:32,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0
2023-06-23 08:49:34,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1446252.0, ans=0.0
2023-06-23 08:50:07,962 INFO [train.py:996] (3/4) Epoch 8, batch 27600, loss[loss=0.2195, simple_loss=0.2869, pruned_loss=0.07602, over 21365.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3094, pruned_loss=0.08306, over 4273971.03 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:50:19,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1446372.0, ans=0.1
2023-06-23 08:51:45,610 INFO [train.py:996] (3/4) Epoch 8, batch 27650, loss[loss=0.213, simple_loss=0.2822, pruned_loss=0.07188, over 21624.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.304, pruned_loss=0.08276, over 4272752.71 frames. ], batch size: 263, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 08:52:19,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.399e+02 4.844e+02 6.403e+02 8.598e+02 1.573e+03, threshold=1.281e+03, percent-clipped=18.0
2023-06-23 08:52:31,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1446792.0, ans=0.125
2023-06-23 08:53:01,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1446912.0, ans=0.125
2023-06-23 08:53:11,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446912.0, ans=0.1
2023-06-23 08:53:21,689 INFO [train.py:996] (3/4) Epoch 8, batch 27700, loss[loss=0.2787, simple_loss=0.3713, pruned_loss=0.09301, over 20886.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3048, pruned_loss=0.08099, over 4269776.11 frames. ], batch size: 608, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 08:53:28,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1446972.0, ans=0.1
2023-06-23 08:53:55,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.31 vs. limit=10.0
2023-06-23 08:54:29,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1447152.0, ans=0.125
2023-06-23 08:54:30,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447152.0, ans=0.1
2023-06-23 08:54:44,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447212.0, ans=0.1
2023-06-23 08:54:56,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5
2023-06-23 08:55:00,509 INFO [train.py:996] (3/4) Epoch 8, batch 27750, loss[loss=0.1992, simple_loss=0.2813, pruned_loss=0.05857, over 21478.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3089, pruned_loss=0.08123, over 4271341.24 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 08:55:09,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1447272.0, ans=0.0
2023-06-23 08:55:31,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1447332.0, ans=0.1
2023-06-23 08:55:32,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 5.055e+02 6.711e+02 8.629e+02 1.749e+03, threshold=1.342e+03, percent-clipped=9.0
2023-06-23 08:55:33,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0
2023-06-23 08:55:49,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1447392.0, ans=0.0
2023-06-23 08:55:56,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1447392.0, ans=0.125
2023-06-23 08:56:07,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1447452.0, ans=0.125
2023-06-23 08:56:10,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1447452.0, ans=0.05
2023-06-23 08:56:24,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0
2023-06-23 08:56:25,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447512.0, ans=0.1
2023-06-23 08:56:33,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1447512.0, ans=0.125
2023-06-23 08:56:35,724 INFO [train.py:996] (3/4) Epoch 8, batch 27800, loss[loss=0.2168, simple_loss=0.2841, pruned_loss=0.07469, over 21399.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3084, pruned_loss=0.08164, over 4284668.76 frames. ], batch size: 176, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 08:56:39,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1447572.0, ans=0.125
2023-06-23 08:57:08,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1447632.0, ans=0.5
2023-06-23 08:57:21,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1447692.0, ans=0.035
2023-06-23 08:57:27,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1447692.0, ans=0.025
2023-06-23 08:57:27,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1447692.0, ans=0.125
2023-06-23 08:57:42,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1447752.0, ans=0.5
2023-06-23 08:58:15,605 INFO [train.py:996] (3/4) Epoch 8, batch 27850, loss[loss=0.2416, simple_loss=0.318, pruned_loss=0.08259, over 21824.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3084, pruned_loss=0.08308, over 4287169.31 frames. ], batch size: 112, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 08:58:50,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.347e+02 5.210e+02 6.936e+02 1.592e+03, threshold=1.042e+03, percent-clipped=2.0
2023-06-23 08:58:51,355 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 08:59:47,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0
2023-06-23 08:59:57,465 INFO [train.py:996] (3/4) Epoch 8, batch 27900, loss[loss=0.2466, simple_loss=0.3479, pruned_loss=0.07265, over 21698.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3198, pruned_loss=0.0849, over 4289614.34 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 08:59:58,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0
2023-06-23 09:00:01,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1448172.0, ans=0.125
2023-06-23 09:00:03,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1448172.0, ans=0.07
2023-06-23 09:01:06,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1448352.0, ans=0.1
2023-06-23 09:01:20,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5
2023-06-23 09:01:21,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1448412.0, ans=0.0
2023-06-23 09:01:26,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0
2023-06-23 09:01:34,329 INFO [train.py:996] (3/4) Epoch 8, batch 27950, loss[loss=0.2398, simple_loss=0.3363, pruned_loss=0.07161, over 21605.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3188, pruned_loss=0.08108, over 4288874.98 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 09:01:58,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1448532.0, ans=0.125
2023-06-23 09:02:08,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.594e+02 6.671e+02 9.534e+02 1.876e+03, threshold=1.334e+03, percent-clipped=19.0
2023-06-23 09:02:29,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1448592.0, ans=0.1
2023-06-23 09:02:52,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1448712.0, ans=0.125
2023-06-23 09:02:53,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1448712.0, ans=0.0
2023-06-23 09:03:00,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1448712.0, ans=0.0
2023-06-23 09:03:07,995 INFO [train.py:996] (3/4) Epoch 8, batch 28000, loss[loss=0.2286, simple_loss=0.2937, pruned_loss=0.08174, over 21476.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3157, pruned_loss=0.07847, over 4294189.79 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 09:04:16,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=22.5
2023-06-23 09:04:29,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1449012.0, ans=0.125
2023-06-23 09:04:52,708 INFO [train.py:996] (3/4) Epoch 8, batch 28050, loss[loss=0.2752, simple_loss=0.3323, pruned_loss=0.1091, over 20007.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3127, pruned_loss=0.08002, over 4289619.88 frames. ], batch size: 702, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 09:05:02,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1449072.0, ans=0.125
2023-06-23 09:05:26,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.954e+02 6.052e+02 8.048e+02 2.120e+03, threshold=1.210e+03, percent-clipped=2.0
2023-06-23 09:05:46,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0
2023-06-23 09:06:27,340 INFO [train.py:996] (3/4) Epoch 8, batch 28100, loss[loss=0.2176, simple_loss=0.289, pruned_loss=0.07308, over 21159.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3094, pruned_loss=0.07992, over 4285860.62 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 09:06:58,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1449432.0, ans=0.125
2023-06-23 09:07:47,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1449612.0, ans=0.125
2023-06-23 09:07:54,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5
2023-06-23 09:08:06,360 INFO [train.py:996] (3/4) Epoch 8, batch 28150, loss[loss=0.2245, simple_loss=0.2825, pruned_loss=0.08324, over 21580.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3032, pruned_loss=0.07936, over 4282315.03 frames. ], batch size: 415, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 09:08:08,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1449672.0, ans=0.0
2023-06-23 09:08:39,164 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.985e+02 7.502e+02 1.116e+03 2.390e+03, threshold=1.500e+03, percent-clipped=18.0
2023-06-23 09:09:44,395 INFO [train.py:996] (3/4) Epoch 8, batch 28200, loss[loss=0.2505, simple_loss=0.3101, pruned_loss=0.09545, over 21396.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3011, pruned_loss=0.08124, over 4282449.59 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 32.0
2023-06-23 09:10:44,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1450152.0, ans=0.95
2023-06-23 09:11:13,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1450212.0, ans=0.2
2023-06-23 09:11:27,079 INFO [train.py:996] (3/4) Epoch 8, batch 28250, loss[loss=0.2251, simple_loss=0.2939, pruned_loss=0.07817, over 16570.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3064, pruned_loss=0.08373, over 4263868.58 frames. ], batch size: 60, lr: 3.63e-03, grad_scale: 8.0
2023-06-23 09:11:29,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1450272.0, ans=0.2
2023-06-23 09:11:32,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1450272.0, ans=0.125
2023-06-23 09:12:04,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 5.319e+02 7.100e+02 8.712e+02 1.908e+03, threshold=1.420e+03, percent-clipped=3.0
2023-06-23 09:12:17,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1450392.0, ans=0.025
2023-06-23 09:12:46,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1450512.0, ans=0.125
2023-06-23 09:13:06,465 INFO [train.py:996] (3/4) Epoch 8, batch 28300, loss[loss=0.2092, simple_loss=0.2812, pruned_loss=0.06857, over 21258.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3061, pruned_loss=0.08184, over 4258120.70 frames. ], batch size: 160, lr: 3.63e-03, grad_scale: 8.0
2023-06-23 09:13:54,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1450692.0, ans=0.125
2023-06-23 09:14:44,998 INFO [train.py:996] (3/4) Epoch 8, batch 28350, loss[loss=0.2169, simple_loss=0.3161, pruned_loss=0.05886, over 21746.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3024, pruned_loss=0.07551, over 4266896.63 frames. ], batch size: 332, lr: 3.63e-03, grad_scale: 8.0
2023-06-23 09:14:58,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1450872.0, ans=10.0
2023-06-23 09:15:04,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1450932.0, ans=0.0
2023-06-23 09:15:21,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.828e+02 5.599e+02 8.860e+02 1.294e+03 2.489e+03, threshold=1.772e+03, percent-clipped=23.0
2023-06-23 09:15:31,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0
2023-06-23 09:15:33,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1450992.0, ans=0.0
2023-06-23 09:16:09,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451112.0, ans=0.1
2023-06-23 09:16:23,414 INFO [train.py:996] (3/4) Epoch 8, batch 28400, loss[loss=0.2531, simple_loss=0.3123, pruned_loss=0.09696, over 21198.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2996, pruned_loss=0.07571, over 4264814.18 frames. ], batch size: 143, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 09:17:03,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1451232.0, ans=0.125
2023-06-23 09:18:03,287 INFO [train.py:996] (3/4) Epoch 8, batch 28450, loss[loss=0.2331, simple_loss=0.2976, pruned_loss=0.08429, over 21638.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3056, pruned_loss=0.07966, over 4259539.02 frames. ], batch size: 230, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 09:18:18,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1451532.0, ans=0.2
2023-06-23 09:18:41,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.622e+02 7.842e+02 1.295e+03 2.358e+03, threshold=1.568e+03, percent-clipped=7.0
2023-06-23 09:18:42,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1451592.0, ans=0.025
2023-06-23 09:19:22,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1451712.0, ans=0.0
2023-06-23 09:19:39,541 INFO [train.py:996] (3/4) Epoch 8, batch 28500, loss[loss=0.2471, simple_loss=0.3119, pruned_loss=0.09116, over 21922.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3069, pruned_loss=0.08135, over 4265813.66 frames. ], batch size: 351, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 09:20:46,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1451952.0, ans=0.125
2023-06-23 09:21:16,152 INFO [train.py:996] (3/4) Epoch 8, batch 28550, loss[loss=0.3601, simple_loss=0.438, pruned_loss=0.1411, over 21532.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3155, pruned_loss=0.08456, over 4270893.30 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0
2023-06-23 09:22:02,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.502e+02 4.663e+02 6.262e+02 9.623e+02 1.798e+03, threshold=1.252e+03, percent-clipped=1.0
2023-06-23 09:22:19,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0
2023-06-23 09:23:01,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0
2023-06-23 09:23:02,374 INFO [train.py:996] (3/4) Epoch 8, batch 28600, loss[loss=0.2712, simple_loss=0.3481, pruned_loss=0.09718, over 21366.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3213, pruned_loss=0.08602, over 4274667.89 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:23:06,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0
2023-06-23 09:23:43,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1452492.0, ans=0.0
2023-06-23 09:24:41,369 INFO [train.py:996] (3/4) Epoch 8, batch 28650, loss[loss=0.2412, simple_loss=0.2931, pruned_loss=0.09463, over 21208.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3168, pruned_loss=0.08655, over 4270332.77 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:24:57,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1452672.0, ans=0.035
2023-06-23 09:25:23,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 4.493e+02 5.758e+02 7.930e+02 1.580e+03, threshold=1.152e+03, percent-clipped=4.0
2023-06-23 09:25:30,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1452792.0, ans=0.0
2023-06-23 09:25:30,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0
2023-06-23 09:25:44,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452852.0, ans=0.1
2023-06-23 09:25:51,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1452852.0, ans=0.09899494936611666
2023-06-23 09:26:05,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1452912.0, ans=0.125
2023-06-23 09:26:13,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0
2023-06-23 09:26:26,173 INFO [train.py:996] (3/4) Epoch 8, batch 28700, loss[loss=0.2375, simple_loss=0.3099, pruned_loss=0.08261, over 21635.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3165, pruned_loss=0.0883, over 4267695.64 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:26:29,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1452972.0, ans=0.0
2023-06-23 09:26:56,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1453032.0, ans=0.125
2023-06-23 09:27:05,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1453092.0, ans=0.125
2023-06-23 09:27:18,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1453152.0, ans=0.125
2023-06-23 09:27:22,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1453152.0, ans=0.0
2023-06-23 09:27:47,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5
2023-06-23 09:28:05,309 INFO [train.py:996] (3/4) Epoch 8, batch 28750, loss[loss=0.2523, simple_loss=0.3309, pruned_loss=0.08687, over 21819.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3158, pruned_loss=0.08829, over 4269573.58 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:28:41,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.805e+02 4.984e+02 6.274e+02 9.092e+02 1.737e+03, threshold=1.255e+03, percent-clipped=10.0
2023-06-23 09:29:49,437 INFO [train.py:996] (3/4) Epoch 8, batch 28800, loss[loss=0.2283, simple_loss=0.3094, pruned_loss=0.07366, over 21629.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3205, pruned_loss=0.08945, over 4274614.54 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:29:49,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1453572.0, ans=0.2
2023-06-23 09:30:10,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1453632.0, ans=0.125
2023-06-23 09:30:23,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0
2023-06-23 09:31:26,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1453812.0, ans=0.0
2023-06-23 09:31:29,192 INFO [train.py:996] (3/4) Epoch 8, batch 28850, loss[loss=0.2507, simple_loss=0.312, pruned_loss=0.09467, over 21361.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3206, pruned_loss=0.09056, over 4283299.75 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:32:03,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.517e+02 4.921e+02 6.393e+02 7.769e+02 1.909e+03, threshold=1.279e+03, percent-clipped=3.0
2023-06-23 09:32:14,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0
2023-06-23 09:32:17,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1453992.0, ans=0.125
2023-06-23 09:32:26,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1454052.0, ans=0.125
2023-06-23 09:32:31,442 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 09:33:11,071 INFO [train.py:996] (3/4) Epoch 8, batch 28900, loss[loss=0.3014, simple_loss=0.3705, pruned_loss=0.1162, over 21499.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3241, pruned_loss=0.09214, over 4282495.83 frames. ], batch size: 508, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:33:11,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1454172.0, ans=0.0
2023-06-23 09:33:24,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1454172.0, ans=0.025
2023-06-23 09:33:27,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454232.0, ans=0.1
2023-06-23 09:33:50,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1454292.0, ans=0.125
2023-06-23 09:34:28,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0
2023-06-23 09:34:37,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454412.0, ans=0.125
2023-06-23 09:34:52,373 INFO [train.py:996] (3/4) Epoch 8, batch 28950, loss[loss=0.2298, simple_loss=0.2952, pruned_loss=0.08215, over 21624.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.325, pruned_loss=0.09109, over 4278001.96 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:35:01,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1454472.0, ans=10.0
2023-06-23 09:35:41,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 4.837e+02 6.969e+02 9.888e+02 2.996e+03, threshold=1.394e+03, percent-clipped=10.0
2023-06-23 09:36:04,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454652.0, ans=0.1
2023-06-23 09:36:13,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0
2023-06-23 09:36:32,714 INFO [train.py:996] (3/4) Epoch 8, batch 29000, loss[loss=0.273, simple_loss=0.3455, pruned_loss=0.1003, over 21349.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3273, pruned_loss=0.09003, over 4275437.76 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:36:39,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0
2023-06-23 09:36:52,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0
2023-06-23 09:37:36,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1454952.0, ans=0.0
2023-06-23 09:37:41,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1454952.0, ans=0.0
2023-06-23 09:37:56,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0
2023-06-23 09:38:10,314 INFO [train.py:996] (3/4) Epoch 8, batch 29050, loss[loss=0.2639, simple_loss=0.3262, pruned_loss=0.1008, over 21821.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3255, pruned_loss=0.09068, over 4280984.50 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:39:01,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 4.892e+02 6.390e+02 8.567e+02 1.270e+03, threshold=1.278e+03, percent-clipped=0.0
2023-06-23 09:39:46,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1455372.0, ans=0.125
2023-06-23 09:39:48,116 INFO [train.py:996] (3/4) Epoch 8, batch 29100, loss[loss=0.2244, simple_loss=0.2768, pruned_loss=0.08597, over 21320.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3161, pruned_loss=0.08814, over 4280461.86 frames. ], batch size: 473, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:40:52,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1455492.0, ans=0.2
2023-06-23 09:41:09,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1455612.0, ans=0.125
2023-06-23 09:41:26,815 INFO [train.py:996] (3/4) Epoch 8, batch 29150, loss[loss=0.2397, simple_loss=0.3118, pruned_loss=0.08382, over 20009.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3156, pruned_loss=0.08639, over 4276749.80 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:42:14,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.362e+02 4.552e+02 5.999e+02 9.339e+02 2.396e+03, threshold=1.200e+03, percent-clipped=6.0
2023-06-23 09:42:17,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1455792.0, ans=0.125
2023-06-23 09:42:17,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0
2023-06-23 09:42:20,047 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 09:43:01,020 INFO [train.py:996] (3/4) Epoch 8, batch 29200, loss[loss=0.2139, simple_loss=0.2817, pruned_loss=0.07306, over 21224.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3111, pruned_loss=0.08567, over 4274200.39 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:43:01,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1455972.0, ans=0.04949747468305833
2023-06-23 09:43:17,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1455972.0, ans=0.125
2023-06-23 09:44:08,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1456152.0, ans=0.125
2023-06-23 09:44:29,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1456212.0, ans=0.0
2023-06-23 09:44:50,156 INFO [train.py:996] (3/4) Epoch 8, batch 29250, loss[loss=0.2323, simple_loss=0.3166, pruned_loss=0.07398, over 21576.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3099, pruned_loss=0.08296, over 4275053.91 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:45:11,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1456272.0, ans=0.2
2023-06-23 09:45:13,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1456272.0, ans=0.125
2023-06-23 09:45:27,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1456332.0, ans=0.125
2023-06-23 09:45:33,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 4.714e+02 6.019e+02 9.609e+02 2.170e+03, threshold=1.204e+03, percent-clipped=18.0
2023-06-23 09:46:02,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1456512.0, ans=0.0
2023-06-23 09:46:33,647 INFO [train.py:996] (3/4) Epoch 8, batch 29300, loss[loss=0.2436, simple_loss=0.3226, pruned_loss=0.08234, over 21241.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3115, pruned_loss=0.08251, over 4269872.75 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:47:00,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1456632.0, ans=0.1
2023-06-23 09:47:17,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1456692.0, ans=0.125
2023-06-23 09:47:28,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1456752.0, ans=0.125
2023-06-23 09:47:57,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456812.0, ans=0.1
2023-06-23 09:48:00,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1456812.0, ans=0.125
2023-06-23 09:48:17,852 INFO [train.py:996] (3/4) Epoch 8, batch 29350, loss[loss=0.2335, simple_loss=0.308, pruned_loss=0.07953, over 21537.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3078, pruned_loss=0.08174, over 4267753.80 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:48:18,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1456872.0, ans=0.125
2023-06-23 09:48:27,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1456872.0, ans=0.2
2023-06-23 09:48:37,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0
2023-06-23 09:48:48,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1456992.0, ans=0.0
2023-06-23 09:48:52,796 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.303e+02 4.822e+02 6.215e+02 9.294e+02 1.604e+03, threshold=1.243e+03, percent-clipped=12.0
2023-06-23 09:49:07,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1457052.0, ans=0.125
2023-06-23 09:49:56,820 INFO [train.py:996] (3/4) Epoch 8, batch 29400, loss[loss=0.1235, simple_loss=0.1663, pruned_loss=0.04037, over 16104.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.307, pruned_loss=0.07929, over 4262905.11 frames. ], batch size: 60, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:50:22,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1457232.0, ans=0.025
2023-06-23 09:50:22,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1457232.0, ans=0.125
2023-06-23 09:50:44,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1457352.0, ans=0.0
2023-06-23 09:50:55,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1457352.0, ans=0.125
2023-06-23 09:51:35,724 INFO [train.py:996] (3/4) Epoch 8, batch 29450, loss[loss=0.258, simple_loss=0.3328, pruned_loss=0.0916, over 21375.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3053, pruned_loss=0.07814, over 4260985.33 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:51:49,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-06-23 09:51:57,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1457532.0, ans=0.015
2023-06-23 09:52:05,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457532.0, ans=0.1
2023-06-23 09:52:11,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.254e+02 6.189e+02 1.171e+03 1.650e+03 2.483e+03, threshold=2.343e+03, percent-clipped=48.0
2023-06-23 09:52:15,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457592.0, ans=0.1
2023-06-23 09:52:47,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1457652.0, ans=0.125
2023-06-23 09:53:12,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457772.0, ans=0.1
2023-06-23 09:53:14,220 INFO [train.py:996] (3/4) Epoch 8, batch 29500, loss[loss=0.2301, simple_loss=0.2896, pruned_loss=0.08528, over 21257.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3114, pruned_loss=0.08251, over 4270166.76 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:53:55,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0
2023-06-23 09:53:56,393 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 09:54:52,623 INFO [train.py:996] (3/4) Epoch 8, batch 29550, loss[loss=0.2454, simple_loss=0.3193, pruned_loss=0.08575, over 22058.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.31, pruned_loss=0.08359, over 4281023.32 frames. ], batch size: 119, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 09:55:15,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1458132.0, ans=0.05
2023-06-23 09:55:22,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1458132.0, ans=0.125
2023-06-23 09:55:28,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 5.921e+02 7.933e+02 1.124e+03 2.184e+03, threshold=1.587e+03, percent-clipped=0.0
2023-06-23 09:55:36,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1458192.0, ans=0.0
2023-06-23 09:56:16,316 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 09:56:17,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1458312.0, ans=0.0
2023-06-23 09:56:27,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1458372.0, ans=0.125
2023-06-23 09:56:28,481 INFO [train.py:996] (3/4) Epoch 8, batch 29600, loss[loss=0.2936, simple_loss=0.3804, pruned_loss=0.1035, over 21641.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3172, pruned_loss=0.08647, over 4288484.23 frames. ], batch size: 389, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:56:46,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1458432.0, ans=0.125
2023-06-23 09:57:12,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1458492.0, ans=0.125
2023-06-23 09:57:48,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1458552.0, ans=0.025
2023-06-23 09:57:53,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0
2023-06-23 09:58:06,712 INFO [train.py:996] (3/4) Epoch 8, batch 29650, loss[loss=0.2173, simple_loss=0.2828, pruned_loss=0.07592, over 21873.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3168, pruned_loss=0.0832, over 4282555.89 frames. ], batch size: 124, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 09:58:07,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1458672.0, ans=0.0
2023-06-23 09:58:10,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0
2023-06-23 09:58:46,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.053e+02 5.356e+02 7.008e+02 1.123e+03 3.687e+03, threshold=1.402e+03, percent-clipped=10.0
2023-06-23 09:59:03,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1458792.0, ans=0.05
2023-06-23 09:59:33,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458912.0, ans=0.1
2023-06-23 09:59:45,969 INFO [train.py:996] (3/4) Epoch 8, batch 29700, loss[loss=0.2912, simple_loss=0.358, pruned_loss=0.1122, over 21742.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3175, pruned_loss=0.08327, over 4283169.03 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 10:00:54,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1459152.0, ans=0.0
2023-06-23 10:01:15,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0
2023-06-23 10:01:19,983 INFO [train.py:996] (3/4) Epoch 8, batch 29750, loss[loss=0.2414, simple_loss=0.3572, pruned_loss=0.06282, over 19737.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3228, pruned_loss=0.08283, over 4283590.79 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 32.0
2023-06-23 10:01:25,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1459272.0, ans=0.125
2023-06-23 10:01:48,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1459332.0, ans=0.04949747468305833
2023-06-23 10:01:57,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1459392.0, ans=0.04949747468305833
2023-06-23 10:02:05,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.705e+02 6.606e+02 1.008e+03 2.185e+03, threshold=1.321e+03, percent-clipped=11.0
2023-06-23 10:02:34,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0
2023-06-23 10:02:58,152 INFO [train.py:996] (3/4) Epoch 8, batch 29800, loss[loss=0.2951, simple_loss=0.3483, pruned_loss=0.1209, over 21789.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3232, pruned_loss=0.08363, over 4287670.46 frames. ], batch size: 508, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 10:03:29,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1459632.0, ans=0.07
2023-06-23 10:03:31,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1459632.0, ans=0.125
2023-06-23 10:03:34,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1459692.0, ans=0.2
2023-06-23 10:04:30,506 INFO [train.py:996] (3/4) Epoch 8, batch 29850, loss[loss=0.2138, simple_loss=0.2854, pruned_loss=0.07114, over 21599.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3183, pruned_loss=0.08122, over 4274882.08 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 10:04:38,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459872.0, ans=0.1
2023-06-23 10:05:16,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.132e+02 5.162e+02 6.765e+02 9.039e+02 1.623e+03, threshold=1.353e+03, percent-clipped=3.0
2023-06-23 10:06:08,503 INFO [train.py:996] (3/4) Epoch 8, batch 29900, loss[loss=0.2726, simple_loss=0.3318, pruned_loss=0.1067, over 21664.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3171, pruned_loss=0.08268, over 4282567.17 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0
2023-06-23 10:06:31,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1460232.0, ans=0.5
2023-06-23 10:06:45,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5
2023-06-23 10:07:36,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1460412.0, ans=0.125
2023-06-23 10:07:48,912 INFO [train.py:996] (3/4) Epoch 8, batch 29950, loss[loss=0.2821, simple_loss=0.3523, pruned_loss=0.106, over 21284.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3197, pruned_loss=0.08602, over 4277505.35 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:07:49,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1460472.0, ans=0.125
2023-06-23 10:07:54,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1460472.0, ans=0.125
2023-06-23 10:08:03,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1460472.0, ans=0.0
2023-06-23 10:08:08,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1460532.0, ans=0.0
2023-06-23 10:08:45,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.472e+02 5.581e+02 7.305e+02 1.013e+03 2.177e+03, threshold=1.461e+03, percent-clipped=7.0
2023-06-23 10:09:34,048 INFO [train.py:996] (3/4) Epoch 8, batch 30000, loss[loss=0.2342, simple_loss=0.3215, pruned_loss=0.0734, over 21610.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.321, pruned_loss=0.08616, over 4276541.81 frames. ], batch size: 230, lr: 3.61e-03, grad_scale: 32.0
2023-06-23 10:09:34,049 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 10:09:44,034 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.1437, 2.1373, 1.9649, 2.8967], device='cuda:3')
2023-06-23 10:09:54,202 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.244, simple_loss=0.3443, pruned_loss=0.07188, over 1796401.00 frames.
2023-06-23 10:09:54,202 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-23 10:09:56,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1460772.0, ans=0.0
2023-06-23 10:10:12,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1460772.0, ans=0.125
2023-06-23 10:11:19,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1461012.0, ans=0.0
2023-06-23 10:11:46,626 INFO [train.py:996] (3/4) Epoch 8, batch 30050, loss[loss=0.2941, simple_loss=0.3944, pruned_loss=0.09689, over 21820.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3244, pruned_loss=0.08348, over 4277249.10 frames. ], batch size: 371, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:11:53,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0
2023-06-23 10:12:25,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 4.815e+02 7.305e+02 9.705e+02 3.214e+03, threshold=1.461e+03, percent-clipped=9.0
2023-06-23 10:12:37,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1461192.0, ans=0.2
2023-06-23 10:13:26,295 INFO [train.py:996] (3/4) Epoch 8, batch 30100, loss[loss=0.1937, simple_loss=0.2584, pruned_loss=0.06446, over 21330.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3229, pruned_loss=0.08332, over 4278404.65 frames. ], batch size: 211, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:13:27,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0
2023-06-23 10:13:28,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461372.0, ans=0.1
2023-06-23 10:13:34,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1461372.0, ans=0.125
2023-06-23 10:13:36,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5
2023-06-23 10:13:58,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1461492.0, ans=0.125
2023-06-23 10:14:37,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1461552.0, ans=0.1
2023-06-23 10:15:05,448 INFO [train.py:996] (3/4) Epoch 8, batch 30150, loss[loss=0.2717, simple_loss=0.3317, pruned_loss=0.1059, over 21591.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.319, pruned_loss=0.08487, over 4267288.39 frames. ], batch size: 230, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:15:07,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1461672.0, ans=0.0
2023-06-23 10:15:34,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1461732.0, ans=0.0
2023-06-23 10:15:59,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.498e+02 5.519e+02 7.633e+02 1.440e+03, threshold=1.104e+03, percent-clipped=0.0
2023-06-23 10:16:26,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0
2023-06-23 10:16:47,016 INFO [train.py:996] (3/4) Epoch 8, batch 30200, loss[loss=0.2669, simple_loss=0.35, pruned_loss=0.09189, over 21767.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3195, pruned_loss=0.08319, over 4262163.23 frames. ], batch size: 441, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:16:57,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1461972.0, ans=0.0
2023-06-23 10:17:28,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1462032.0, ans=0.025
2023-06-23 10:18:13,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1462212.0, ans=0.0
2023-06-23 10:18:20,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5
2023-06-23 10:18:21,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1462212.0, ans=0.2
2023-06-23 10:18:27,125 INFO [train.py:996] (3/4) Epoch 8, batch 30250, loss[loss=0.2282, simple_loss=0.3058, pruned_loss=0.07525, over 20118.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3273, pruned_loss=0.08558, over 4264541.09 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:18:46,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1462272.0, ans=0.2
2023-06-23 10:19:18,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0
2023-06-23 10:19:25,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.739e+02 8.338e+02 1.276e+03 3.132e+03, threshold=1.668e+03, percent-clipped=33.0
2023-06-23 10:19:30,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462392.0, ans=0.125
2023-06-23 10:19:36,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1462452.0, ans=0.2
2023-06-23 10:19:38,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1462452.0, ans=0.2
2023-06-23 10:20:10,513 INFO [train.py:996] (3/4) Epoch 8, batch 30300, loss[loss=0.1968, simple_loss=0.2635, pruned_loss=0.06508, over 21880.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3234, pruned_loss=0.08441, over 4267412.35 frames. ], batch size: 373, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:20:17,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1462572.0, ans=0.125
2023-06-23 10:20:32,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1462572.0, ans=0.07
2023-06-23 10:20:55,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0
2023-06-23 10:20:56,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1462692.0, ans=0.125
2023-06-23 10:21:16,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1462752.0, ans=0.0
2023-06-23 10:21:34,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1462812.0, ans=0.0
2023-06-23 10:22:08,466 INFO [train.py:996] (3/4) Epoch 8, batch 30350, loss[loss=0.2147, simple_loss=0.2884, pruned_loss=0.07049, over 21665.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3239, pruned_loss=0.0855, over 4272033.27 frames. ], batch size: 247, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:22:33,425 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 10:22:37,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1462992.0, ans=0.04949747468305833
2023-06-23 10:22:43,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.418e+02 5.113e+02 8.159e+02 1.296e+03 2.782e+03, threshold=1.632e+03, percent-clipped=10.0
2023-06-23 10:22:53,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463052.0, ans=0.125
2023-06-23 10:23:13,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0
2023-06-23 10:23:21,506 INFO [train.py:996] (3/4) Epoch 8, batch 30400, loss[loss=0.2154, simple_loss=0.2751, pruned_loss=0.07782, over 20191.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3193, pruned_loss=0.08437, over 4258622.48 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 32.0
2023-06-23 10:23:22,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0
2023-06-23 10:23:28,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5
2023-06-23 10:24:25,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1463352.0, ans=0.2
2023-06-23 10:24:30,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1463412.0, ans=0.0
2023-06-23 10:24:45,422 INFO [train.py:996] (3/4) Epoch 8, batch 30450, loss[loss=0.2708, simple_loss=0.3966, pruned_loss=0.0725, over 19883.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3204, pruned_loss=0.08325, over 4200107.87 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0
2023-06-23 10:25:05,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1463532.0, ans=0.125
2023-06-23 10:25:08,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1463532.0, ans=0.07
2023-06-23 10:25:16,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1463592.0, ans=0.125
2023-06-23 10:25:24,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.983e+02 7.886e+02 1.299e+03 2.180e+03 7.301e+03, threshold=2.598e+03, percent-clipped=35.0
2023-06-23 10:27:25,953 INFO [train.py:996] (3/4) Epoch 9, batch 0, loss[loss=0.2467, simple_loss=0.3037, pruned_loss=0.09483, over 21357.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3037, pruned_loss=0.09483, over 21357.00 frames. ], batch size: 473, lr: 3.39e-03, grad_scale: 32.0
2023-06-23 10:27:25,954 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 10:27:41,474 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2407, simple_loss=0.3498, pruned_loss=0.06579, over 1796401.00 frames.
2023-06-23 10:27:41,475 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-23 10:28:11,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463802.0, ans=0.1
2023-06-23 10:29:00,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1463922.0, ans=0.0
2023-06-23 10:29:21,242 INFO [train.py:996] (3/4) Epoch 9, batch 50, loss[loss=0.2779, simple_loss=0.358, pruned_loss=0.09888, over 21627.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.328, pruned_loss=0.0854, over 965733.42 frames. ], batch size: 414, lr: 3.39e-03, grad_scale: 32.0
2023-06-23 10:30:00,075 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 10:30:21,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1464222.0, ans=0.125
2023-06-23 10:30:22,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.385e+02 5.823e+02 9.334e+02 1.610e+03 5.016e+03, threshold=1.867e+03, percent-clipped=15.0
2023-06-23 10:30:25,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1464222.0, ans=0.125
2023-06-23 10:30:38,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0
2023-06-23 10:30:58,860 INFO [train.py:996] (3/4) Epoch 9, batch 100, loss[loss=0.2832, simple_loss=0.3663, pruned_loss=0.1001, over 21249.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.344, pruned_loss=0.08861, over 1707375.60 frames. ], batch size: 143, lr: 3.39e-03, grad_scale: 32.0
2023-06-23 10:31:00,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464342.0, ans=0.0
2023-06-23 10:31:00,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1464342.0, ans=0.0
2023-06-23 10:31:50,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1464462.0, ans=0.125
2023-06-23 10:31:58,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1464522.0, ans=0.125
2023-06-23 10:32:20,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0
2023-06-23 10:32:28,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1464582.0, ans=0.2
2023-06-23 10:32:30,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1464582.0, ans=0.0
2023-06-23 10:32:35,503 INFO [train.py:996] (3/4) Epoch 9, batch 150, loss[loss=0.2748, simple_loss=0.3482, pruned_loss=0.1007, over 21859.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3451, pruned_loss=0.08901, over 2283809.99 frames.
], batch size: 118, lr: 3.39e-03, grad_scale: 16.0 2023-06-23 10:32:51,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1464702.0, ans=0.125 2023-06-23 10:33:18,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1464762.0, ans=0.2 2023-06-23 10:33:38,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 5.241e+02 6.632e+02 9.762e+02 2.000e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-23 10:33:49,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1464822.0, ans=0.125 2023-06-23 10:34:05,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464882.0, ans=0.1 2023-06-23 10:34:12,621 INFO [train.py:996] (3/4) Epoch 9, batch 200, loss[loss=0.2522, simple_loss=0.3109, pruned_loss=0.0968, over 21906.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3398, pruned_loss=0.08753, over 2727907.36 frames. ], batch size: 98, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:34:29,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1465002.0, ans=0.125 2023-06-23 10:34:43,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1465002.0, ans=0.125 2023-06-23 10:34:56,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1465062.0, ans=0.125 2023-06-23 10:35:20,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1465122.0, ans=0.04949747468305833 2023-06-23 10:35:50,367 INFO [train.py:996] (3/4) Epoch 9, batch 250, loss[loss=0.3062, simple_loss=0.3736, pruned_loss=0.1194, over 21593.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3324, pruned_loss=0.08589, over 3066796.35 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:36:08,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1465302.0, ans=0.0 2023-06-23 10:36:29,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1465362.0, ans=0.125 2023-06-23 10:36:47,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1465422.0, ans=0.0 2023-06-23 10:36:55,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 4.793e+02 6.542e+02 9.601e+02 1.948e+03, threshold=1.308e+03, percent-clipped=7.0 2023-06-23 10:37:08,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1465422.0, ans=0.2 2023-06-23 10:37:10,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-23 10:37:25,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1465482.0, ans=0.125 2023-06-23 10:37:29,651 INFO [train.py:996] (3/4) Epoch 9, batch 300, loss[loss=0.2163, simple_loss=0.2793, pruned_loss=0.07667, over 21301.00 frames. 
], tot_loss[loss=0.2478, simple_loss=0.3261, pruned_loss=0.08473, over 3333990.08 frames. ], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:37:31,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1465542.0, ans=0.125 2023-06-23 10:38:33,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1465722.0, ans=0.125 2023-06-23 10:38:37,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1465722.0, ans=0.2 2023-06-23 10:39:04,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-23 10:39:11,145 INFO [train.py:996] (3/4) Epoch 9, batch 350, loss[loss=0.2617, simple_loss=0.3329, pruned_loss=0.0953, over 20771.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3194, pruned_loss=0.08184, over 3540550.15 frames. ], batch size: 609, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:39:13,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1465842.0, ans=0.125 2023-06-23 10:39:16,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1465842.0, ans=0.125 2023-06-23 10:39:42,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465902.0, ans=0.1 2023-06-23 10:39:52,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1465962.0, ans=0.0 2023-06-23 10:40:13,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1466022.0, ans=0.0 2023-06-23 10:40:16,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 5.700e+02 8.166e+02 1.374e+03 3.481e+03, threshold=1.633e+03, percent-clipped=26.0 2023-06-23 10:40:30,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1466022.0, ans=0.0 2023-06-23 10:40:52,630 INFO [train.py:996] (3/4) Epoch 9, batch 400, loss[loss=0.2324, simple_loss=0.2895, pruned_loss=0.08762, over 21263.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3134, pruned_loss=0.08095, over 3708314.66 frames. ], batch size: 471, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:41:02,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1466142.0, ans=0.2 2023-06-23 10:41:04,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466142.0, ans=0.1 2023-06-23 10:41:06,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. 
limit=10.0 2023-06-23 10:41:08,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1466202.0, ans=0.0 2023-06-23 10:41:39,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466262.0, ans=0.1 2023-06-23 10:41:52,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1466262.0, ans=0.0 2023-06-23 10:41:56,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-23 10:41:59,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1466322.0, ans=0.125 2023-06-23 10:42:16,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1466382.0, ans=0.125 2023-06-23 10:42:18,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466382.0, ans=0.1 2023-06-23 10:42:35,009 INFO [train.py:996] (3/4) Epoch 9, batch 450, loss[loss=0.1952, simple_loss=0.2601, pruned_loss=0.06522, over 21433.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3099, pruned_loss=0.07998, over 3838389.32 frames. ], batch size: 212, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:42:49,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=12.0 2023-06-23 10:42:53,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466502.0, ans=0.1 2023-06-23 10:43:23,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1466562.0, ans=0.125 2023-06-23 10:43:40,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 7.104e+02 9.718e+02 1.338e+03 3.704e+03, threshold=1.944e+03, percent-clipped=17.0 2023-06-23 10:44:09,224 INFO [train.py:996] (3/4) Epoch 9, batch 500, loss[loss=0.2137, simple_loss=0.271, pruned_loss=0.07818, over 20815.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3089, pruned_loss=0.07878, over 3929967.05 frames. ], batch size: 608, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:44:17,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1466742.0, ans=0.0 2023-06-23 10:44:21,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-23 10:44:29,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1466802.0, ans=0.0 2023-06-23 10:45:03,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-06-23 10:45:18,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1466922.0, ans=0.0 2023-06-23 10:45:39,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1466982.0, ans=0.0 2023-06-23 10:45:47,782 INFO [train.py:996] (3/4) Epoch 9, batch 550, loss[loss=0.23, simple_loss=0.3003, pruned_loss=0.07992, over 21888.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3129, pruned_loss=0.07922, over 4004298.65 frames. ], batch size: 118, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:45:55,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-23 10:46:23,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1467102.0, ans=0.125 2023-06-23 10:46:53,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.598e+02 6.514e+02 1.038e+03 2.454e+03, threshold=1.303e+03, percent-clipped=6.0 2023-06-23 10:47:02,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1467222.0, ans=0.125 2023-06-23 10:47:22,482 INFO [train.py:996] (3/4) Epoch 9, batch 600, loss[loss=0.218, simple_loss=0.2878, pruned_loss=0.07414, over 21736.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3176, pruned_loss=0.0798, over 4067289.01 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:48:04,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1467402.0, ans=0.0 2023-06-23 10:48:04,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-23 10:48:11,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1467462.0, ans=0.125 2023-06-23 10:48:12,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467462.0, ans=0.1 2023-06-23 10:48:13,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1467462.0, ans=0.125 2023-06-23 10:49:00,645 INFO [train.py:996] (3/4) Epoch 9, batch 650, loss[loss=0.2874, simple_loss=0.3888, pruned_loss=0.09295, over 21647.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3209, pruned_loss=0.08155, over 4109309.89 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:50:06,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.773e+02 6.710e+02 1.032e+03 2.196e+03, threshold=1.342e+03, percent-clipped=13.0 2023-06-23 10:50:34,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1467942.0, ans=0.0 2023-06-23 10:50:36,051 INFO [train.py:996] (3/4) Epoch 9, batch 700, loss[loss=0.2234, simple_loss=0.2866, pruned_loss=0.08015, over 21661.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3191, pruned_loss=0.0823, over 4152321.04 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:52:09,587 INFO [train.py:996] (3/4) Epoch 9, batch 750, loss[loss=0.2873, simple_loss=0.37, pruned_loss=0.1023, over 21724.00 frames. 
], tot_loss[loss=0.2416, simple_loss=0.3169, pruned_loss=0.08316, over 4192561.77 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:52:12,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-23 10:52:13,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1468242.0, ans=0.0 2023-06-23 10:52:58,019 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.405e-03 2023-06-23 10:53:15,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.132e+02 8.530e+02 1.237e+03 2.839e+03, threshold=1.706e+03, percent-clipped=17.0 2023-06-23 10:53:33,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1468482.0, ans=0.1 2023-06-23 10:53:38,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1468482.0, ans=0.125 2023-06-23 10:53:44,249 INFO [train.py:996] (3/4) Epoch 9, batch 800, loss[loss=0.2418, simple_loss=0.3036, pruned_loss=0.09004, over 21780.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3152, pruned_loss=0.08348, over 4219659.08 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:54:07,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1468602.0, ans=0.0 2023-06-23 10:55:19,762 INFO [train.py:996] (3/4) Epoch 9, batch 850, loss[loss=0.2517, simple_loss=0.3176, pruned_loss=0.09289, over 21909.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3139, pruned_loss=0.08345, over 4245519.32 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:56:00,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1468902.0, ans=0.2 2023-06-23 10:56:14,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-23 10:56:16,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1468962.0, ans=0.125 2023-06-23 10:56:20,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1468962.0, ans=0.125 2023-06-23 10:56:31,569 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.922e+02 9.431e+02 1.406e+03 2.564e+03, threshold=1.886e+03, percent-clipped=15.0 2023-06-23 10:56:45,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1469082.0, ans=0.0 2023-06-23 10:57:05,006 INFO [train.py:996] (3/4) Epoch 9, batch 900, loss[loss=0.2432, simple_loss=0.3178, pruned_loss=0.08429, over 21825.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3111, pruned_loss=0.08324, over 4257080.66 frames. 
], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:57:16,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1469142.0, ans=0.125 2023-06-23 10:57:53,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469262.0, ans=0.1 2023-06-23 10:57:58,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1469262.0, ans=0.125 2023-06-23 10:58:14,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1469322.0, ans=0.0 2023-06-23 10:58:45,641 INFO [train.py:996] (3/4) Epoch 9, batch 950, loss[loss=0.2706, simple_loss=0.3402, pruned_loss=0.1005, over 21924.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3102, pruned_loss=0.08269, over 4266129.07 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:59:06,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1469442.0, ans=0.125 2023-06-23 10:59:53,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.620e+02 5.465e+02 8.299e+02 1.252e+03 2.692e+03, threshold=1.660e+03, percent-clipped=4.0 2023-06-23 10:59:56,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469622.0, ans=0.125 2023-06-23 11:00:25,485 INFO [train.py:996] (3/4) Epoch 9, batch 1000, loss[loss=0.212, simple_loss=0.2816, pruned_loss=0.07118, over 21286.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3097, pruned_loss=0.0821, over 4273276.18 frames. ], batch size: 144, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:00:51,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1469802.0, ans=0.125 2023-06-23 11:01:01,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-23 11:01:09,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1469802.0, ans=0.09899494936611666 2023-06-23 11:02:11,530 INFO [train.py:996] (3/4) Epoch 9, batch 1050, loss[loss=0.2773, simple_loss=0.3534, pruned_loss=0.1006, over 21612.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3116, pruned_loss=0.0826, over 4276925.23 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:02:13,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1470042.0, ans=0.125 2023-06-23 11:02:42,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1470102.0, ans=0.125 2023-06-23 11:03:15,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.887e+02 6.781e+02 8.518e+02 2.404e+03, threshold=1.356e+03, percent-clipped=1.0 2023-06-23 11:03:58,862 INFO [train.py:996] (3/4) Epoch 9, batch 1100, loss[loss=0.2246, simple_loss=0.3045, pruned_loss=0.07233, over 21641.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3122, pruned_loss=0.082, over 4277617.59 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:04:02,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=15.0 2023-06-23 11:04:33,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1470402.0, ans=0.125 2023-06-23 11:05:11,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1470582.0, ans=0.125 2023-06-23 11:05:43,341 INFO [train.py:996] (3/4) Epoch 9, batch 1150, loss[loss=0.2648, simple_loss=0.3531, pruned_loss=0.08824, over 21659.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3135, pruned_loss=0.08212, over 4278934.05 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:06:43,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.848e+02 5.352e+02 7.597e+02 1.030e+03 2.056e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:07:25,427 INFO [train.py:996] (3/4) Epoch 9, batch 1200, loss[loss=0.2404, simple_loss=0.3123, pruned_loss=0.08428, over 21673.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3139, pruned_loss=0.08241, over 4278142.92 frames. ], batch size: 231, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:07:37,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1470942.0, ans=0.125 2023-06-23 11:07:38,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1470942.0, ans=0.2 2023-06-23 11:07:48,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1471002.0, ans=0.04949747468305833 2023-06-23 11:08:56,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-23 11:09:01,888 INFO [train.py:996] (3/4) Epoch 9, batch 1250, loss[loss=0.2175, simple_loss=0.2814, pruned_loss=0.07686, over 21185.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3138, pruned_loss=0.08225, over 4275510.06 frames. ], batch size: 608, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:09:16,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1471302.0, ans=0.0 2023-06-23 11:10:01,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 4.897e+02 6.693e+02 9.449e+02 1.847e+03, threshold=1.339e+03, percent-clipped=0.0 2023-06-23 11:10:40,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1471542.0, ans=0.0 2023-06-23 11:10:41,427 INFO [train.py:996] (3/4) Epoch 9, batch 1300, loss[loss=0.2259, simple_loss=0.3021, pruned_loss=0.07481, over 21482.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.315, pruned_loss=0.08289, over 4280319.58 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:10:45,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. 
limit=22.5 2023-06-23 11:10:54,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:10:57,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471602.0, ans=0.1 2023-06-23 11:11:37,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-23 11:12:14,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1471782.0, ans=0.0 2023-06-23 11:12:15,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1471842.0, ans=0.2 2023-06-23 11:12:16,952 INFO [train.py:996] (3/4) Epoch 9, batch 1350, loss[loss=0.2707, simple_loss=0.3287, pruned_loss=0.1064, over 21471.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3163, pruned_loss=0.08295, over 4280472.05 frames. ], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:12:55,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. limit=6.0 2023-06-23 11:12:59,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1471962.0, ans=0.125 2023-06-23 11:13:16,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.586e+02 4.783e+02 6.688e+02 9.049e+02 1.938e+03, threshold=1.338e+03, percent-clipped=9.0 2023-06-23 11:13:34,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-06-23 11:13:57,420 INFO [train.py:996] (3/4) Epoch 9, batch 1400, loss[loss=0.2411, simple_loss=0.3287, pruned_loss=0.07669, over 21816.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3157, pruned_loss=0.0827, over 4286494.72 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:14:37,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1472262.0, ans=0.0 2023-06-23 11:14:52,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-23 11:15:24,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1472382.0, ans=0.0 2023-06-23 11:15:39,588 INFO [train.py:996] (3/4) Epoch 9, batch 1450, loss[loss=0.2643, simple_loss=0.3374, pruned_loss=0.09559, over 21898.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3149, pruned_loss=0.08318, over 4283859.33 frames. 
], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:15:44,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1472442.0, ans=0.0 2023-06-23 11:15:49,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1472442.0, ans=0.125 2023-06-23 11:15:54,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1472502.0, ans=0.125 2023-06-23 11:15:59,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1472502.0, ans=0.07 2023-06-23 11:16:04,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-23 11:16:16,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1472562.0, ans=0.125 2023-06-23 11:16:37,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-23 11:16:44,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.412e+02 5.469e+02 7.594e+02 1.040e+03 1.854e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:17:04,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1472682.0, ans=0.0 2023-06-23 11:17:20,408 INFO [train.py:996] (3/4) Epoch 9, batch 1500, loss[loss=0.2565, simple_loss=0.3273, pruned_loss=0.09286, over 21672.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3164, pruned_loss=0.0844, over 4277946.22 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:18:53,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.12 vs. limit=10.0 2023-06-23 11:19:02,987 INFO [train.py:996] (3/4) Epoch 9, batch 1550, loss[loss=0.2319, simple_loss=0.2991, pruned_loss=0.08233, over 21637.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.314, pruned_loss=0.08293, over 4282399.41 frames. ], batch size: 263, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:19:31,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-23 11:20:14,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.299e+02 6.765e+02 1.096e+03 1.841e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 11:20:19,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1473222.0, ans=0.1 2023-06-23 11:20:36,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1473282.0, ans=0.2 2023-06-23 11:20:40,598 INFO [train.py:996] (3/4) Epoch 9, batch 1600, loss[loss=0.1974, simple_loss=0.2704, pruned_loss=0.06218, over 21817.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3122, pruned_loss=0.08203, over 4287651.44 frames. 
], batch size: 282, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:21:20,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1473462.0, ans=0.0 2023-06-23 11:22:20,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1473582.0, ans=0.125 2023-06-23 11:22:23,014 INFO [train.py:996] (3/4) Epoch 9, batch 1650, loss[loss=0.2398, simple_loss=0.3185, pruned_loss=0.08053, over 21779.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3115, pruned_loss=0.08151, over 4274109.84 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:22:24,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1473642.0, ans=0.035 2023-06-23 11:23:41,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.690e+02 7.589e+02 1.047e+03 2.202e+03, threshold=1.518e+03, percent-clipped=10.0 2023-06-23 11:23:46,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1473822.0, ans=0.2 2023-06-23 11:24:06,460 INFO [train.py:996] (3/4) Epoch 9, batch 1700, loss[loss=0.2457, simple_loss=0.3181, pruned_loss=0.08666, over 21742.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3134, pruned_loss=0.08306, over 4281587.43 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:25:16,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-23 11:25:54,638 INFO [train.py:996] (3/4) Epoch 9, batch 1750, loss[loss=0.2423, simple_loss=0.3447, pruned_loss=0.06994, over 19856.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3151, pruned_loss=0.08149, over 4280511.69 frames. ], batch size: 703, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:26:01,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1474242.0, ans=0.125 2023-06-23 11:26:03,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1474242.0, ans=0.0 2023-06-23 11:26:46,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1474362.0, ans=0.0 2023-06-23 11:27:13,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 6.517e+02 8.827e+02 1.421e+03 2.550e+03, threshold=1.765e+03, percent-clipped=23.0 2023-06-23 11:27:36,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-23 11:27:43,580 INFO [train.py:996] (3/4) Epoch 9, batch 1800, loss[loss=0.239, simple_loss=0.3366, pruned_loss=0.07067, over 21638.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.315, pruned_loss=0.07997, over 4274103.17 frames. 
], batch size: 414, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:28:31,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1474662.0, ans=0.0 2023-06-23 11:28:36,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1474662.0, ans=0.125 2023-06-23 11:28:46,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1474722.0, ans=0.025 2023-06-23 11:28:52,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1474722.0, ans=0.0 2023-06-23 11:29:25,344 INFO [train.py:996] (3/4) Epoch 9, batch 1850, loss[loss=0.2465, simple_loss=0.3526, pruned_loss=0.07024, over 21528.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3173, pruned_loss=0.0784, over 4279068.02 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:29:35,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1474842.0, ans=0.0 2023-06-23 11:29:53,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1474902.0, ans=0.1 2023-06-23 11:30:37,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.456e+02 5.587e+02 8.108e+02 1.184e+03 2.810e+03, threshold=1.622e+03, percent-clipped=5.0 2023-06-23 11:30:47,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1475082.0, ans=0.2 2023-06-23 11:30:54,769 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:31:11,665 INFO [train.py:996] (3/4) Epoch 9, batch 1900, loss[loss=0.2061, simple_loss=0.2772, pruned_loss=0.06746, over 21410.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3168, pruned_loss=0.07886, over 4274176.43 frames. ], batch size: 194, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:32:12,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1475322.0, ans=0.125 2023-06-23 11:32:23,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1475322.0, ans=0.05 2023-06-23 11:32:25,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1475382.0, ans=0.125 2023-06-23 11:32:25,333 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:32:42,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-23 11:32:58,407 INFO [train.py:996] (3/4) Epoch 9, batch 1950, loss[loss=0.2411, simple_loss=0.3411, pruned_loss=0.07051, over 21651.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3135, pruned_loss=0.07842, over 4277740.30 frames. ], batch size: 414, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:33:38,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. 
limit=22.5 2023-06-23 11:33:42,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1475562.0, ans=0.2 2023-06-23 11:33:46,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1475562.0, ans=0.125 2023-06-23 11:33:49,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1475562.0, ans=0.1 2023-06-23 11:33:57,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1475622.0, ans=0.0 2023-06-23 11:34:00,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 6.151e+02 9.472e+02 1.342e+03 2.834e+03, threshold=1.894e+03, percent-clipped=13.0 2023-06-23 11:34:10,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1475682.0, ans=0.5 2023-06-23 11:34:20,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-23 11:34:40,440 INFO [train.py:996] (3/4) Epoch 9, batch 2000, loss[loss=0.2342, simple_loss=0.3234, pruned_loss=0.07248, over 20008.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3088, pruned_loss=0.07748, over 4269006.82 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:35:12,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=22.5 2023-06-23 11:35:40,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1475922.0, ans=0.125 2023-06-23 11:36:16,262 INFO [train.py:996] (3/4) Epoch 9, batch 2050, loss[loss=0.2289, simple_loss=0.3182, pruned_loss=0.06978, over 21606.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3099, pruned_loss=0.07728, over 4272123.55 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:36:18,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1476042.0, ans=0.125 2023-06-23 11:36:36,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-23 11:37:17,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.522e+02 6.873e+02 9.848e+02 2.030e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 11:37:56,776 INFO [train.py:996] (3/4) Epoch 9, batch 2100, loss[loss=0.2705, simple_loss=0.3375, pruned_loss=0.1017, over 21311.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3102, pruned_loss=0.07855, over 4272633.80 frames. 
], batch size: 176, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:38:06,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1476342.0, ans=0.0 2023-06-23 11:38:39,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476462.0, ans=0.1 2023-06-23 11:39:27,564 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:39:33,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1476582.0, ans=0.0 2023-06-23 11:39:37,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1476582.0, ans=0.125 2023-06-23 11:39:39,879 INFO [train.py:996] (3/4) Epoch 9, batch 2150, loss[loss=0.2558, simple_loss=0.3308, pruned_loss=0.09043, over 21226.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3125, pruned_loss=0.08057, over 4264714.30 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:40:20,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1476762.0, ans=0.0 2023-06-23 11:40:42,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 6.104e+02 8.851e+02 1.376e+03 2.645e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-23 11:40:53,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1476822.0, ans=0.125 2023-06-23 11:41:21,792 INFO [train.py:996] (3/4) Epoch 9, batch 2200, loss[loss=0.2513, simple_loss=0.3029, pruned_loss=0.09983, over 21683.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3133, pruned_loss=0.08075, over 4265932.36 frames. ], batch size: 417, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:41:42,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-23 11:41:44,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477002.0, ans=0.1 2023-06-23 11:41:49,813 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:43:02,281 INFO [train.py:996] (3/4) Epoch 9, batch 2250, loss[loss=0.2347, simple_loss=0.2988, pruned_loss=0.08534, over 21737.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3116, pruned_loss=0.07895, over 4264996.92 frames. 
], batch size: 351, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:43:23,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1477302.0, ans=0.125 2023-06-23 11:43:39,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1477362.0, ans=0.125 2023-06-23 11:43:51,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1477362.0, ans=0.0 2023-06-23 11:43:56,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1477422.0, ans=0.0 2023-06-23 11:43:56,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477422.0, ans=0.1 2023-06-23 11:44:09,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.493e+02 8.283e+02 1.333e+03 2.509e+03, threshold=1.657e+03, percent-clipped=6.0 2023-06-23 11:44:37,504 INFO [train.py:996] (3/4) Epoch 9, batch 2300, loss[loss=0.252, simple_loss=0.2952, pruned_loss=0.1044, over 21538.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3095, pruned_loss=0.07944, over 4271530.36 frames. ], batch size: 512, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:02,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1477782.0, ans=0.125 2023-06-23 11:46:18,219 INFO [train.py:996] (3/4) Epoch 9, batch 2350, loss[loss=0.2423, simple_loss=0.3284, pruned_loss=0.07811, over 19804.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3069, pruned_loss=0.07926, over 4266963.83 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:28,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1477842.0, ans=0.0 2023-06-23 11:46:49,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-23 11:47:08,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1477962.0, ans=0.2 2023-06-23 11:47:12,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-23 11:47:36,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.310e+02 7.234e+02 1.027e+03 2.720e+03, threshold=1.447e+03, percent-clipped=6.0 2023-06-23 11:47:45,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1478082.0, ans=0.0 2023-06-23 11:48:06,153 INFO [train.py:996] (3/4) Epoch 9, batch 2400, loss[loss=0.3497, simple_loss=0.3942, pruned_loss=0.1526, over 21437.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3111, pruned_loss=0.08161, over 4265469.64 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:49:39,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478382.0, ans=0.125 2023-06-23 11:49:43,570 INFO [train.py:996] (3/4) Epoch 9, batch 2450, loss[loss=0.2989, simple_loss=0.3621, pruned_loss=0.1179, over 21236.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3165, pruned_loss=0.08537, over 4263002.46 frames. 
], batch size: 159, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:49:45,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478442.0, ans=0.125 2023-06-23 11:49:53,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1478442.0, ans=0.125 2023-06-23 11:50:49,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1478622.0, ans=0.0 2023-06-23 11:50:49,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1478622.0, ans=0.125 2023-06-23 11:50:55,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.454e+02 8.636e+02 1.143e+03 3.101e+03, threshold=1.727e+03, percent-clipped=10.0 2023-06-23 11:51:10,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478682.0, ans=0.1 2023-06-23 11:51:24,309 INFO [train.py:996] (3/4) Epoch 9, batch 2500, loss[loss=0.2004, simple_loss=0.2682, pruned_loss=0.06626, over 21612.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3126, pruned_loss=0.08433, over 4262664.92 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:51:41,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1478802.0, ans=15.0 2023-06-23 11:51:52,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. limit=10.0 2023-06-23 11:51:55,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1478862.0, ans=0.09899494936611666 2023-06-23 11:51:56,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1478862.0, ans=0.125 2023-06-23 11:53:01,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1478982.0, ans=0.125 2023-06-23 11:53:02,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.93 vs. limit=15.0 2023-06-23 11:53:04,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1479042.0, ans=0.125 2023-06-23 11:53:05,809 INFO [train.py:996] (3/4) Epoch 9, batch 2550, loss[loss=0.199, simple_loss=0.3154, pruned_loss=0.04128, over 20777.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3103, pruned_loss=0.08219, over 4270159.56 frames. 
], batch size: 608, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:53:25,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1479102.0, ans=0.125 2023-06-23 11:53:48,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1479162.0, ans=0.125 2023-06-23 11:54:19,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 7.204e+02 9.571e+02 1.455e+03 2.660e+03, threshold=1.914e+03, percent-clipped=10.0 2023-06-23 11:54:22,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=8.0 2023-06-23 11:54:46,745 INFO [train.py:996] (3/4) Epoch 9, batch 2600, loss[loss=0.256, simple_loss=0.3284, pruned_loss=0.09179, over 21862.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3131, pruned_loss=0.08306, over 4273164.99 frames. ], batch size: 107, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:54:56,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1479342.0, ans=0.0 2023-06-23 11:55:25,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1479462.0, ans=0.0 2023-06-23 11:55:45,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1479522.0, ans=0.1 2023-06-23 11:56:03,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1479522.0, ans=0.125 2023-06-23 11:56:10,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1479582.0, ans=0.125 2023-06-23 11:56:15,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1479582.0, ans=0.125 2023-06-23 11:56:18,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=22.5 2023-06-23 11:56:28,270 INFO [train.py:996] (3/4) Epoch 9, batch 2650, loss[loss=0.2396, simple_loss=0.3246, pruned_loss=0.07731, over 21350.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3126, pruned_loss=0.08286, over 4274981.06 frames. ], batch size: 211, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:56:36,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1479642.0, ans=0.125 2023-06-23 11:56:46,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1479702.0, ans=0.125 2023-06-23 11:56:47,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1479702.0, ans=0.0 2023-06-23 11:57:37,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.783e+02 6.164e+02 7.850e+02 1.193e+03 2.220e+03, threshold=1.570e+03, percent-clipped=3.0 2023-06-23 11:58:05,281 INFO [train.py:996] (3/4) Epoch 9, batch 2700, loss[loss=0.2943, simple_loss=0.3606, pruned_loss=0.114, over 21554.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.311, pruned_loss=0.08265, over 4282730.81 frames. 
], batch size: 509, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:58:46,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480062.0, ans=0.1 2023-06-23 11:59:17,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1480122.0, ans=0.125 2023-06-23 11:59:37,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1480182.0, ans=0.125 2023-06-23 11:59:43,040 INFO [train.py:996] (3/4) Epoch 9, batch 2750, loss[loss=0.2591, simple_loss=0.3195, pruned_loss=0.09937, over 21938.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.311, pruned_loss=0.08261, over 4287892.36 frames. ], batch size: 316, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:00:27,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.32 vs. limit=10.0 2023-06-23 12:00:43,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1480422.0, ans=0.0 2023-06-23 12:00:58,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 5.584e+02 7.738e+02 1.130e+03 2.409e+03, threshold=1.548e+03, percent-clipped=8.0 2023-06-23 12:01:23,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5 2023-06-23 12:01:27,090 INFO [train.py:996] (3/4) Epoch 9, batch 2800, loss[loss=0.2588, simple_loss=0.3273, pruned_loss=0.09516, over 21623.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.315, pruned_loss=0.08362, over 4289056.83 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 12:01:44,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1480602.0, ans=0.2 2023-06-23 12:02:24,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1480662.0, ans=0.125 2023-06-23 12:03:05,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1480782.0, ans=0.0 2023-06-23 12:03:09,764 INFO [train.py:996] (3/4) Epoch 9, batch 2850, loss[loss=0.2182, simple_loss=0.289, pruned_loss=0.07372, over 21740.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3158, pruned_loss=0.08425, over 4289747.44 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:04:04,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1480962.0, ans=0.0 2023-06-23 12:04:25,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.732e+02 5.955e+02 8.768e+02 1.383e+03 2.997e+03, threshold=1.754e+03, percent-clipped=21.0 2023-06-23 12:04:38,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1481082.0, ans=0.2 2023-06-23 12:04:39,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1481082.0, ans=0.0 2023-06-23 12:04:50,670 INFO [train.py:996] (3/4) Epoch 9, batch 2900, loss[loss=0.2311, simple_loss=0.3393, pruned_loss=0.06149, over 19772.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3124, pruned_loss=0.08363, over 4289655.27 frames. 
], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:05:01,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-23 12:05:12,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1481202.0, ans=0.125 2023-06-23 12:05:56,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1481322.0, ans=0.2 2023-06-23 12:06:13,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1481322.0, ans=6.0 2023-06-23 12:06:27,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1481382.0, ans=0.125 2023-06-23 12:06:27,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1481382.0, ans=0.125 2023-06-23 12:06:30,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1481442.0, ans=0.05 2023-06-23 12:06:31,728 INFO [train.py:996] (3/4) Epoch 9, batch 2950, loss[loss=0.2378, simple_loss=0.3314, pruned_loss=0.07206, over 21819.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3128, pruned_loss=0.08324, over 4288810.25 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:06:32,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1481442.0, ans=0.0 2023-06-23 12:06:32,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1481442.0, ans=0.2 2023-06-23 12:06:43,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=15.0 2023-06-23 12:06:46,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=15.0 2023-06-23 12:06:47,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1481502.0, ans=0.2 2023-06-23 12:06:47,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1481502.0, ans=0.2 2023-06-23 12:06:53,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481502.0, ans=0.1 2023-06-23 12:06:55,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1481502.0, ans=0.125 2023-06-23 12:07:06,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481502.0, ans=0.1 2023-06-23 12:07:45,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1481622.0, ans=0.0 2023-06-23 12:07:48,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.526e+02 7.206e+02 1.005e+03 1.804e+03, threshold=1.441e+03, percent-clipped=1.0 2023-06-23 12:07:53,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1481682.0, ans=0.0 2023-06-23 12:07:59,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1481682.0, ans=0.125 2023-06-23 12:07:59,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1481682.0, ans=0.125 2023-06-23 12:08:08,687 INFO [train.py:996] (3/4) Epoch 9, batch 3000, loss[loss=0.2597, simple_loss=0.3338, pruned_loss=0.09283, over 21751.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3175, pruned_loss=0.08422, over 4291906.92 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:08:08,688 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 12:08:24,857 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2522, simple_loss=0.3459, pruned_loss=0.07924, over 1796401.00 frames. 2023-06-23 12:08:24,858 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 12:08:53,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-23 12:09:05,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1481802.0, ans=0.125 2023-06-23 12:09:53,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1481982.0, ans=0.125 2023-06-23 12:10:08,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1482042.0, ans=0.125 2023-06-23 12:10:09,600 INFO [train.py:996] (3/4) Epoch 9, batch 3050, loss[loss=0.2416, simple_loss=0.3371, pruned_loss=0.07304, over 21178.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3174, pruned_loss=0.08236, over 4291975.39 frames. 
], batch size: 548, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:10:11,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1482042.0, ans=0.0 2023-06-23 12:10:13,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1482042.0, ans=0.125 2023-06-23 12:11:05,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1482162.0, ans=0.1 2023-06-23 12:11:24,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.590e+02 6.164e+02 8.184e+02 1.174e+03 2.237e+03, threshold=1.637e+03, percent-clipped=13.0 2023-06-23 12:11:44,957 INFO [train.py:996] (3/4) Epoch 9, batch 3100, loss[loss=0.2241, simple_loss=0.3212, pruned_loss=0.06347, over 21697.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3199, pruned_loss=0.08309, over 4290788.86 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:13:02,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1482522.0, ans=0.125 2023-06-23 12:13:36,678 INFO [train.py:996] (3/4) Epoch 9, batch 3150, loss[loss=0.3034, simple_loss=0.3678, pruned_loss=0.1195, over 21422.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.321, pruned_loss=0.08334, over 4292777.60 frames. ], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:14:38,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1482822.0, ans=0.0 2023-06-23 12:14:47,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 5.985e+02 8.535e+02 1.297e+03 2.485e+03, threshold=1.707e+03, percent-clipped=14.0 2023-06-23 12:15:24,468 INFO [train.py:996] (3/4) Epoch 9, batch 3200, loss[loss=0.2457, simple_loss=0.3175, pruned_loss=0.08697, over 21240.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3227, pruned_loss=0.08347, over 4295050.12 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:15:57,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1483002.0, ans=0.125 2023-06-23 12:16:35,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-23 12:16:49,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1483182.0, ans=0.125 2023-06-23 12:16:59,990 INFO [train.py:996] (3/4) Epoch 9, batch 3250, loss[loss=0.2675, simple_loss=0.3181, pruned_loss=0.1084, over 21598.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3239, pruned_loss=0.08589, over 4297071.91 frames. 
], batch size: 415, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:17:06,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1483242.0, ans=0.5 2023-06-23 12:17:16,233 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:17:56,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1483422.0, ans=0.0 2023-06-23 12:18:19,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.506e+02 4.912e+02 6.771e+02 1.025e+03 2.208e+03, threshold=1.354e+03, percent-clipped=1.0 2023-06-23 12:18:21,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.44 vs. limit=15.0 2023-06-23 12:18:22,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1483482.0, ans=0.0 2023-06-23 12:18:23,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1483482.0, ans=0.015 2023-06-23 12:18:32,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1483482.0, ans=0.2 2023-06-23 12:18:41,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1483482.0, ans=0.125 2023-06-23 12:18:44,452 INFO [train.py:996] (3/4) Epoch 9, batch 3300, loss[loss=0.2534, simple_loss=0.3489, pruned_loss=0.079, over 21594.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3189, pruned_loss=0.08494, over 4295266.80 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:19:41,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1483722.0, ans=0.0 2023-06-23 12:20:03,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1483782.0, ans=0.07 2023-06-23 12:20:16,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-23 12:20:25,114 INFO [train.py:996] (3/4) Epoch 9, batch 3350, loss[loss=0.2438, simple_loss=0.3164, pruned_loss=0.08564, over 21783.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3212, pruned_loss=0.08568, over 4290647.24 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:20:38,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1483842.0, ans=0.125 2023-06-23 12:20:40,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483902.0, ans=0.1 2023-06-23 12:20:51,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1483902.0, ans=0.05 2023-06-23 12:20:54,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. 
limit=22.5 2023-06-23 12:21:05,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1483962.0, ans=0.125 2023-06-23 12:21:18,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1483962.0, ans=0.125 2023-06-23 12:21:31,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1484022.0, ans=0.0 2023-06-23 12:21:40,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.733e+02 6.031e+02 9.696e+02 1.341e+03 2.502e+03, threshold=1.939e+03, percent-clipped=21.0 2023-06-23 12:22:04,012 INFO [train.py:996] (3/4) Epoch 9, batch 3400, loss[loss=0.2449, simple_loss=0.3406, pruned_loss=0.07466, over 21808.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3214, pruned_loss=0.08629, over 4288512.97 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:22:15,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-23 12:23:18,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484322.0, ans=0.1 2023-06-23 12:23:37,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1484382.0, ans=0.0 2023-06-23 12:23:44,206 INFO [train.py:996] (3/4) Epoch 9, batch 3450, loss[loss=0.2344, simple_loss=0.3021, pruned_loss=0.08335, over 21361.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3162, pruned_loss=0.08531, over 4286832.97 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:44,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1484442.0, ans=0.07 2023-06-23 12:23:59,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1484502.0, ans=0.125 2023-06-23 12:25:01,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.990e+02 5.635e+02 8.079e+02 1.246e+03 2.546e+03, threshold=1.616e+03, percent-clipped=4.0 2023-06-23 12:25:21,096 INFO [train.py:996] (3/4) Epoch 9, batch 3500, loss[loss=0.2684, simple_loss=0.3532, pruned_loss=0.09184, over 21577.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3236, pruned_loss=0.08808, over 4288575.65 frames. ], batch size: 230, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:25:34,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484742.0, ans=0.1 2023-06-23 12:25:43,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1484802.0, ans=0.125 2023-06-23 12:25:46,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-23 12:25:47,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1484802.0, ans=0.2 2023-06-23 12:26:10,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.35 vs. 
limit=15.0 2023-06-23 12:26:26,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1484922.0, ans=0.1 2023-06-23 12:26:34,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1484922.0, ans=0.125 2023-06-23 12:26:51,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1484982.0, ans=0.125 2023-06-23 12:26:54,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1485042.0, ans=0.025 2023-06-23 12:26:55,736 INFO [train.py:996] (3/4) Epoch 9, batch 3550, loss[loss=0.2386, simple_loss=0.299, pruned_loss=0.08907, over 21753.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3256, pruned_loss=0.08962, over 4291573.04 frames. ], batch size: 102, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:27:21,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1485102.0, ans=0.0 2023-06-23 12:28:10,183 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.810e+02 5.500e+02 7.334e+02 1.032e+03 1.807e+03, threshold=1.467e+03, percent-clipped=3.0 2023-06-23 12:28:20,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485282.0, ans=0.1 2023-06-23 12:28:29,466 INFO [train.py:996] (3/4) Epoch 9, batch 3600, loss[loss=0.2128, simple_loss=0.265, pruned_loss=0.08031, over 21511.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3187, pruned_loss=0.08832, over 4291765.88 frames. ], batch size: 213, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:28:31,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1485342.0, ans=0.125 2023-06-23 12:28:48,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485342.0, ans=0.125 2023-06-23 12:29:47,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=12.0 2023-06-23 12:29:48,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1485522.0, ans=0.2 2023-06-23 12:30:04,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1485582.0, ans=0.2 2023-06-23 12:30:11,159 INFO [train.py:996] (3/4) Epoch 9, batch 3650, loss[loss=0.2257, simple_loss=0.2805, pruned_loss=0.0854, over 20264.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.321, pruned_loss=0.08872, over 4285236.62 frames. ], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:30:53,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1485702.0, ans=0.05 2023-06-23 12:31:32,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.09 vs. 
limit=22.5 2023-06-23 12:31:34,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.571e+02 7.883e+02 1.166e+03 2.519e+03, threshold=1.577e+03, percent-clipped=13.0 2023-06-23 12:31:44,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1485882.0, ans=0.0 2023-06-23 12:31:52,185 INFO [train.py:996] (3/4) Epoch 9, batch 3700, loss[loss=0.2401, simple_loss=0.3052, pruned_loss=0.08746, over 21797.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3224, pruned_loss=0.08891, over 4285251.55 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:31:54,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-23 12:33:02,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1486122.0, ans=0.125 2023-06-23 12:33:05,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1486122.0, ans=0.1 2023-06-23 12:33:41,725 INFO [train.py:996] (3/4) Epoch 9, batch 3750, loss[loss=0.2042, simple_loss=0.2795, pruned_loss=0.06445, over 21778.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3198, pruned_loss=0.08783, over 4289026.89 frames. ], batch size: 282, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:33:50,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.82 vs. limit=10.0 2023-06-23 12:34:32,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1486362.0, ans=0.07 2023-06-23 12:34:54,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.333e+02 7.689e+02 1.174e+03 2.476e+03, threshold=1.538e+03, percent-clipped=10.0 2023-06-23 12:35:13,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-23 12:35:22,199 INFO [train.py:996] (3/4) Epoch 9, batch 3800, loss[loss=0.232, simple_loss=0.3085, pruned_loss=0.07776, over 21696.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3158, pruned_loss=0.08497, over 4289048.97 frames. 
], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:35:25,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1486542.0, ans=0.125 2023-06-23 12:35:25,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1486542.0, ans=0.07 2023-06-23 12:35:41,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1486602.0, ans=0.0 2023-06-23 12:36:09,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1486662.0, ans=0.125 2023-06-23 12:36:13,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1486722.0, ans=0.0 2023-06-23 12:36:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1486782.0, ans=0.125 2023-06-23 12:36:56,376 INFO [train.py:996] (3/4) Epoch 9, batch 3850, loss[loss=0.1998, simple_loss=0.2644, pruned_loss=0.06763, over 20201.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3145, pruned_loss=0.08608, over 4274641.52 frames. ], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:37:18,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1486902.0, ans=0.125 2023-06-23 12:37:26,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1486902.0, ans=0.125 2023-06-23 12:38:03,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.513e+02 4.770e+02 6.158e+02 8.423e+02 1.897e+03, threshold=1.232e+03, percent-clipped=2.0 2023-06-23 12:38:25,958 INFO [train.py:996] (3/4) Epoch 9, batch 3900, loss[loss=0.2593, simple_loss=0.3352, pruned_loss=0.09173, over 21363.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3103, pruned_loss=0.08539, over 4272214.13 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:38:36,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-23 12:38:50,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1487202.0, ans=0.125 2023-06-23 12:39:11,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1487262.0, ans=0.0 2023-06-23 12:39:14,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1487262.0, ans=0.04949747468305833 2023-06-23 12:40:09,696 INFO [train.py:996] (3/4) Epoch 9, batch 3950, loss[loss=0.2225, simple_loss=0.3111, pruned_loss=0.067, over 21636.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.311, pruned_loss=0.08321, over 4280286.09 frames. 
], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:40:15,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1487442.0, ans=0.015 2023-06-23 12:41:02,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1487622.0, ans=0.125 2023-06-23 12:41:16,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.422e+02 5.095e+02 8.161e+02 1.017e+03 2.071e+03, threshold=1.632e+03, percent-clipped=17.0 2023-06-23 12:41:17,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1487682.0, ans=0.2 2023-06-23 12:41:49,063 INFO [train.py:996] (3/4) Epoch 9, batch 4000, loss[loss=0.198, simple_loss=0.2668, pruned_loss=0.06455, over 21555.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3045, pruned_loss=0.07955, over 4273961.17 frames. ], batch size: 391, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:42:41,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1487922.0, ans=0.0 2023-06-23 12:42:44,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1487922.0, ans=0.125 2023-06-23 12:42:52,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1487922.0, ans=0.05 2023-06-23 12:43:29,585 INFO [train.py:996] (3/4) Epoch 9, batch 4050, loss[loss=0.2421, simple_loss=0.3223, pruned_loss=0.08094, over 21446.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3035, pruned_loss=0.07782, over 4267429.82 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:44:20,275 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:44:25,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488222.0, ans=0.1 2023-06-23 12:44:31,442 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:44:48,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-23 12:44:48,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.877e+02 6.874e+02 9.034e+02 2.185e+03, threshold=1.375e+03, percent-clipped=7.0 2023-06-23 12:44:52,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-23 12:45:09,743 INFO [train.py:996] (3/4) Epoch 9, batch 4100, loss[loss=0.2238, simple_loss=0.3029, pruned_loss=0.07231, over 21801.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.306, pruned_loss=0.07884, over 4279505.55 frames. 
], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:45:21,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1488342.0, ans=0.07 2023-06-23 12:45:39,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1488402.0, ans=0.0 2023-06-23 12:45:39,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1488402.0, ans=0.02 2023-06-23 12:46:23,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488522.0, ans=0.1 2023-06-23 12:46:31,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488582.0, ans=0.1 2023-06-23 12:46:37,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1488582.0, ans=0.04949747468305833 2023-06-23 12:46:51,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1488582.0, ans=0.125 2023-06-23 12:46:54,899 INFO [train.py:996] (3/4) Epoch 9, batch 4150, loss[loss=0.1928, simple_loss=0.2563, pruned_loss=0.06469, over 21299.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3073, pruned_loss=0.07666, over 4279739.20 frames. ], batch size: 551, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:46:55,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1488642.0, ans=0.2 2023-06-23 12:46:57,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1488642.0, ans=0.04949747468305833 2023-06-23 12:46:58,417 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:47:04,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1488642.0, ans=0.0 2023-06-23 12:47:14,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1488702.0, ans=0.09899494936611666 2023-06-23 12:47:27,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1488762.0, ans=0.0 2023-06-23 12:47:34,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1488762.0, ans=0.125 2023-06-23 12:47:39,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488762.0, ans=0.1 2023-06-23 12:47:49,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1488822.0, ans=0.0 2023-06-23 12:48:10,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.949e+02 7.437e+02 1.328e+03 3.049e+03, threshold=1.487e+03, percent-clipped=21.0 2023-06-23 12:48:32,515 INFO [train.py:996] (3/4) Epoch 9, batch 4200, loss[loss=0.2302, simple_loss=0.2865, pruned_loss=0.08699, over 21699.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3067, pruned_loss=0.07535, over 4269104.10 frames. 
], batch size: 112, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:48:49,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1489002.0, ans=0.125 2023-06-23 12:48:49,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1489002.0, ans=0.125 2023-06-23 12:48:49,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1489002.0, ans=0.2 2023-06-23 12:49:02,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1489002.0, ans=0.125 2023-06-23 12:49:32,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1489122.0, ans=0.1 2023-06-23 12:50:08,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1489182.0, ans=0.125 2023-06-23 12:50:13,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1489242.0, ans=0.025 2023-06-23 12:50:14,809 INFO [train.py:996] (3/4) Epoch 9, batch 4250, loss[loss=0.2453, simple_loss=0.3203, pruned_loss=0.08518, over 21593.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3136, pruned_loss=0.07752, over 4268869.78 frames. ], batch size: 230, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:50:15,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489242.0, ans=0.1 2023-06-23 12:50:58,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-23 12:51:43,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 6.108e+02 8.578e+02 1.174e+03 2.664e+03, threshold=1.716e+03, percent-clipped=12.0 2023-06-23 12:51:54,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1489482.0, ans=0.125 2023-06-23 12:51:58,831 INFO [train.py:996] (3/4) Epoch 9, batch 4300, loss[loss=0.2422, simple_loss=0.3479, pruned_loss=0.0682, over 21629.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3198, pruned_loss=0.07962, over 4274909.97 frames. ], batch size: 389, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:52:00,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1489542.0, ans=0.0 2023-06-23 12:52:52,494 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:53:40,717 INFO [train.py:996] (3/4) Epoch 9, batch 4350, loss[loss=0.2344, simple_loss=0.2968, pruned_loss=0.08605, over 21850.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3193, pruned_loss=0.07988, over 4262297.83 frames. 
], batch size: 98, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:53:48,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1489842.0, ans=6.0 2023-06-23 12:54:01,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1489842.0, ans=0.2 2023-06-23 12:54:16,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489902.0, ans=0.1 2023-06-23 12:55:01,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490022.0, ans=0.1 2023-06-23 12:55:04,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1490082.0, ans=0.2 2023-06-23 12:55:05,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 5.616e+02 8.582e+02 1.448e+03 3.184e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 12:55:30,740 INFO [train.py:996] (3/4) Epoch 9, batch 4400, loss[loss=0.2179, simple_loss=0.3065, pruned_loss=0.06467, over 21652.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3159, pruned_loss=0.07927, over 4269955.66 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:55:40,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-23 12:56:26,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-23 12:56:29,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1490322.0, ans=0.0 2023-06-23 12:57:12,837 INFO [train.py:996] (3/4) Epoch 9, batch 4450, loss[loss=0.3584, simple_loss=0.4732, pruned_loss=0.1218, over 19711.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.324, pruned_loss=0.0811, over 4268568.68 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:58:39,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 6.461e+02 1.015e+03 1.659e+03 5.524e+03, threshold=2.029e+03, percent-clipped=20.0 2023-06-23 12:58:49,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1490682.0, ans=0.125 2023-06-23 12:58:49,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1490682.0, ans=0.125 2023-06-23 12:58:54,092 INFO [train.py:996] (3/4) Epoch 9, batch 4500, loss[loss=0.2538, simple_loss=0.3195, pruned_loss=0.09408, over 20805.00 frames. ], tot_loss[loss=0.247, simple_loss=0.326, pruned_loss=0.08398, over 4275231.12 frames. ], batch size: 611, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:59:24,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-23 13:00:04,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=15.0 2023-06-23 13:00:42,571 INFO [train.py:996] (3/4) Epoch 9, batch 4550, loss[loss=0.2863, simple_loss=0.355, pruned_loss=0.1088, over 21359.00 frames. 
], tot_loss[loss=0.2478, simple_loss=0.3274, pruned_loss=0.08409, over 4271956.08 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 13:02:03,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 4.940e+02 6.297e+02 8.588e+02 1.962e+03, threshold=1.259e+03, percent-clipped=0.0 2023-06-23 13:02:27,866 INFO [train.py:996] (3/4) Epoch 9, batch 4600, loss[loss=0.2549, simple_loss=0.3231, pruned_loss=0.09329, over 21431.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3275, pruned_loss=0.08528, over 4276799.57 frames. ], batch size: 211, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:02:51,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-23 13:03:35,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1491582.0, ans=0.0 2023-06-23 13:04:00,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1491642.0, ans=0.2 2023-06-23 13:04:01,916 INFO [train.py:996] (3/4) Epoch 9, batch 4650, loss[loss=0.1725, simple_loss=0.247, pruned_loss=0.04897, over 21776.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3208, pruned_loss=0.08309, over 4285528.83 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:04:33,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-23 13:04:33,906 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:04:51,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1491762.0, ans=0.04949747468305833 2023-06-23 13:05:09,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1491822.0, ans=0.125 2023-06-23 13:05:16,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.483e+02 4.910e+02 6.100e+02 8.336e+02 1.525e+03, threshold=1.220e+03, percent-clipped=3.0 2023-06-23 13:05:17,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=22.5 2023-06-23 13:05:35,277 INFO [train.py:996] (3/4) Epoch 9, batch 4700, loss[loss=0.2188, simple_loss=0.2794, pruned_loss=0.07915, over 21456.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3107, pruned_loss=0.08063, over 4276040.07 frames. ], batch size: 473, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:06:02,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1492002.0, ans=0.0 2023-06-23 13:06:14,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=15.0 2023-06-23 13:07:04,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1492182.0, ans=0.2 2023-06-23 13:07:09,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1492182.0, ans=0.125 2023-06-23 13:07:13,897 INFO [train.py:996] (3/4) Epoch 9, batch 4750, loss[loss=0.2111, simple_loss=0.2706, pruned_loss=0.07584, over 21739.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3059, pruned_loss=0.08132, over 4282682.54 frames. ], batch size: 283, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:07:25,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1492242.0, ans=0.125 2023-06-23 13:07:38,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-23 13:08:02,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1492362.0, ans=0.125 2023-06-23 13:08:02,860 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:08:09,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1492362.0, ans=0.125 2023-06-23 13:08:34,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.263e+02 4.594e+02 6.434e+02 8.978e+02 1.748e+03, threshold=1.287e+03, percent-clipped=12.0 2023-06-23 13:08:51,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1492482.0, ans=0.5 2023-06-23 13:08:54,107 INFO [train.py:996] (3/4) Epoch 9, batch 4800, loss[loss=0.2387, simple_loss=0.3233, pruned_loss=0.07705, over 21815.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3066, pruned_loss=0.08143, over 4286342.30 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:09:02,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1492542.0, ans=0.125 2023-06-23 13:09:34,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-23 13:09:52,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1492722.0, ans=0.07 2023-06-23 13:09:59,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1492722.0, ans=0.125 2023-06-23 13:10:10,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1492782.0, ans=0.125 2023-06-23 13:10:21,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1492782.0, ans=0.125 2023-06-23 13:10:31,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1492842.0, ans=0.125 2023-06-23 13:10:32,157 INFO [train.py:996] (3/4) Epoch 9, batch 4850, loss[loss=0.2327, simple_loss=0.3031, pruned_loss=0.08118, over 21840.00 frames. 
], tot_loss[loss=0.2337, simple_loss=0.3061, pruned_loss=0.08071, over 4293155.86 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:10:34,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1492842.0, ans=0.2 2023-06-23 13:10:42,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1492842.0, ans=0.0 2023-06-23 13:10:53,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1492902.0, ans=0.2 2023-06-23 13:11:28,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-23 13:11:32,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1493022.0, ans=0.0 2023-06-23 13:11:32,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1493022.0, ans=0.2 2023-06-23 13:11:52,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 5.446e+02 6.914e+02 1.031e+03 2.241e+03, threshold=1.383e+03, percent-clipped=12.0 2023-06-23 13:12:07,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-23 13:12:10,715 INFO [train.py:996] (3/4) Epoch 9, batch 4900, loss[loss=0.2494, simple_loss=0.3423, pruned_loss=0.07828, over 21663.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3089, pruned_loss=0.08176, over 4300165.98 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:13:06,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-23 13:13:30,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1493322.0, ans=0.125 2023-06-23 13:13:50,326 INFO [train.py:996] (3/4) Epoch 9, batch 4950, loss[loss=0.2085, simple_loss=0.3031, pruned_loss=0.05692, over 21605.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3136, pruned_loss=0.08103, over 4284699.33 frames. 
], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:14:00,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1493442.0, ans=0.125 2023-06-23 13:14:41,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1493562.0, ans=0.2 2023-06-23 13:14:41,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1493562.0, ans=0.0 2023-06-23 13:14:52,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1493622.0, ans=0.0 2023-06-23 13:14:59,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1493622.0, ans=0.125 2023-06-23 13:15:15,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1493682.0, ans=0.125 2023-06-23 13:15:16,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.885e+02 7.048e+02 1.090e+03 2.586e+03, threshold=1.410e+03, percent-clipped=12.0 2023-06-23 13:15:29,175 INFO [train.py:996] (3/4) Epoch 9, batch 5000, loss[loss=0.235, simple_loss=0.3107, pruned_loss=0.07961, over 21808.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3127, pruned_loss=0.07762, over 4287557.74 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:15:38,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1493742.0, ans=0.125 2023-06-23 13:15:45,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1493802.0, ans=0.125 2023-06-23 13:15:51,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1493802.0, ans=0.125 2023-06-23 13:16:21,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1493862.0, ans=0.1 2023-06-23 13:16:34,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1493922.0, ans=0.125 2023-06-23 13:16:44,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1493922.0, ans=15.0 2023-06-23 13:16:47,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-23 13:17:00,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-23 13:17:05,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-23 13:17:07,592 INFO [train.py:996] (3/4) Epoch 9, batch 5050, loss[loss=0.3059, simple_loss=0.3503, pruned_loss=0.1307, over 21780.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3124, pruned_loss=0.07966, over 4296513.68 frames. 
], batch size: 508, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:17:31,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1494102.0, ans=0.0 2023-06-23 13:17:56,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1494162.0, ans=0.125 2023-06-23 13:18:04,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1494222.0, ans=0.125 2023-06-23 13:18:28,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.800e+02 5.012e+02 6.560e+02 1.020e+03 2.026e+03, threshold=1.312e+03, percent-clipped=12.0 2023-06-23 13:18:45,146 INFO [train.py:996] (3/4) Epoch 9, batch 5100, loss[loss=0.1905, simple_loss=0.2709, pruned_loss=0.05508, over 21324.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3095, pruned_loss=0.07933, over 4298763.94 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:18:47,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1494342.0, ans=0.1 2023-06-23 13:19:09,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1494402.0, ans=0.125 2023-06-23 13:19:43,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1494522.0, ans=0.0 2023-06-23 13:19:52,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1494522.0, ans=0.125 2023-06-23 13:20:25,858 INFO [train.py:996] (3/4) Epoch 9, batch 5150, loss[loss=0.287, simple_loss=0.3557, pruned_loss=0.1091, over 21705.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3085, pruned_loss=0.08022, over 4299430.07 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:20:26,322 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:20:28,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=15.0 2023-06-23 13:20:41,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1494702.0, ans=0.2 2023-06-23 13:21:13,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1494762.0, ans=0.0 2023-06-23 13:21:13,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1494762.0, ans=0.2 2023-06-23 13:21:18,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1494762.0, ans=0.125 2023-06-23 13:21:18,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1494762.0, ans=0.125 2023-06-23 13:21:54,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 5.169e+02 7.852e+02 1.261e+03 2.554e+03, threshold=1.570e+03, percent-clipped=23.0 2023-06-23 13:21:57,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1494882.0, ans=0.125 2023-06-23 13:22:07,025 INFO [train.py:996] (3/4) Epoch 9, batch 5200, loss[loss=0.2932, simple_loss=0.3895, pruned_loss=0.09848, over 21863.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3127, pruned_loss=0.08103, over 4294507.90 frames. ], batch size: 371, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:23:00,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1495062.0, ans=0.2 2023-06-23 13:23:45,408 INFO [train.py:996] (3/4) Epoch 9, batch 5250, loss[loss=0.232, simple_loss=0.301, pruned_loss=0.08156, over 21335.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3187, pruned_loss=0.08014, over 4290684.17 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:23:56,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1495242.0, ans=0.125 2023-06-23 13:24:16,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-23 13:24:19,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1495302.0, ans=0.125 2023-06-23 13:24:53,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1495422.0, ans=0.2 2023-06-23 13:25:10,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.544e+02 5.638e+02 7.794e+02 1.189e+03 2.542e+03, threshold=1.559e+03, percent-clipped=12.0 2023-06-23 13:25:11,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-23 13:25:27,943 INFO [train.py:996] (3/4) Epoch 9, batch 5300, loss[loss=0.2394, simple_loss=0.2993, pruned_loss=0.08978, over 21829.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3178, pruned_loss=0.08076, over 4294784.09 frames. 
2023-06-23 13:25:47,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495542.0, ans=0.1
2023-06-23 13:26:12,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1495662.0, ans=0.125
2023-06-23 13:26:35,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1495722.0, ans=0.125
2023-06-23 13:26:49,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1495782.0, ans=0.125
2023-06-23 13:26:54,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1495782.0, ans=0.09899494936611666
2023-06-23 13:27:01,883 INFO [train.py:996] (3/4) Epoch 9, batch 5350, loss[loss=0.2866, simple_loss=0.3974, pruned_loss=0.08788, over 20722.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3168, pruned_loss=0.08148, over 4297374.72 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 32.0
2023-06-23 13:27:40,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1495902.0, ans=10.0
2023-06-23 13:27:51,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1495962.0, ans=0.0
2023-06-23 13:28:17,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0
2023-06-23 13:28:28,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 6.687e+02 9.729e+02 1.336e+03 3.211e+03, threshold=1.946e+03, percent-clipped=15.0
2023-06-23 13:28:36,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1496082.0, ans=0.125
2023-06-23 13:28:43,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.56 vs. limit=22.5
2023-06-23 13:28:46,028 INFO [train.py:996] (3/4) Epoch 9, batch 5400, loss[loss=0.2275, simple_loss=0.2883, pruned_loss=0.08333, over 21473.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3164, pruned_loss=0.08252, over 4294803.71 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 32.0
2023-06-23 13:28:51,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496142.0, ans=0.1
2023-06-23 13:29:10,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0
2023-06-23 13:29:21,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1496262.0, ans=0.125
2023-06-23 13:30:00,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1496322.0, ans=0.125
2023-06-23 13:30:03,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1496382.0, ans=0.125
2023-06-23 13:30:19,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0
2023-06-23 13:30:30,290 INFO [train.py:996] (3/4) Epoch 9, batch 5450, loss[loss=0.2746, simple_loss=0.3653, pruned_loss=0.09194, over 21654.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3185, pruned_loss=0.08245, over 4297344.42 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:31:12,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1496562.0, ans=0.07
2023-06-23 13:31:27,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496622.0, ans=0.1
2023-06-23 13:31:53,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.097e+02 7.378e+02 1.209e+03 3.523e+03, threshold=1.476e+03, percent-clipped=4.0
2023-06-23 13:32:09,863 INFO [train.py:996] (3/4) Epoch 9, batch 5500, loss[loss=0.208, simple_loss=0.2976, pruned_loss=0.05926, over 21618.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3215, pruned_loss=0.07982, over 4285436.75 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:32:15,333 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 13:32:15,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1496742.0, ans=0.125
2023-06-23 13:32:18,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1496742.0, ans=0.125
2023-06-23 13:32:23,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1496742.0, ans=0.0
2023-06-23 13:32:31,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496802.0, ans=0.1
2023-06-23 13:32:52,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1496862.0, ans=0.2
2023-06-23 13:33:19,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1496922.0, ans=0.0
2023-06-23 13:33:37,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0
2023-06-23 13:33:50,878 INFO [train.py:996] (3/4) Epoch 9, batch 5550, loss[loss=0.207, simple_loss=0.3051, pruned_loss=0.0544, over 21577.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3206, pruned_loss=0.07617, over 4287813.25 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0
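The per-batch train.py records above all share one format: the loss on the current batch, the running tot_loss, the batch size in cuts, the learning rate, and the AMP gradient scale. Training progress can therefore be recovered from the raw log with a single regular expression; a small sketch (the regex targets exactly the fields shown above, and the file handling is illustrative):

import re

# Matches records like:
# "Epoch 9, batch 5500, loss[...], tot_loss[loss=0.2406, ...], batch size: 230, lr: 3.35e-03, ..."
RECORD = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<tot_loss>[\d.]+), .*?"
    r"batch size: (?P<bs>\d+), lr: (?P<lr>[\d.e-]+)"
)

def tot_loss_curve(log_path: str):
    """Yield (epoch, batch, tot_loss, lr) tuples from a training log like this one."""
    with open(log_path) as f:
        for line in f:
            m = RECORD.search(line)
            if m:
                yield (int(m["epoch"]), int(m["batch"]),
                       float(m["tot_loss"]), float(m["lr"]))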
2023-06-23 13:33:54,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1497042.0, ans=0.0
2023-06-23 13:34:30,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0
2023-06-23 13:34:50,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1497222.0, ans=0.125
2023-06-23 13:35:17,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.782e+02 7.322e+02 1.097e+03 2.363e+03, threshold=1.464e+03, percent-clipped=11.0
2023-06-23 13:35:38,019 INFO [train.py:996] (3/4) Epoch 9, batch 5600, loss[loss=0.2523, simple_loss=0.342, pruned_loss=0.08126, over 21611.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3168, pruned_loss=0.07335, over 4282181.16 frames. ], batch size: 263, lr: 3.35e-03, grad_scale: 32.0
2023-06-23 13:36:11,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1497402.0, ans=0.125
2023-06-23 13:36:11,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497402.0, ans=0.1
2023-06-23 13:36:55,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1497582.0, ans=10.0
2023-06-23 13:37:06,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1497582.0, ans=0.0
2023-06-23 13:37:10,928 INFO [train.py:996] (3/4) Epoch 9, batch 5650, loss[loss=0.2348, simple_loss=0.3137, pruned_loss=0.07797, over 21910.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3208, pruned_loss=0.07682, over 4290456.04 frames. ], batch size: 107, lr: 3.35e-03, grad_scale: 32.0
2023-06-23 13:37:33,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1497702.0, ans=0.2
2023-06-23 13:38:34,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.906e+02 7.888e+02 1.290e+03 2.997e+03, threshold=1.578e+03, percent-clipped=20.0
2023-06-23 13:38:46,230 INFO [train.py:996] (3/4) Epoch 9, batch 5700, loss[loss=0.2277, simple_loss=0.2944, pruned_loss=0.08047, over 20096.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3195, pruned_loss=0.07791, over 4284837.13 frames. ], batch size: 702, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:40:22,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498182.0, ans=0.1
2023-06-23 13:40:31,245 INFO [train.py:996] (3/4) Epoch 9, batch 5750, loss[loss=0.2202, simple_loss=0.2892, pruned_loss=0.07558, over 21213.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3153, pruned_loss=0.07524, over 4286320.11 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:41:05,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1498302.0, ans=0.125
2023-06-23 13:41:52,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.651e+02 7.542e+02 1.104e+03 3.145e+03, threshold=1.508e+03, percent-clipped=9.0
2023-06-23 13:42:06,745 INFO [train.py:996] (3/4) Epoch 9, batch 5800, loss[loss=0.2783, simple_loss=0.3826, pruned_loss=0.08706, over 20804.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3138, pruned_loss=0.0733, over 4283067.39 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:42:39,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1498602.0, ans=0.2
2023-06-23 13:43:52,444 INFO [train.py:996] (3/4) Epoch 9, batch 5850, loss[loss=0.1843, simple_loss=0.2889, pruned_loss=0.03983, over 21800.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.31, pruned_loss=0.06846, over 4270041.06 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 8.0
2023-06-23 13:44:49,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1499022.0, ans=0.125
2023-06-23 13:45:18,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.169e+02 5.972e+02 8.890e+02 1.873e+03, threshold=1.194e+03, percent-clipped=6.0
2023-06-23 13:45:31,294 INFO [train.py:996] (3/4) Epoch 9, batch 5900, loss[loss=0.1954, simple_loss=0.2915, pruned_loss=0.04965, over 21677.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.3027, pruned_loss=0.06368, over 4265388.98 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 8.0
2023-06-23 13:46:40,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1499322.0, ans=0.125
2023-06-23 13:46:50,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.02 vs. limit=6.0
2023-06-23 13:47:01,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5
2023-06-23 13:47:10,303 INFO [train.py:996] (3/4) Epoch 9, batch 5950, loss[loss=0.2166, simple_loss=0.2893, pruned_loss=0.0719, over 21842.00 frames. ], tot_loss[loss=0.217, simple_loss=0.3025, pruned_loss=0.06578, over 4267114.51 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 8.0
2023-06-23 13:47:14,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1499442.0, ans=0.125
2023-06-23 13:47:17,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1499442.0, ans=0.125
2023-06-23 13:48:01,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1499562.0, ans=0.2
2023-06-23 13:48:16,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499622.0, ans=0.1
2023-06-23 13:48:19,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0
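The Whitening lines report, per module, a statistic of the activations ("metric") against a target limit; the limits themselves are scheduled, as the whitening_limit entries among the ScheduledFloat records show. As a rough illustration of what such a metric can look like (this particular formula is an assumption for exposition, not lifted from scaling.py): one scale-invariant choice is the ratio of the mean squared eigenvalue of the feature covariance to the squared mean eigenvalue, which is 1.0 for perfectly white features and grows as variance concentrates in a few directions.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Scale-invariant measure of how far the
    feature covariance is from a multiple of the identity (1.0 = white).
    Illustrative only; assumes channels split evenly into num_groups."""
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (g, n, c/g)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / n                   # (g, c/g, c/g)
    # E[lambda^2] / (E[lambda])^2 via trace identities:
    # sum(lambda) = trace(cov), sum(lambda^2) = ||cov||_F^2.
    dim = cov.shape[-1]
    mean_sq = (cov ** 2).sum(dim=(1, 2)) / dim                     # E[lambda^2]
    sq_mean = (cov.diagonal(dim1=1, dim2=2).sum(-1) / dim) ** 2    # (E[lambda])^2
    return (mean_sq / sq_mean).mean().item()

x = torch.randn(1000, 256)
print(whitening_metric(x))  # close to 1 for white features; sampling noise pushes it slightly above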
2023-06-23 13:48:22,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1499622.0, ans=0.125
2023-06-23 13:48:38,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1499682.0, ans=0.125
2023-06-23 13:48:40,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.592e+02 7.986e+02 1.183e+03 2.385e+03, threshold=1.597e+03, percent-clipped=25.0
2023-06-23 13:48:48,896 INFO [train.py:996] (3/4) Epoch 9, batch 6000, loss[loss=0.2157, simple_loss=0.2753, pruned_loss=0.07799, over 21508.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2986, pruned_loss=0.06876, over 4261350.84 frames. ], batch size: 391, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:48:48,896 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 13:49:10,219 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2648, simple_loss=0.3557, pruned_loss=0.08691, over 1796401.00 frames.
2023-06-23 13:49:10,219 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-23 13:49:17,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1499742.0, ans=0.125
2023-06-23 13:49:52,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1499862.0, ans=0.125
2023-06-23 13:50:14,041 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 13:50:19,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0
2023-06-23 13:50:50,962 INFO [train.py:996] (3/4) Epoch 9, batch 6050, loss[loss=0.2178, simple_loss=0.2754, pruned_loss=0.08006, over 21206.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2943, pruned_loss=0.07098, over 4257552.64 frames. ], batch size: 159, lr: 3.35e-03, grad_scale: 16.0
2023-06-23 13:50:56,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1500042.0, ans=0.0
2023-06-23 13:51:05,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1500102.0, ans=0.2
2023-06-23 13:51:13,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1500102.0, ans=0.125
2023-06-23 13:52:04,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5
2023-06-23 13:52:15,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 5.140e+02 6.887e+02 9.775e+02 3.553e+03, threshold=1.377e+03, percent-clipped=5.0
2023-06-23 13:52:28,819 INFO [train.py:996] (3/4) Epoch 9, batch 6100, loss[loss=0.2353, simple_loss=0.3004, pruned_loss=0.08506, over 21324.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2926, pruned_loss=0.06922, over 4262628.99 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 13:52:58,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.07 vs. limit=12.0
2023-06-23 13:53:23,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1500462.0, ans=0.0
2023-06-23 13:54:00,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1500642.0, ans=0.0
2023-06-23 13:54:01,334 INFO [train.py:996] (3/4) Epoch 9, batch 6150, loss[loss=0.2093, simple_loss=0.2894, pruned_loss=0.06461, over 21415.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2957, pruned_loss=0.07221, over 4263997.09 frames. ], batch size: 212, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 13:54:11,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1500642.0, ans=0.125
2023-06-23 13:54:15,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0
2023-06-23 13:54:23,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1500702.0, ans=0.0
2023-06-23 13:54:55,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.04 vs. limit=6.0
2023-06-23 13:55:13,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1500822.0, ans=0.05
2023-06-23 13:55:14,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0
2023-06-23 13:55:32,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.545e+02 5.300e+02 7.269e+02 1.178e+03 2.947e+03, threshold=1.454e+03, percent-clipped=13.0
2023-06-23 13:55:46,197 INFO [train.py:996] (3/4) Epoch 9, batch 6200, loss[loss=0.2357, simple_loss=0.3008, pruned_loss=0.08531, over 21392.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3003, pruned_loss=0.07434, over 4270450.80 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 13:55:46,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1500942.0, ans=0.2
2023-06-23 13:55:55,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1500942.0, ans=0.125
2023-06-23 13:56:16,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1501002.0, ans=0.125
2023-06-23 13:57:04,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5
2023-06-23 13:57:05,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0
2023-06-23 13:57:08,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1501182.0, ans=0.0
2023-06-23 13:57:25,608 INFO [train.py:996] (3/4) Epoch 9, batch 6250, loss[loss=0.168, simple_loss=0.2357, pruned_loss=0.05012, over 17026.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3082, pruned_loss=0.07498, over 4271324.86 frames. ], batch size: 66, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 13:57:30,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1501242.0, ans=0.125
2023-06-23 13:57:35,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1501242.0, ans=0.125
2023-06-23 13:57:40,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1501302.0, ans=0.125
2023-06-23 13:57:46,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1501302.0, ans=0.125
2023-06-23 13:58:27,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1501422.0, ans=0.0
2023-06-23 13:58:50,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1501482.0, ans=0.125
2023-06-23 13:58:56,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.754e+02 9.579e+02 1.636e+03 2.645e+03, threshold=1.916e+03, percent-clipped=27.0
2023-06-23 13:59:04,187 INFO [train.py:996] (3/4) Epoch 9, batch 6300, loss[loss=0.2693, simple_loss=0.3344, pruned_loss=0.1021, over 21721.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3119, pruned_loss=0.07431, over 4275117.92 frames. ], batch size: 507, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 13:59:09,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1501542.0, ans=0.125
2023-06-23 14:00:06,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501722.0, ans=0.125
2023-06-23 14:00:07,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1501722.0, ans=0.0
2023-06-23 14:00:49,554 INFO [train.py:996] (3/4) Epoch 9, batch 6350, loss[loss=0.24, simple_loss=0.3083, pruned_loss=0.08589, over 21365.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3153, pruned_loss=0.07858, over 4280078.83 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:01:33,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1501962.0, ans=0.0
2023-06-23 14:01:56,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1502022.0, ans=0.0
2023-06-23 14:02:05,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0
2023-06-23 14:02:17,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1502082.0, ans=0.0
2023-06-23 14:02:23,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 6.300e+02 8.863e+02 1.224e+03 2.908e+03, threshold=1.773e+03, percent-clipped=5.0
2023-06-23 14:02:32,219 INFO [train.py:996] (3/4) Epoch 9, batch 6400, loss[loss=0.277, simple_loss=0.3491, pruned_loss=0.1024, over 21779.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3195, pruned_loss=0.08153, over 4281305.78 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:03:07,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1502202.0, ans=0.07
2023-06-23 14:03:13,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1502262.0, ans=0.0
2023-06-23 14:03:58,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1502382.0, ans=0.125
2023-06-23 14:04:10,688 INFO [train.py:996] (3/4) Epoch 9, batch 6450, loss[loss=0.2098, simple_loss=0.282, pruned_loss=0.06876, over 21436.00 frames. ], tot_loss[loss=0.24, simple_loss=0.32, pruned_loss=0.08005, over 4284313.93 frames. ], batch size: 131, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:04:15,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1502442.0, ans=0.125
2023-06-23 14:05:42,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.544e+02 7.373e+02 1.174e+03 2.232e+03, threshold=1.475e+03, percent-clipped=4.0
2023-06-23 14:05:51,128 INFO [train.py:996] (3/4) Epoch 9, batch 6500, loss[loss=0.1976, simple_loss=0.3, pruned_loss=0.04763, over 21781.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3138, pruned_loss=0.07875, over 4279717.61 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:07:06,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1502922.0, ans=0.035
2023-06-23 14:07:15,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1502982.0, ans=0.0
2023-06-23 14:07:35,331 INFO [train.py:996] (3/4) Epoch 9, batch 6550, loss[loss=0.2054, simple_loss=0.2821, pruned_loss=0.0643, over 21602.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.311, pruned_loss=0.07733, over 4270284.83 frames. ], batch size: 230, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:08:03,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1503102.0, ans=0.125
2023-06-23 14:08:44,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1503222.0, ans=0.0
2023-06-23 14:09:02,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.667e+02 5.627e+02 7.547e+02 1.040e+03 2.189e+03, threshold=1.509e+03, percent-clipped=8.0
2023-06-23 14:09:15,124 INFO [train.py:996] (3/4) Epoch 9, batch 6600, loss[loss=0.217, simple_loss=0.2837, pruned_loss=0.07517, over 21623.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.306, pruned_loss=0.07741, over 4269701.47 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:09:17,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1503342.0, ans=0.0
2023-06-23 14:09:23,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1503342.0, ans=0.035
2023-06-23 14:09:28,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1503342.0, ans=0.95
2023-06-23 14:09:42,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503402.0, ans=0.1
2023-06-23 14:09:45,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0
2023-06-23 14:09:53,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=22.5
2023-06-23 14:10:55,767 INFO [train.py:996] (3/4) Epoch 9, batch 6650, loss[loss=0.2098, simple_loss=0.2749, pruned_loss=0.07234, over 21558.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3006, pruned_loss=0.07497, over 4267459.34 frames. ], batch size: 391, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:10:58,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0
2023-06-23 14:11:04,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503642.0, ans=0.1
2023-06-23 14:11:44,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1503762.0, ans=0.0
2023-06-23 14:11:49,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1503762.0, ans=0.125
2023-06-23 14:12:15,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.29 vs. limit=15.0
2023-06-23 14:12:29,794 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 5.915e+02 8.562e+02 1.227e+03 3.234e+03, threshold=1.712e+03, percent-clipped=18.0
2023-06-23 14:12:35,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1503942.0, ans=0.0
2023-06-23 14:12:36,189 INFO [train.py:996] (3/4) Epoch 9, batch 6700, loss[loss=0.2129, simple_loss=0.2715, pruned_loss=0.07713, over 21749.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2946, pruned_loss=0.07452, over 4265290.52 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:13:36,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1504122.0, ans=0.125
2023-06-23 14:13:43,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1504122.0, ans=0.2
2023-06-23 14:14:14,334 INFO [train.py:996] (3/4) Epoch 9, batch 6750, loss[loss=0.2219, simple_loss=0.2985, pruned_loss=0.07266, over 21917.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2953, pruned_loss=0.07559, over 4267132.41 frames. ], batch size: 124, lr: 3.34e-03, grad_scale: 8.0
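The optim.py lines summarize, per logging interval, five ascending statistics of recent parameter-gradient norms (apparently min, the three quartiles, and max), the clipping threshold in force, and the fraction of batches that were clipped. A rough sketch of that bookkeeping; the median-based threshold rule and window size are assumptions for illustration, not a transcription of optim.py:

from collections import deque

import torch

class GradNormClipper:
    """Track recent gradient norms, clip against clipping_scale * median,
    and report quartiles / percent-clipped like the optim.py lines above.
    The threshold rule and window are illustrative assumptions."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_steps = 0
        self.num_clipped = 0

    def step(self, params: list) -> float:
        grads = [p.grad.reshape(-1) for p in params if p.grad is not None]
        norm = torch.cat(grads).norm().item()
        self.norms.append(norm)
        threshold = self.clipping_scale * sorted(self.norms)[len(self.norms) // 2]
        self.num_steps += 1
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)  # scale grads down to the threshold
        return norm

    def summary(self) -> str:
        qs = torch.quantile(torch.tensor(list(self.norms)),
                            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        pct = 100.0 * self.num_clipped / max(1, self.num_steps)
        return (f"grad-norm quartiles {' '.join(f'{q.item():.3e}' for q in qs)}, "
                f"percent-clipped={pct:.1f}")

A relative, history-based threshold of this kind would explain why the logged threshold drifts with the quartiles rather than staying constant.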
2023-06-23 14:15:06,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1504362.0, ans=0.125
2023-06-23 14:15:25,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1504422.0, ans=0.125
2023-06-23 14:15:26,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1504422.0, ans=10.0
2023-06-23 14:15:48,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.896e+02 6.568e+02 9.733e+02 1.340e+03 2.605e+03, threshold=1.947e+03, percent-clipped=12.0
2023-06-23 14:15:53,556 INFO [train.py:996] (3/4) Epoch 9, batch 6800, loss[loss=0.2399, simple_loss=0.3054, pruned_loss=0.08722, over 15563.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2978, pruned_loss=0.07774, over 4261509.97 frames. ], batch size: 64, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:16:32,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1504602.0, ans=0.04949747468305833
2023-06-23 14:16:47,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1504662.0, ans=0.125
2023-06-23 14:16:50,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1504722.0, ans=0.0
2023-06-23 14:17:19,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1504782.0, ans=0.125
2023-06-23 14:17:29,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504782.0, ans=0.1
2023-06-23 14:17:32,353 INFO [train.py:996] (3/4) Epoch 9, batch 6850, loss[loss=0.2533, simple_loss=0.3056, pruned_loss=0.1005, over 21259.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2964, pruned_loss=0.07912, over 4255277.80 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:17:34,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0
2023-06-23 14:18:52,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1505082.0, ans=0.2
2023-06-23 14:19:07,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 4.770e+02 6.261e+02 9.211e+02 1.923e+03, threshold=1.252e+03, percent-clipped=0.0
2023-06-23 14:19:12,168 INFO [train.py:996] (3/4) Epoch 9, batch 6900, loss[loss=0.2139, simple_loss=0.3065, pruned_loss=0.06063, over 21742.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2968, pruned_loss=0.0791, over 4265399.58 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:19:33,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1505202.0, ans=0.125
2023-06-23 14:20:20,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1505322.0, ans=0.0
2023-06-23 14:20:35,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1505382.0, ans=0.125
2023-06-23 14:20:46,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1505382.0, ans=0.025
2023-06-23 14:20:47,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1505382.0, ans=0.125
2023-06-23 14:20:51,924 INFO [train.py:996] (3/4) Epoch 9, batch 6950, loss[loss=0.2304, simple_loss=0.3049, pruned_loss=0.07797, over 21488.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2986, pruned_loss=0.07702, over 4266405.68 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:21:24,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1505502.0, ans=0.2
2023-06-23 14:21:30,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=12.0
2023-06-23 14:22:03,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=22.5
2023-06-23 14:22:16,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0
2023-06-23 14:22:26,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 5.568e+02 8.095e+02 1.122e+03 2.896e+03, threshold=1.619e+03, percent-clipped=20.0
2023-06-23 14:22:27,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.42 vs. limit=10.0
2023-06-23 14:22:31,427 INFO [train.py:996] (3/4) Epoch 9, batch 7000, loss[loss=0.2492, simple_loss=0.3113, pruned_loss=0.0935, over 21475.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3009, pruned_loss=0.07979, over 4265675.15 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:23:13,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1505862.0, ans=0.125
2023-06-23 14:23:24,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1505862.0, ans=0.125
2023-06-23 14:23:34,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1505922.0, ans=0.0
2023-06-23 14:24:16,425 INFO [train.py:996] (3/4) Epoch 9, batch 7050, loss[loss=0.2167, simple_loss=0.3061, pruned_loss=0.06372, over 21606.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3, pruned_loss=0.07813, over 4261861.47 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:24:58,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1506162.0, ans=0.0
2023-06-23 14:25:13,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1506222.0, ans=0.0
2023-06-23 14:25:50,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 5.176e+02 7.948e+02 1.176e+03 2.286e+03, threshold=1.590e+03, percent-clipped=9.0
2023-06-23 14:25:51,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0
2023-06-23 14:25:55,054 INFO [train.py:996] (3/4) Epoch 9, batch 7100, loss[loss=0.1851, simple_loss=0.2598, pruned_loss=0.05515, over 21351.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3046, pruned_loss=0.07941, over 4264589.26 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:25:55,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0
2023-06-23 14:26:37,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.72 vs. limit=15.0
2023-06-23 14:26:45,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1506462.0, ans=0.125
2023-06-23 14:27:03,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0
2023-06-23 14:27:14,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1506522.0, ans=0.125
2023-06-23 14:27:31,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1506582.0, ans=0.04949747468305833
2023-06-23 14:27:35,237 INFO [train.py:996] (3/4) Epoch 9, batch 7150, loss[loss=0.2351, simple_loss=0.3114, pruned_loss=0.07941, over 21769.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3029, pruned_loss=0.07645, over 4256298.40 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:29:10,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.602e+02 5.826e+02 7.818e+02 1.087e+03 2.405e+03, threshold=1.564e+03, percent-clipped=10.0
2023-06-23 14:29:20,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0
2023-06-23 14:29:21,004 INFO [train.py:996] (3/4) Epoch 9, batch 7200, loss[loss=0.2031, simple_loss=0.2662, pruned_loss=0.07003, over 21227.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3054, pruned_loss=0.07842, over 4257059.11 frames. ], batch size: 549, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:29:53,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1507002.0, ans=0.015
2023-06-23 14:30:00,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1507002.0, ans=0.0
2023-06-23 14:30:36,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1507122.0, ans=0.2
2023-06-23 14:31:00,739 INFO [train.py:996] (3/4) Epoch 9, batch 7250, loss[loss=0.2591, simple_loss=0.3072, pruned_loss=0.1055, over 21393.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3004, pruned_loss=0.079, over 4263218.50 frames. ], batch size: 475, lr: 3.34e-03, grad_scale: 32.0
2023-06-23 14:31:10,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1507242.0, ans=0.2
2023-06-23 14:31:10,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0
2023-06-23 14:31:13,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1507242.0, ans=0.0
2023-06-23 14:31:20,273 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 14:32:09,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1507422.0, ans=0.125
2023-06-23 14:32:19,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1507482.0, ans=0.2
2023-06-23 14:32:23,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507482.0, ans=0.1
2023-06-23 14:32:37,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.651e+02 4.870e+02 5.610e+02 7.177e+02 1.494e+03, threshold=1.122e+03, percent-clipped=0.0
2023-06-23 14:32:44,951 INFO [train.py:996] (3/4) Epoch 9, batch 7300, loss[loss=0.2079, simple_loss=0.2764, pruned_loss=0.06971, over 15782.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2958, pruned_loss=0.07936, over 4255073.34 frames. ], batch size: 60, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:32:52,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0
2023-06-23 14:34:00,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1507782.0, ans=0.2
2023-06-23 14:34:19,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1507782.0, ans=0.0
2023-06-23 14:34:25,783 INFO [train.py:996] (3/4) Epoch 9, batch 7350, loss[loss=0.3115, simple_loss=0.3631, pruned_loss=0.1299, over 21413.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2949, pruned_loss=0.07996, over 4254797.28 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:34:57,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1507902.0, ans=0.125
2023-06-23 14:36:02,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 6.185e+02 8.335e+02 1.224e+03 2.285e+03, threshold=1.667e+03, percent-clipped=37.0
2023-06-23 14:36:03,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0
2023-06-23 14:36:06,157 INFO [train.py:996] (3/4) Epoch 9, batch 7400, loss[loss=0.1953, simple_loss=0.282, pruned_loss=0.05429, over 20784.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3008, pruned_loss=0.08126, over 4251926.48 frames. ], batch size: 607, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:36:34,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1508202.0, ans=0.125
2023-06-23 14:37:03,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1508322.0, ans=0.125
2023-06-23 14:37:09,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0
2023-06-23 14:37:42,939 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 14:37:47,569 INFO [train.py:996] (3/4) Epoch 9, batch 7450, loss[loss=0.2306, simple_loss=0.2823, pruned_loss=0.08944, over 21288.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2993, pruned_loss=0.07991, over 4256075.67 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:38:01,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1508442.0, ans=0.02
2023-06-23 14:38:36,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1508562.0, ans=0.0
2023-06-23 14:39:26,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 5.448e+02 8.462e+02 1.438e+03 2.608e+03, threshold=1.692e+03, percent-clipped=12.0
2023-06-23 14:39:34,019 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 14:39:35,270 INFO [train.py:996] (3/4) Epoch 9, batch 7500, loss[loss=0.2402, simple_loss=0.3286, pruned_loss=0.07584, over 21366.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3074, pruned_loss=0.08089, over 4254631.74 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:40:38,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1508922.0, ans=10.0
2023-06-23 14:40:39,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1508922.0, ans=0.125
2023-06-23 14:41:16,331 INFO [train.py:996] (3/4) Epoch 9, batch 7550, loss[loss=0.2171, simple_loss=0.3137, pruned_loss=0.06023, over 21446.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3141, pruned_loss=0.08013, over 4260602.95 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0
2023-06-23 14:41:17,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0
2023-06-23 14:42:30,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1509222.0, ans=0.125
2023-06-23 14:42:52,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.575e+02 5.410e+02 7.103e+02 1.048e+03 2.085e+03, threshold=1.421e+03, percent-clipped=3.0
2023-06-23 14:42:56,181 INFO [train.py:996] (3/4) Epoch 9, batch 7600, loss[loss=0.2223, simple_loss=0.3062, pruned_loss=0.06922, over 21646.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3116, pruned_loss=0.07917, over 4272293.76 frames. ], batch size: 263, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 14:43:01,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1509342.0, ans=0.0
2023-06-23 14:43:06,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1509342.0, ans=0.125
2023-06-23 14:43:47,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0
2023-06-23 14:44:37,240 INFO [train.py:996] (3/4) Epoch 9, batch 7650, loss[loss=0.2387, simple_loss=0.3063, pruned_loss=0.08549, over 21365.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3098, pruned_loss=0.0808, over 4278589.73 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 14:44:40,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1509642.0, ans=0.125
2023-06-23 14:44:42,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1509642.0, ans=0.2
2023-06-23 14:46:15,407 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 5.554e+02 6.899e+02 1.039e+03 2.407e+03, threshold=1.380e+03, percent-clipped=12.0
2023-06-23 14:46:18,613 INFO [train.py:996] (3/4) Epoch 9, batch 7700, loss[loss=0.3187, simple_loss=0.3698, pruned_loss=0.1338, over 21492.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3132, pruned_loss=0.08391, over 4275148.42 frames. ], batch size: 510, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 14:47:17,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0
2023-06-23 14:47:30,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1510122.0, ans=0.5
2023-06-23 14:47:31,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.65 vs. limit=5.0
2023-06-23 14:48:05,169 INFO [train.py:996] (3/4) Epoch 9, batch 7750, loss[loss=0.2845, simple_loss=0.3705, pruned_loss=0.09922, over 21776.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3192, pruned_loss=0.08384, over 4274351.75 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 14:48:22,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1510302.0, ans=0.125
2023-06-23 14:48:54,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5
2023-06-23 14:49:18,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0
2023-06-23 14:49:44,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.753e+02 6.023e+02 8.792e+02 1.462e+03 2.647e+03, threshold=1.758e+03, percent-clipped=26.0
2023-06-23 14:49:46,170 INFO [train.py:996] (3/4) Epoch 9, batch 7800, loss[loss=0.2197, simple_loss=0.3011, pruned_loss=0.06912, over 21812.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3204, pruned_loss=0.08441, over 4274811.83 frames. ], batch size: 333, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 14:50:08,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1510602.0, ans=0.09899494936611666
2023-06-23 14:50:48,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5
2023-06-23 14:50:57,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1510722.0, ans=0.125
2023-06-23 14:50:58,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0
2023-06-23 14:51:25,366 INFO [train.py:996] (3/4) Epoch 9, batch 7850, loss[loss=0.1941, simple_loss=0.2636, pruned_loss=0.06232, over 21768.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3124, pruned_loss=0.08307, over 4267900.93 frames. ], batch size: 317, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 14:51:28,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1510842.0, ans=0.125
2023-06-23 14:52:09,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1510962.0, ans=0.125
2023-06-23 14:53:05,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.699e+02 5.401e+02 8.491e+02 1.335e+03 3.211e+03, threshold=1.698e+03, percent-clipped=14.0
2023-06-23 14:53:07,029 INFO [train.py:996] (3/4) Epoch 9, batch 7900, loss[loss=0.2255, simple_loss=0.3146, pruned_loss=0.06821, over 21611.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3087, pruned_loss=0.08196, over 4270506.93 frames. ], batch size: 263, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 14:53:20,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1511142.0, ans=0.2
2023-06-23 14:54:01,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0
2023-06-23 14:54:23,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1511322.0, ans=0.1
2023-06-23 14:54:48,978 INFO [train.py:996] (3/4) Epoch 9, batch 7950, loss[loss=0.2257, simple_loss=0.3037, pruned_loss=0.07379, over 21146.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3133, pruned_loss=0.08187, over 4265911.68 frames. ], batch size: 143, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 14:55:14,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0
2023-06-23 14:55:55,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1511562.0, ans=0.0
2023-06-23 14:55:57,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5
2023-06-23 14:55:58,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1511622.0, ans=0.125
2023-06-23 14:56:03,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1511622.0, ans=0.125
2023-06-23 14:56:13,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1511682.0, ans=0.0
2023-06-23 14:56:28,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1511682.0, ans=0.1
2023-06-23 14:56:40,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 6.156e+02 9.056e+02 1.636e+03 2.892e+03, threshold=1.811e+03, percent-clipped=22.0
2023-06-23 14:56:41,918 INFO [train.py:996] (3/4) Epoch 9, batch 8000, loss[loss=0.2576, simple_loss=0.349, pruned_loss=0.08308, over 21648.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.318, pruned_loss=0.08389, over 4264472.54 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 14:58:32,839 INFO [train.py:996] (3/4) Epoch 9, batch 8050, loss[loss=0.2689, simple_loss=0.3586, pruned_loss=0.08962, over 21655.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3188, pruned_loss=0.08303, over 4262720.83 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 14:58:33,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1512042.0, ans=0.125
2023-06-23 14:59:40,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1512222.0, ans=0.125
2023-06-23 15:00:10,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 6.350e+02 8.777e+02 1.241e+03 2.449e+03, threshold=1.755e+03, percent-clipped=9.0
2023-06-23 15:00:12,599 INFO [train.py:996] (3/4) Epoch 9, batch 8100, loss[loss=0.1865, simple_loss=0.2399, pruned_loss=0.06657, over 20773.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3162, pruned_loss=0.08337, over 4262978.17 frames. ], batch size: 609, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 15:00:31,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1512342.0, ans=0.125
2023-06-23 15:01:41,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1512582.0, ans=0.2
2023-06-23 15:02:01,800 INFO [train.py:996] (3/4) Epoch 9, batch 8150, loss[loss=0.2329, simple_loss=0.3346, pruned_loss=0.06563, over 20833.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3238, pruned_loss=0.08474, over 4267186.11 frames. ], batch size: 609, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:02:34,176 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 15:02:46,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1512762.0, ans=0.05
2023-06-23 15:02:48,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1512762.0, ans=0.2
2023-06-23 15:03:34,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1512882.0, ans=0.1
2023-06-23 15:03:40,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 6.551e+02 1.054e+03 1.725e+03 4.751e+03, threshold=2.109e+03, percent-clipped=24.0
2023-06-23 15:03:40,801 INFO [train.py:996] (3/4) Epoch 9, batch 8200, loss[loss=0.1918, simple_loss=0.2645, pruned_loss=0.05955, over 20820.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3156, pruned_loss=0.08228, over 4268333.57 frames. ], batch size: 609, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:05:22,543 INFO [train.py:996] (3/4) Epoch 9, batch 8250, loss[loss=0.2825, simple_loss=0.4007, pruned_loss=0.08212, over 20767.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3158, pruned_loss=0.08204, over 4267014.94 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:05:31,507 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 15:05:41,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1513242.0, ans=0.0
2023-06-23 15:05:58,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1513302.0, ans=0.2
2023-06-23 15:06:09,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0
2023-06-23 15:06:27,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1513422.0, ans=0.125
2023-06-23 15:06:29,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1513422.0, ans=0.0
2023-06-23 15:06:51,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1513482.0, ans=0.035
2023-06-23 15:07:04,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.470e+02 6.606e+02 8.935e+02 1.467e+03 2.616e+03, threshold=1.787e+03, percent-clipped=8.0
2023-06-23 15:07:04,505 INFO [train.py:996] (3/4) Epoch 9, batch 8300, loss[loss=0.1881, simple_loss=0.2634, pruned_loss=0.05636, over 21314.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3132, pruned_loss=0.07971, over 4266901.61 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:07:15,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5
2023-06-23 15:07:41,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1513602.0, ans=0.125
2023-06-23 15:08:31,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513782.0, ans=0.1
2023-06-23 15:08:43,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1513842.0, ans=0.125
2023-06-23 15:08:49,802 INFO [train.py:996] (3/4) Epoch 9, batch 8350, loss[loss=0.2378, simple_loss=0.3174, pruned_loss=0.07914, over 21482.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3115, pruned_loss=0.07804, over 4269093.19 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:09:36,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1513962.0, ans=0.125
2023-06-23 15:09:57,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1514022.0, ans=0.125
2023-06-23 15:10:30,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 4.465e+02 5.586e+02 8.616e+02 2.675e+03, threshold=1.117e+03, percent-clipped=3.0
2023-06-23 15:10:30,587 INFO [train.py:996] (3/4) Epoch 9, batch 8400, loss[loss=0.2154, simple_loss=0.2965, pruned_loss=0.06709, over 21498.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3085, pruned_loss=0.07526, over 4274475.64 frames. ], batch size: 212, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 15:10:47,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1514142.0, ans=0.0
2023-06-23 15:11:07,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1514262.0, ans=0.0
2023-06-23 15:11:39,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1514322.0, ans=0.0
2023-06-23 15:11:46,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0
2023-06-23 15:12:09,832 INFO [train.py:996] (3/4) Epoch 9, batch 8450, loss[loss=0.2423, simple_loss=0.312, pruned_loss=0.08625, over 21862.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3073, pruned_loss=0.07474, over 4282237.22 frames. ], batch size: 371, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 15:12:11,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1514442.0, ans=0.0
2023-06-23 15:12:14,367 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 15:12:33,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0
2023-06-23 15:12:44,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1514562.0, ans=0.125
2023-06-23 15:13:48,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0
2023-06-23 15:13:49,143 INFO [train.py:996] (3/4) Epoch 9, batch 8500, loss[loss=0.2159, simple_loss=0.2859, pruned_loss=0.07292, over 21717.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3052, pruned_loss=0.07611, over 4282741.24 frames. ], batch size: 112, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:13:50,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.348e+02 5.860e+02 7.972e+02 1.284e+03 3.475e+03, threshold=1.594e+03, percent-clipped=30.0
2023-06-23 15:13:52,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1514742.0, ans=0.125
2023-06-23 15:14:29,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514862.0, ans=0.1
2023-06-23 15:15:29,028 INFO [train.py:996] (3/4) Epoch 9, batch 8550, loss[loss=0.2879, simple_loss=0.3868, pruned_loss=0.09449, over 21254.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3104, pruned_loss=0.07948, over 4278706.33 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:15:29,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1515042.0, ans=0.2
2023-06-23 15:17:16,066 INFO [train.py:996] (3/4) Epoch 9, batch 8600, loss[loss=0.2177, simple_loss=0.3437, pruned_loss=0.04587, over 19813.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3162, pruned_loss=0.0808, over 4276359.23 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:17:17,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.589e+02 6.156e+02 8.850e+02 1.190e+03 2.823e+03, threshold=1.770e+03, percent-clipped=15.0
2023-06-23 15:17:39,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1515402.0, ans=0.2
2023-06-23 15:17:47,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1515402.0, ans=0.05
2023-06-23 15:18:14,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1515522.0, ans=0.04949747468305833
2023-06-23 15:18:58,337 INFO [train.py:996] (3/4) Epoch 9, batch 8650, loss[loss=0.2581, simple_loss=0.3502, pruned_loss=0.08304, over 21436.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3213, pruned_loss=0.08105, over 4275318.67 frames. ], batch size: 507, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:18:59,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.06 vs. limit=22.5
2023-06-23 15:19:13,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1515702.0, ans=0.2
2023-06-23 15:19:33,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0
2023-06-23 15:20:34,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1515882.0, ans=0.125
2023-06-23 15:20:37,550 INFO [train.py:996] (3/4) Epoch 9, batch 8700, loss[loss=0.2073, simple_loss=0.2728, pruned_loss=0.07092, over 21826.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3124, pruned_loss=0.07841, over 4264894.76 frames. ], batch size: 112, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:20:39,029 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.219e+02 7.580e+02 1.289e+03 2.063e+03, threshold=1.516e+03, percent-clipped=5.0
2023-06-23 15:21:30,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1516062.0, ans=0.1
2023-06-23 15:22:16,386 INFO [train.py:996] (3/4) Epoch 9, batch 8750, loss[loss=0.2298, simple_loss=0.295, pruned_loss=0.08233, over 20120.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3082, pruned_loss=0.07947, over 4265587.06 frames. ], batch size: 703, lr: 3.33e-03, grad_scale: 16.0
2023-06-23 15:22:18,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1516242.0, ans=0.0
2023-06-23 15:23:16,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1516422.0, ans=0.2
2023-06-23 15:23:59,297 INFO [train.py:996] (3/4) Epoch 9, batch 8800, loss[loss=0.2615, simple_loss=0.3395, pruned_loss=0.09179, over 21681.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3184, pruned_loss=0.08279, over 4270612.28 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 15:24:00,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.692e+02 5.630e+02 7.362e+02 1.054e+03 2.858e+03, threshold=1.472e+03, percent-clipped=8.0
2023-06-23 15:24:01,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1516542.0, ans=0.05
2023-06-23 15:25:43,309 INFO [train.py:996] (3/4) Epoch 9, batch 8850, loss[loss=0.267, simple_loss=0.3577, pruned_loss=0.08817, over 20956.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3254, pruned_loss=0.08508, over 4278419.46 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 32.0
2023-06-23 15:25:47,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=22.5
2023-06-23 15:26:12,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1516902.0, ans=0.125
2023-06-23 15:26:20,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0
2023-06-23 15:27:09,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5
2023-06-23 15:27:23,368 INFO [train.py:996] (3/4) Epoch 9, batch 8900, loss[loss=0.2187, simple_loss=0.2871, pruned_loss=0.07517, over 21417.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3196, pruned_loss=0.08366, over 4279306.10 frames.
], batch size: 194, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:27:30,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.861e+02 8.789e+02 1.394e+03 2.613e+03, threshold=1.758e+03, percent-clipped=19.0 2023-06-23 15:27:42,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1517142.0, ans=0.125 2023-06-23 15:27:42,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1517142.0, ans=0.125 2023-06-23 15:28:37,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1517322.0, ans=0.09899494936611666 2023-06-23 15:28:56,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=12.0 2023-06-23 15:29:06,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1517382.0, ans=0.125 2023-06-23 15:29:10,570 INFO [train.py:996] (3/4) Epoch 9, batch 8950, loss[loss=0.2952, simple_loss=0.3776, pruned_loss=0.1064, over 21624.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3214, pruned_loss=0.08302, over 4282329.74 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:17,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517622.0, ans=0.1 2023-06-23 15:30:49,403 INFO [train.py:996] (3/4) Epoch 9, batch 9000, loss[loss=0.2311, simple_loss=0.2926, pruned_loss=0.08478, over 21722.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3148, pruned_loss=0.08265, over 4283770.65 frames. ], batch size: 300, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:49,404 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 15:31:06,584 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.258, simple_loss=0.3541, pruned_loss=0.08091, over 1796401.00 frames. 2023-06-23 15:31:06,585 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 15:31:08,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 6.929e+02 1.126e+03 1.882e+03 3.988e+03, threshold=2.252e+03, percent-clipped=24.0 2023-06-23 15:31:25,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-23 15:31:41,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1517802.0, ans=0.125 2023-06-23 15:31:44,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1517802.0, ans=0.125 2023-06-23 15:32:26,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=22.5 2023-06-23 15:32:53,762 INFO [train.py:996] (3/4) Epoch 9, batch 9050, loss[loss=0.1982, simple_loss=0.2827, pruned_loss=0.05682, over 21667.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.312, pruned_loss=0.0791, over 4276788.91 frames. 
], batch size: 298, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:34:10,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1518222.0, ans=0.0 2023-06-23 15:34:10,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1518222.0, ans=0.125 2023-06-23 15:34:39,757 INFO [train.py:996] (3/4) Epoch 9, batch 9100, loss[loss=0.2482, simple_loss=0.3447, pruned_loss=0.07586, over 21603.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3184, pruned_loss=0.08217, over 4279613.74 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:34:42,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 5.248e+02 7.167e+02 1.150e+03 2.223e+03, threshold=1.433e+03, percent-clipped=0.0 2023-06-23 15:34:48,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1518342.0, ans=0.0 2023-06-23 15:36:14,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1518582.0, ans=0.05 2023-06-23 15:36:18,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1518582.0, ans=0.125 2023-06-23 15:36:21,747 INFO [train.py:996] (3/4) Epoch 9, batch 9150, loss[loss=0.2225, simple_loss=0.3105, pruned_loss=0.06722, over 21374.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3209, pruned_loss=0.07994, over 4276319.20 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:36:56,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-23 15:36:57,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1518702.0, ans=0.04949747468305833 2023-06-23 15:37:57,523 INFO [train.py:996] (3/4) Epoch 9, batch 9200, loss[loss=0.2882, simple_loss=0.3619, pruned_loss=0.1072, over 21819.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3228, pruned_loss=0.07972, over 4283059.26 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 15:38:01,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.542e+02 9.064e+02 1.359e+03 2.938e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-23 15:38:29,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1519002.0, ans=10.0 2023-06-23 15:38:40,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1519062.0, ans=0.1 2023-06-23 15:38:52,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-23 15:39:29,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1519182.0, ans=0.125 2023-06-23 15:39:33,632 INFO [train.py:996] (3/4) Epoch 9, batch 9250, loss[loss=0.2759, simple_loss=0.3371, pruned_loss=0.1074, over 21490.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3262, pruned_loss=0.08329, over 4278633.75 frames. 
], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:40:00,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1519302.0, ans=0.2 2023-06-23 15:40:14,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-23 15:40:41,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1519422.0, ans=0.2 2023-06-23 15:41:16,019 INFO [train.py:996] (3/4) Epoch 9, batch 9300, loss[loss=0.2391, simple_loss=0.2969, pruned_loss=0.0906, over 21111.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3217, pruned_loss=0.08349, over 4273697.19 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:41:16,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-23 15:41:17,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1519542.0, ans=0.125 2023-06-23 15:41:20,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 6.631e+02 9.639e+02 1.652e+03 4.303e+03, threshold=1.928e+03, percent-clipped=19.0 2023-06-23 15:41:42,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1519602.0, ans=0.125 2023-06-23 15:42:37,032 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:42:43,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1519782.0, ans=0.125 2023-06-23 15:43:02,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1519842.0, ans=0.125 2023-06-23 15:43:03,514 INFO [train.py:996] (3/4) Epoch 9, batch 9350, loss[loss=0.2632, simple_loss=0.3419, pruned_loss=0.09229, over 21536.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3265, pruned_loss=0.08375, over 4274357.34 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:44:48,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-23 15:44:50,878 INFO [train.py:996] (3/4) Epoch 9, batch 9400, loss[loss=0.2003, simple_loss=0.2719, pruned_loss=0.06435, over 21592.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3267, pruned_loss=0.08436, over 4273979.70 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:44:57,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 5.147e+02 6.319e+02 1.049e+03 2.062e+03, threshold=1.264e+03, percent-clipped=1.0 2023-06-23 15:45:23,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1520202.0, ans=0.2 2023-06-23 15:45:38,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. 
limit=22.5 2023-06-23 15:45:55,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1520322.0, ans=0.0 2023-06-23 15:46:30,673 INFO [train.py:996] (3/4) Epoch 9, batch 9450, loss[loss=0.216, simple_loss=0.2791, pruned_loss=0.07649, over 21579.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3177, pruned_loss=0.08338, over 4277860.58 frames. ], batch size: 415, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:46:32,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1520442.0, ans=0.125 2023-06-23 15:46:40,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1520442.0, ans=0.125 2023-06-23 15:46:54,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-23 15:47:32,215 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:47:32,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1520622.0, ans=0.125 2023-06-23 15:48:06,752 INFO [train.py:996] (3/4) Epoch 9, batch 9500, loss[loss=0.1705, simple_loss=0.2564, pruned_loss=0.0423, over 21667.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3116, pruned_loss=0.08101, over 4265950.81 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:48:13,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 6.669e+02 1.059e+03 1.542e+03 2.765e+03, threshold=2.119e+03, percent-clipped=38.0 2023-06-23 15:48:22,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1520802.0, ans=0.125 2023-06-23 15:48:46,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1520802.0, ans=0.0 2023-06-23 15:48:58,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1520862.0, ans=0.125 2023-06-23 15:49:13,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1520922.0, ans=0.2 2023-06-23 15:49:48,341 INFO [train.py:996] (3/4) Epoch 9, batch 9550, loss[loss=0.2269, simple_loss=0.3273, pruned_loss=0.0633, over 19773.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3153, pruned_loss=0.08351, over 4267971.59 frames. ], batch size: 703, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:50:07,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1521102.0, ans=0.2 2023-06-23 15:51:28,274 INFO [train.py:996] (3/4) Epoch 9, batch 9600, loss[loss=0.2431, simple_loss=0.3093, pruned_loss=0.08842, over 21905.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3176, pruned_loss=0.08554, over 4275538.28 frames. 
], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:51:35,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.691e+02 5.650e+02 7.031e+02 8.940e+02 1.543e+03, threshold=1.406e+03, percent-clipped=0.0 2023-06-23 15:51:47,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1521402.0, ans=0.2 2023-06-23 15:52:33,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1521522.0, ans=0.125 2023-06-23 15:53:10,503 INFO [train.py:996] (3/4) Epoch 9, batch 9650, loss[loss=0.2783, simple_loss=0.3542, pruned_loss=0.1012, over 21515.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3172, pruned_loss=0.08471, over 4280197.92 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:53:50,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1521702.0, ans=0.1 2023-06-23 15:53:53,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1521762.0, ans=0.125 2023-06-23 15:54:16,389 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:54:38,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1521882.0, ans=0.125 2023-06-23 15:54:51,608 INFO [train.py:996] (3/4) Epoch 9, batch 9700, loss[loss=0.2162, simple_loss=0.3039, pruned_loss=0.06424, over 21672.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3186, pruned_loss=0.08427, over 4284120.96 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:54:59,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-23 15:55:02,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 5.546e+02 7.387e+02 1.131e+03 2.841e+03, threshold=1.477e+03, percent-clipped=15.0 2023-06-23 15:55:33,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1522062.0, ans=0.125 2023-06-23 15:55:50,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-23 15:56:32,914 INFO [train.py:996] (3/4) Epoch 9, batch 9750, loss[loss=0.2894, simple_loss=0.3817, pruned_loss=0.09849, over 21826.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3122, pruned_loss=0.08333, over 4291152.03 frames. 
], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:56:35,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1522242.0, ans=0.0 2023-06-23 15:56:47,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1522242.0, ans=0.05 2023-06-23 15:56:57,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1522302.0, ans=0.0 2023-06-23 15:57:18,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1522362.0, ans=0.125 2023-06-23 15:57:27,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-23 15:58:11,378 INFO [train.py:996] (3/4) Epoch 9, batch 9800, loss[loss=0.2274, simple_loss=0.2987, pruned_loss=0.07803, over 21910.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3133, pruned_loss=0.08333, over 4287797.87 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:58:15,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522542.0, ans=0.1 2023-06-23 15:58:18,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 5.907e+02 7.792e+02 1.093e+03 2.144e+03, threshold=1.558e+03, percent-clipped=9.0 2023-06-23 15:58:22,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1522542.0, ans=0.125 2023-06-23 15:58:31,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1522602.0, ans=0.05 2023-06-23 15:58:54,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1522662.0, ans=0.125 2023-06-23 15:59:00,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1522662.0, ans=0.2 2023-06-23 15:59:14,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1522722.0, ans=0.125 2023-06-23 15:59:49,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-23 15:59:53,346 INFO [train.py:996] (3/4) Epoch 9, batch 9850, loss[loss=0.2236, simple_loss=0.2919, pruned_loss=0.07763, over 21788.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3095, pruned_loss=0.08235, over 4276658.01 frames. ], batch size: 333, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:00:06,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522842.0, ans=0.1 2023-06-23 16:00:29,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. 
limit=22.5 2023-06-23 16:01:30,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1523082.0, ans=0.125 2023-06-23 16:01:31,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1523082.0, ans=0.125 2023-06-23 16:01:34,835 INFO [train.py:996] (3/4) Epoch 9, batch 9900, loss[loss=0.2613, simple_loss=0.3339, pruned_loss=0.09441, over 21418.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3075, pruned_loss=0.08193, over 4272417.29 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:01:45,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.699e+02 7.870e+02 1.232e+03 3.104e+03, threshold=1.574e+03, percent-clipped=11.0 2023-06-23 16:01:52,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1523142.0, ans=0.125 2023-06-23 16:02:09,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1523202.0, ans=0.125 2023-06-23 16:02:41,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-23 16:03:15,819 INFO [train.py:996] (3/4) Epoch 9, batch 9950, loss[loss=0.2528, simple_loss=0.3163, pruned_loss=0.09466, over 21826.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3088, pruned_loss=0.08344, over 4273699.47 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:03:47,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-23 16:03:48,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 16:03:53,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523502.0, ans=0.125 2023-06-23 16:04:13,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523562.0, ans=0.125 2023-06-23 16:04:35,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-23 16:05:02,509 INFO [train.py:996] (3/4) Epoch 9, batch 10000, loss[loss=0.2045, simple_loss=0.2728, pruned_loss=0.06814, over 21768.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3039, pruned_loss=0.08236, over 4271001.02 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 16:05:14,523 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 5.460e+02 7.211e+02 1.053e+03 2.107e+03, threshold=1.442e+03, percent-clipped=5.0 2023-06-23 16:05:23,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. 
limit=15.0 2023-06-23 16:05:45,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1523862.0, ans=0.125 2023-06-23 16:06:09,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1523922.0, ans=0.1 2023-06-23 16:06:17,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-23 16:06:50,037 INFO [train.py:996] (3/4) Epoch 9, batch 10050, loss[loss=0.1861, simple_loss=0.2611, pruned_loss=0.05553, over 21424.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3079, pruned_loss=0.08353, over 4276374.86 frames. ], batch size: 211, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:07:01,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1524042.0, ans=0.125 2023-06-23 16:07:24,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524102.0, ans=0.1 2023-06-23 16:07:49,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-23 16:08:33,212 INFO [train.py:996] (3/4) Epoch 9, batch 10100, loss[loss=0.2265, simple_loss=0.2983, pruned_loss=0.0773, over 20264.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3059, pruned_loss=0.08104, over 4269145.28 frames. ], batch size: 707, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:08:41,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.845e+02 8.901e+02 1.389e+03 2.930e+03, threshold=1.780e+03, percent-clipped=23.0 2023-06-23 16:08:49,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.24 vs. limit=5.0 2023-06-23 16:09:25,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1524462.0, ans=0.0 2023-06-23 16:10:08,486 INFO [train.py:996] (3/4) Epoch 9, batch 10150, loss[loss=0.2054, simple_loss=0.2807, pruned_loss=0.06511, over 21657.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3103, pruned_loss=0.08342, over 4268890.60 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:10:11,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-23 16:10:18,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1524642.0, ans=0.125 2023-06-23 16:11:00,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1524762.0, ans=0.125 2023-06-23 16:11:29,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1524822.0, ans=0.0 2023-06-23 16:11:32,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1524882.0, ans=0.125 2023-06-23 16:11:38,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=22.5 2023-06-23 16:11:48,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-23 16:11:48,560 INFO [train.py:996] (3/4) Epoch 9, batch 10200, loss[loss=0.2072, simple_loss=0.2979, pruned_loss=0.05823, over 21678.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3095, pruned_loss=0.08134, over 4255320.88 frames. ], batch size: 391, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:12:03,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.213e+02 5.208e+02 7.016e+02 1.136e+03 3.363e+03, threshold=1.403e+03, percent-clipped=6.0 2023-06-23 16:12:12,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525002.0, ans=0.1 2023-06-23 16:12:31,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1525062.0, ans=0.0 2023-06-23 16:12:40,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1525062.0, ans=0.2 2023-06-23 16:13:11,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-06-23 16:13:24,890 INFO [train.py:996] (3/4) Epoch 9, batch 10250, loss[loss=0.1624, simple_loss=0.2543, pruned_loss=0.03528, over 21555.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3036, pruned_loss=0.07473, over 4267646.80 frames. ], batch size: 230, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:14:18,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1525362.0, ans=0.125 2023-06-23 16:14:39,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1525422.0, ans=0.0 2023-06-23 16:14:44,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1525422.0, ans=0.0 2023-06-23 16:15:14,131 INFO [train.py:996] (3/4) Epoch 9, batch 10300, loss[loss=0.2358, simple_loss=0.3349, pruned_loss=0.06835, over 21809.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3075, pruned_loss=0.07622, over 4275997.16 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:15:24,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 5.852e+02 8.943e+02 1.203e+03 2.933e+03, threshold=1.789e+03, percent-clipped=17.0 2023-06-23 16:16:49,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=15.0 2023-06-23 16:16:56,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-23 16:16:56,862 INFO [train.py:996] (3/4) Epoch 9, batch 10350, loss[loss=0.2091, simple_loss=0.2914, pruned_loss=0.06337, over 21678.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3089, pruned_loss=0.07645, over 4275230.48 frames. 
], batch size: 351, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:17:38,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1525962.0, ans=0.2 2023-06-23 16:17:58,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-23 16:18:14,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1526022.0, ans=0.125 2023-06-23 16:18:45,380 INFO [train.py:996] (3/4) Epoch 9, batch 10400, loss[loss=0.279, simple_loss=0.3486, pruned_loss=0.1047, over 21455.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3045, pruned_loss=0.07646, over 4264471.57 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:18:52,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1526142.0, ans=10.0 2023-06-23 16:18:55,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.585e+02 9.781e+02 1.543e+03 3.065e+03, threshold=1.956e+03, percent-clipped=20.0 2023-06-23 16:18:55,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1526142.0, ans=0.0 2023-06-23 16:19:15,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1526202.0, ans=0.125 2023-06-23 16:19:54,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-23 16:20:14,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526382.0, ans=0.1 2023-06-23 16:20:15,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1526382.0, ans=0.125 2023-06-23 16:20:27,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1526442.0, ans=0.1 2023-06-23 16:20:28,475 INFO [train.py:996] (3/4) Epoch 9, batch 10450, loss[loss=0.2484, simple_loss=0.3206, pruned_loss=0.08815, over 21764.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3078, pruned_loss=0.07942, over 4273721.56 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:20:28,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1526442.0, ans=0.0 2023-06-23 16:21:19,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1526562.0, ans=0.09899494936611666 2023-06-23 16:21:35,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526622.0, ans=0.1 2023-06-23 16:21:49,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-23 16:22:09,061 INFO [train.py:996] (3/4) Epoch 9, batch 10500, loss[loss=0.2381, simple_loss=0.2979, pruned_loss=0.08913, over 21489.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3067, pruned_loss=0.0784, over 4274394.03 frames. 
], batch size: 441, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:22:23,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 6.343e+02 8.149e+02 1.174e+03 2.736e+03, threshold=1.630e+03, percent-clipped=6.0 2023-06-23 16:23:51,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1526982.0, ans=0.125 2023-06-23 16:23:53,898 INFO [train.py:996] (3/4) Epoch 9, batch 10550, loss[loss=0.2024, simple_loss=0.2615, pruned_loss=0.07169, over 21755.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3014, pruned_loss=0.07792, over 4262495.57 frames. ], batch size: 124, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:24:14,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527102.0, ans=0.1 2023-06-23 16:24:14,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1527102.0, ans=0.0 2023-06-23 16:24:15,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-23 16:24:25,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-23 16:24:42,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=12.0 2023-06-23 16:24:52,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-23 16:25:16,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1527282.0, ans=0.2 2023-06-23 16:25:21,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1527282.0, ans=0.0 2023-06-23 16:25:24,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-23 16:25:35,484 INFO [train.py:996] (3/4) Epoch 9, batch 10600, loss[loss=0.1722, simple_loss=0.2683, pruned_loss=0.03802, over 21640.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2967, pruned_loss=0.07617, over 4263195.94 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:25:37,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1527342.0, ans=0.0 2023-06-23 16:25:50,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.123e+02 6.754e+02 9.468e+02 2.113e+03, threshold=1.351e+03, percent-clipped=4.0 2023-06-23 16:25:57,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1527402.0, ans=0.125 2023-06-23 16:26:02,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. 
limit=10.0 2023-06-23 16:26:33,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1527462.0, ans=0.2 2023-06-23 16:26:35,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1527522.0, ans=0.125 2023-06-23 16:26:50,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1527522.0, ans=0.1 2023-06-23 16:26:54,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1527522.0, ans=0.125 2023-06-23 16:27:21,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1527642.0, ans=0.04949747468305833 2023-06-23 16:27:22,964 INFO [train.py:996] (3/4) Epoch 9, batch 10650, loss[loss=0.1786, simple_loss=0.2682, pruned_loss=0.04444, over 21833.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.299, pruned_loss=0.07492, over 4268255.80 frames. ], batch size: 317, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:27:25,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1527642.0, ans=0.125 2023-06-23 16:27:58,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1527762.0, ans=0.0 2023-06-23 16:29:03,467 INFO [train.py:996] (3/4) Epoch 9, batch 10700, loss[loss=0.2647, simple_loss=0.3353, pruned_loss=0.09699, over 21309.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2995, pruned_loss=0.07519, over 4262941.68 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:29:12,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.925e+02 6.514e+02 1.117e+03 1.445e+03 3.043e+03, threshold=2.235e+03, percent-clipped=29.0 2023-06-23 16:29:36,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1528002.0, ans=0.2 2023-06-23 16:29:40,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-23 16:30:20,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1528122.0, ans=0.1 2023-06-23 16:30:40,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1528182.0, ans=0.125 2023-06-23 16:30:47,162 INFO [train.py:996] (3/4) Epoch 9, batch 10750, loss[loss=0.2606, simple_loss=0.3376, pruned_loss=0.09182, over 21266.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3099, pruned_loss=0.07933, over 4259531.78 frames. ], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:31:49,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1528422.0, ans=0.0 2023-06-23 16:32:33,879 INFO [train.py:996] (3/4) Epoch 9, batch 10800, loss[loss=0.2437, simple_loss=0.318, pruned_loss=0.08467, over 21760.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3137, pruned_loss=0.07932, over 4262682.44 frames. 
], batch size: 332, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:32:42,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1528542.0, ans=0.125 2023-06-23 16:32:43,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.066e+02 7.349e+02 1.067e+03 2.269e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-23 16:34:14,700 INFO [train.py:996] (3/4) Epoch 9, batch 10850, loss[loss=0.2205, simple_loss=0.2866, pruned_loss=0.07724, over 21271.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3151, pruned_loss=0.07999, over 4261252.89 frames. ], batch size: 131, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:34:39,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1528902.0, ans=0.125 2023-06-23 16:35:39,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-23 16:35:49,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1529082.0, ans=0.2 2023-06-23 16:35:56,331 INFO [train.py:996] (3/4) Epoch 9, batch 10900, loss[loss=0.2288, simple_loss=0.3314, pruned_loss=0.06314, over 20839.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3097, pruned_loss=0.07842, over 4247344.54 frames. ], batch size: 609, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:36:12,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.519e+02 5.159e+02 7.524e+02 1.150e+03 2.135e+03, threshold=1.505e+03, percent-clipped=11.0 2023-06-23 16:36:16,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1529202.0, ans=0.2 2023-06-23 16:36:19,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529202.0, ans=0.1 2023-06-23 16:36:48,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=22.5 2023-06-23 16:37:05,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1529322.0, ans=0.125 2023-06-23 16:37:36,154 INFO [train.py:996] (3/4) Epoch 9, batch 10950, loss[loss=0.1928, simple_loss=0.2602, pruned_loss=0.06268, over 21542.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3052, pruned_loss=0.07717, over 4244377.78 frames. ], batch size: 263, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:37:54,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1529442.0, ans=0.2 2023-06-23 16:38:07,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1529502.0, ans=0.2 2023-06-23 16:38:13,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1529502.0, ans=0.0 2023-06-23 16:38:34,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1529562.0, ans=0.1 2023-06-23 16:38:38,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. 
limit=15.0 2023-06-23 16:38:44,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529622.0, ans=0.125 2023-06-23 16:38:58,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-23 16:39:14,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.90 vs. limit=15.0 2023-06-23 16:39:16,265 INFO [train.py:996] (3/4) Epoch 9, batch 11000, loss[loss=0.2162, simple_loss=0.284, pruned_loss=0.07419, over 20040.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3041, pruned_loss=0.07743, over 4237816.93 frames. ], batch size: 703, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:39:32,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 5.350e+02 8.050e+02 1.212e+03 3.028e+03, threshold=1.610e+03, percent-clipped=11.0 2023-06-23 16:40:44,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-23 16:40:45,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1529982.0, ans=0.0 2023-06-23 16:40:54,200 INFO [train.py:996] (3/4) Epoch 9, batch 11050, loss[loss=0.2376, simple_loss=0.2804, pruned_loss=0.09742, over 21388.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3012, pruned_loss=0.07852, over 4233850.15 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:41:12,329 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:41:46,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530162.0, ans=0.1 2023-06-23 16:41:59,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1530222.0, ans=10.0 2023-06-23 16:42:03,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1530222.0, ans=10.0 2023-06-23 16:42:38,852 INFO [train.py:996] (3/4) Epoch 9, batch 11100, loss[loss=0.2551, simple_loss=0.3115, pruned_loss=0.0993, over 21842.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2994, pruned_loss=0.07853, over 4245102.25 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:42:45,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1530342.0, ans=0.0 2023-06-23 16:42:51,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.087e+02 6.616e+02 8.877e+02 2.244e+03, threshold=1.323e+03, percent-clipped=5.0 2023-06-23 16:43:10,589 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:43:33,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1530522.0, ans=0.125 2023-06-23 16:44:18,613 INFO [train.py:996] (3/4) Epoch 9, batch 11150, loss[loss=0.2448, simple_loss=0.3222, pruned_loss=0.08372, over 21203.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2981, pruned_loss=0.07854, over 4248235.26 frames. 
], batch size: 159, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:44:52,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1530702.0, ans=0.125 2023-06-23 16:45:00,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1530762.0, ans=0.125 2023-06-23 16:45:02,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1530762.0, ans=0.125 2023-06-23 16:45:02,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1530762.0, ans=0.2 2023-06-23 16:45:04,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-23 16:45:58,171 INFO [train.py:996] (3/4) Epoch 9, batch 11200, loss[loss=0.204, simple_loss=0.2607, pruned_loss=0.07362, over 21769.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2973, pruned_loss=0.07832, over 4237252.64 frames. ], batch size: 112, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:46:02,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-23 16:46:03,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1530942.0, ans=0.07 2023-06-23 16:46:04,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1530942.0, ans=0.0 2023-06-23 16:46:10,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.536e+02 7.546e+02 1.213e+03 2.221e+03, threshold=1.509e+03, percent-clipped=19.0 2023-06-23 16:46:40,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-23 16:46:49,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-23 16:47:00,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1531122.0, ans=0.0 2023-06-23 16:47:08,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1531122.0, ans=0.0 2023-06-23 16:47:32,746 INFO [train.py:996] (3/4) Epoch 9, batch 11250, loss[loss=0.2349, simple_loss=0.3059, pruned_loss=0.08198, over 21367.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2961, pruned_loss=0.07827, over 4247628.77 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:47:42,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1531242.0, ans=0.09899494936611666 2023-06-23 16:47:45,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1531242.0, ans=0.125 2023-06-23 16:49:11,422 INFO [train.py:996] (3/4) Epoch 9, batch 11300, loss[loss=0.2405, simple_loss=0.3039, pruned_loss=0.08858, over 21335.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2985, pruned_loss=0.0783, over 4257445.80 frames. 
], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:49:28,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 5.299e+02 7.050e+02 1.034e+03 1.810e+03, threshold=1.410e+03, percent-clipped=1.0 2023-06-23 16:49:30,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1531602.0, ans=0.0 2023-06-23 16:49:39,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1531602.0, ans=0.125 2023-06-23 16:50:56,816 INFO [train.py:996] (3/4) Epoch 9, batch 11350, loss[loss=0.254, simple_loss=0.3347, pruned_loss=0.0867, over 21711.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3004, pruned_loss=0.07758, over 4260783.40 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:51:53,093 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:52:06,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1532022.0, ans=0.2 2023-06-23 16:52:39,048 INFO [train.py:996] (3/4) Epoch 9, batch 11400, loss[loss=0.1921, simple_loss=0.2422, pruned_loss=0.07097, over 16719.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3064, pruned_loss=0.08006, over 4259004.10 frames. ], batch size: 61, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:52:56,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 6.591e+02 8.859e+02 1.390e+03 3.018e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 16:53:24,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-23 16:54:04,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1532382.0, ans=0.0 2023-06-23 16:54:06,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-23 16:54:07,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1532382.0, ans=0.0 2023-06-23 16:54:20,200 INFO [train.py:996] (3/4) Epoch 9, batch 11450, loss[loss=0.2419, simple_loss=0.3283, pruned_loss=0.07776, over 21522.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3081, pruned_loss=0.07932, over 4262252.16 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:54:48,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1532502.0, ans=0.125 2023-06-23 16:55:56,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1532682.0, ans=0.125 2023-06-23 16:56:02,576 INFO [train.py:996] (3/4) Epoch 9, batch 11500, loss[loss=0.1971, simple_loss=0.2873, pruned_loss=0.05349, over 21424.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3108, pruned_loss=0.07967, over 4268355.26 frames. 
], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:56:19,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.487e+02 5.481e+02 7.380e+02 1.202e+03 2.850e+03, threshold=1.476e+03, percent-clipped=9.0 2023-06-23 16:56:22,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1532802.0, ans=0.0 2023-06-23 16:56:33,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1532802.0, ans=0.0 2023-06-23 16:56:43,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1532862.0, ans=0.1 2023-06-23 16:57:27,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1532982.0, ans=0.0 2023-06-23 16:57:33,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1532982.0, ans=0.5 2023-06-23 16:57:49,151 INFO [train.py:996] (3/4) Epoch 9, batch 11550, loss[loss=0.3345, simple_loss=0.4554, pruned_loss=0.1068, over 21178.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3177, pruned_loss=0.07972, over 4270561.41 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:58:10,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1533102.0, ans=0.125 2023-06-23 16:58:27,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533102.0, ans=0.125 2023-06-23 16:59:25,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1533282.0, ans=0.025 2023-06-23 16:59:28,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1533282.0, ans=0.125 2023-06-23 16:59:31,702 INFO [train.py:996] (3/4) Epoch 9, batch 11600, loss[loss=0.25, simple_loss=0.3374, pruned_loss=0.08132, over 21799.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3297, pruned_loss=0.08166, over 4273522.61 frames. ], batch size: 124, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:59:50,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 7.053e+02 9.279e+02 1.499e+03 3.190e+03, threshold=1.856e+03, percent-clipped=25.0 2023-06-23 17:00:02,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1533402.0, ans=0.07 2023-06-23 17:00:20,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1533462.0, ans=0.2 2023-06-23 17:00:27,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533462.0, ans=0.1 2023-06-23 17:00:55,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1533582.0, ans=0.0 2023-06-23 17:00:57,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-23 17:01:12,915 INFO [train.py:996] (3/4) Epoch 9, batch 11650, loss[loss=0.2184, simple_loss=0.289, pruned_loss=0.07385, over 21837.00 frames. 
], tot_loss[loss=0.2509, simple_loss=0.3361, pruned_loss=0.08281, over 4276972.49 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:01:13,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-23 17:01:28,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1533642.0, ans=0.125 2023-06-23 17:01:44,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1533702.0, ans=0.125 2023-06-23 17:01:59,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533762.0, ans=0.1 2023-06-23 17:02:06,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1533762.0, ans=0.05 2023-06-23 17:02:22,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1533822.0, ans=0.0 2023-06-23 17:02:51,590 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.145e-02 2023-06-23 17:02:52,842 INFO [train.py:996] (3/4) Epoch 9, batch 11700, loss[loss=0.2088, simple_loss=0.2737, pruned_loss=0.07196, over 21673.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3267, pruned_loss=0.08216, over 4273924.20 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:03:10,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 7.514e+02 1.058e+03 1.633e+03 4.255e+03, threshold=2.116e+03, percent-clipped=16.0 2023-06-23 17:03:12,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1534002.0, ans=0.2 2023-06-23 17:03:12,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1534002.0, ans=0.1 2023-06-23 17:03:23,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1534002.0, ans=0.125 2023-06-23 17:04:07,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1534122.0, ans=0.125 2023-06-23 17:04:09,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1534182.0, ans=0.125 2023-06-23 17:04:31,755 INFO [train.py:996] (3/4) Epoch 9, batch 11750, loss[loss=0.2299, simple_loss=0.2853, pruned_loss=0.08728, over 21627.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3167, pruned_loss=0.08184, over 4272556.96 frames. 
], batch size: 231, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:05:11,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1534302.0, ans=0.0 2023-06-23 17:05:30,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1534362.0, ans=0.0 2023-06-23 17:05:44,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1534422.0, ans=0.1 2023-06-23 17:05:55,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1534482.0, ans=0.2 2023-06-23 17:06:17,848 INFO [train.py:996] (3/4) Epoch 9, batch 11800, loss[loss=0.2655, simple_loss=0.3345, pruned_loss=0.09828, over 21392.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3203, pruned_loss=0.08458, over 4268022.19 frames. ], batch size: 549, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:06:21,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1534542.0, ans=0.125 2023-06-23 17:06:32,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.688e+02 5.572e+02 8.368e+02 1.434e+03 3.192e+03, threshold=1.674e+03, percent-clipped=11.0 2023-06-23 17:06:33,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-23 17:07:05,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534662.0, ans=0.1 2023-06-23 17:07:32,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-23 17:07:58,030 INFO [train.py:996] (3/4) Epoch 9, batch 11850, loss[loss=0.24, simple_loss=0.3343, pruned_loss=0.07288, over 21823.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.321, pruned_loss=0.08338, over 4276502.13 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:39,290 INFO [train.py:996] (3/4) Epoch 9, batch 11900, loss[loss=0.2494, simple_loss=0.3297, pruned_loss=0.08451, over 21369.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3208, pruned_loss=0.08091, over 4281001.35 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:59,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.472e+02 7.234e+02 9.480e+02 2.463e+03, threshold=1.447e+03, percent-clipped=1.0 2023-06-23 17:10:50,936 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:11:13,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1535382.0, ans=0.0 2023-06-23 17:11:26,103 INFO [train.py:996] (3/4) Epoch 9, batch 11950, loss[loss=0.2295, simple_loss=0.3426, pruned_loss=0.05823, over 21201.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3208, pruned_loss=0.07768, over 4276800.09 frames. 
], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:12:01,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1535562.0, ans=0.125 2023-06-23 17:12:04,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1535562.0, ans=0.2 2023-06-23 17:12:05,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:12:07,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1535562.0, ans=0.05 2023-06-23 17:13:01,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1535682.0, ans=10.0 2023-06-23 17:13:03,641 INFO [train.py:996] (3/4) Epoch 9, batch 12000, loss[loss=0.2469, simple_loss=0.3061, pruned_loss=0.09387, over 21786.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3168, pruned_loss=0.07618, over 4270665.62 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:13:03,642 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 17:13:24,468 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2567, simple_loss=0.3528, pruned_loss=0.08029, over 1796401.00 frames. 2023-06-23 17:13:24,469 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 17:13:38,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.383e+02 5.788e+02 7.844e+02 1.305e+03 3.845e+03, threshold=1.569e+03, percent-clipped=19.0 2023-06-23 17:13:59,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1535862.0, ans=0.0 2023-06-23 17:14:04,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1535862.0, ans=0.09899494936611666 2023-06-23 17:14:13,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5 2023-06-23 17:14:14,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1535862.0, ans=0.125 2023-06-23 17:14:40,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1535982.0, ans=0.0 2023-06-23 17:14:57,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1536042.0, ans=0.5 2023-06-23 17:15:03,715 INFO [train.py:996] (3/4) Epoch 9, batch 12050, loss[loss=0.2954, simple_loss=0.3426, pruned_loss=0.1241, over 21796.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3126, pruned_loss=0.07752, over 4271386.65 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:15:36,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1536162.0, ans=0.125 2023-06-23 17:16:45,349 INFO [train.py:996] (3/4) Epoch 9, batch 12100, loss[loss=0.3126, simple_loss=0.3742, pruned_loss=0.1255, over 21372.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3194, pruned_loss=0.08349, over 4277122.35 frames. 
], batch size: 507, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:16:56,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536342.0, ans=0.1 2023-06-23 17:17:01,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.749e+02 9.796e+02 1.461e+03 3.096e+03, threshold=1.959e+03, percent-clipped=20.0 2023-06-23 17:17:34,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1536462.0, ans=0.0 2023-06-23 17:17:54,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1536522.0, ans=0.2 2023-06-23 17:17:54,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536522.0, ans=0.1 2023-06-23 17:18:19,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-23 17:18:20,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1536582.0, ans=0.2 2023-06-23 17:18:31,913 INFO [train.py:996] (3/4) Epoch 9, batch 12150, loss[loss=0.1741, simple_loss=0.2273, pruned_loss=0.06045, over 20848.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3209, pruned_loss=0.08202, over 4273155.10 frames. ], batch size: 613, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:18:32,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1536642.0, ans=0.125 2023-06-23 17:19:06,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536702.0, ans=0.1 2023-06-23 17:19:11,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1536702.0, ans=0.0 2023-06-23 17:19:45,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1536822.0, ans=0.125 2023-06-23 17:20:11,275 INFO [train.py:996] (3/4) Epoch 9, batch 12200, loss[loss=0.2407, simple_loss=0.2966, pruned_loss=0.09241, over 21544.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3179, pruned_loss=0.08165, over 4271538.30 frames. ], batch size: 391, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:20:17,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1536942.0, ans=0.0 2023-06-23 17:20:32,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.485e+02 6.911e+02 1.120e+03 1.509e+03 3.105e+03, threshold=2.240e+03, percent-clipped=12.0 2023-06-23 17:20:38,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1537002.0, ans=0.125 2023-06-23 17:21:25,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-23 17:21:37,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1537182.0, ans=0.0 2023-06-23 17:21:45,649 INFO [train.py:996] (3/4) Epoch 9, batch 12250, loss[loss=0.1413, simple_loss=0.2104, pruned_loss=0.03605, over 21065.00 frames. 
], tot_loss[loss=0.2331, simple_loss=0.3093, pruned_loss=0.07845, over 4266262.92 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:22:03,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537242.0, ans=0.1 2023-06-23 17:22:57,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1537422.0, ans=0.125 2023-06-23 17:23:08,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1537482.0, ans=0.125 2023-06-23 17:23:24,449 INFO [train.py:996] (3/4) Epoch 9, batch 12300, loss[loss=0.1722, simple_loss=0.2582, pruned_loss=0.04306, over 21389.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3026, pruned_loss=0.0723, over 4259143.14 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:23:28,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537542.0, ans=0.1 2023-06-23 17:23:42,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1537542.0, ans=0.125 2023-06-23 17:23:45,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.150e+02 7.519e+02 1.212e+03 3.138e+03, threshold=1.504e+03, percent-clipped=3.0 2023-06-23 17:24:14,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537662.0, ans=0.1 2023-06-23 17:24:50,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1537782.0, ans=0.125 2023-06-23 17:25:02,673 INFO [train.py:996] (3/4) Epoch 9, batch 12350, loss[loss=0.2622, simple_loss=0.341, pruned_loss=0.09171, over 21814.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3074, pruned_loss=0.07308, over 4268948.90 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:25:44,189 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:26:00,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-23 17:26:31,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1538082.0, ans=0.0 2023-06-23 17:26:40,781 INFO [train.py:996] (3/4) Epoch 9, batch 12400, loss[loss=0.2674, simple_loss=0.3232, pruned_loss=0.1058, over 21356.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3094, pruned_loss=0.07672, over 4279277.01 frames. 
], batch size: 159, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:27:01,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.603e+02 7.484e+02 1.004e+03 2.626e+03, threshold=1.497e+03, percent-clipped=10.0 2023-06-23 17:27:53,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1538322.0, ans=0.0 2023-06-23 17:28:04,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1538382.0, ans=0.125 2023-06-23 17:28:25,782 INFO [train.py:996] (3/4) Epoch 9, batch 12450, loss[loss=0.2695, simple_loss=0.3447, pruned_loss=0.09711, over 21754.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3118, pruned_loss=0.07931, over 4285301.04 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:28:26,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1538442.0, ans=0.125 2023-06-23 17:28:37,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1538442.0, ans=0.125 2023-06-23 17:29:19,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1538562.0, ans=0.0 2023-06-23 17:29:20,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538562.0, ans=0.1 2023-06-23 17:29:26,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-23 17:29:34,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1538622.0, ans=0.2 2023-06-23 17:30:10,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1538742.0, ans=0.125 2023-06-23 17:30:11,446 INFO [train.py:996] (3/4) Epoch 9, batch 12500, loss[loss=0.3183, simple_loss=0.3947, pruned_loss=0.1209, over 21496.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3235, pruned_loss=0.08379, over 4287140.34 frames. ], batch size: 471, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:30:19,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1538742.0, ans=0.125 2023-06-23 17:30:33,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.952e+02 7.744e+02 1.112e+03 2.842e+03, threshold=1.549e+03, percent-clipped=7.0 2023-06-23 17:30:51,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1538862.0, ans=0.2 2023-06-23 17:31:58,449 INFO [train.py:996] (3/4) Epoch 9, batch 12550, loss[loss=0.3104, simple_loss=0.3752, pruned_loss=0.1228, over 21752.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3288, pruned_loss=0.08655, over 4283332.76 frames. ], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:32:26,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=10.0 2023-06-23 17:32:41,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1539162.0, ans=0.1 2023-06-23 17:33:22,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1539282.0, ans=0.0 2023-06-23 17:33:34,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-23 17:33:45,021 INFO [train.py:996] (3/4) Epoch 9, batch 12600, loss[loss=0.1957, simple_loss=0.2893, pruned_loss=0.05099, over 21585.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3263, pruned_loss=0.08354, over 4279869.39 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:33:48,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1539342.0, ans=0.125 2023-06-23 17:33:51,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1539342.0, ans=0.125 2023-06-23 17:33:55,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1539342.0, ans=0.1 2023-06-23 17:34:01,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1539402.0, ans=0.125 2023-06-23 17:34:03,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.580e+02 5.911e+02 8.328e+02 1.277e+03 2.400e+03, threshold=1.666e+03, percent-clipped=14.0 2023-06-23 17:34:06,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1539402.0, ans=0.0 2023-06-23 17:34:11,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1539402.0, ans=15.0 2023-06-23 17:34:14,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1539402.0, ans=0.5 2023-06-23 17:34:39,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-23 17:35:25,015 INFO [train.py:996] (3/4) Epoch 9, batch 12650, loss[loss=0.1965, simple_loss=0.2592, pruned_loss=0.06688, over 20231.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.319, pruned_loss=0.07958, over 4278908.21 frames. ], batch size: 703, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:36:22,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1539822.0, ans=0.125 2023-06-23 17:36:34,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1539822.0, ans=0.125 2023-06-23 17:36:43,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-23 17:36:44,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1539882.0, ans=0.125 2023-06-23 17:37:05,720 INFO [train.py:996] (3/4) Epoch 9, batch 12700, loss[loss=0.2767, simple_loss=0.3394, pruned_loss=0.107, over 21422.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3186, pruned_loss=0.08203, over 4286957.42 frames. 
], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:37:12,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-23 17:37:23,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.729e+02 5.436e+02 7.219e+02 1.107e+03 2.161e+03, threshold=1.444e+03, percent-clipped=5.0 2023-06-23 17:37:31,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-23 17:37:55,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1540062.0, ans=0.125 2023-06-23 17:38:08,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1540122.0, ans=0.125 2023-06-23 17:38:09,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1540122.0, ans=0.0 2023-06-23 17:38:22,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1540122.0, ans=0.2 2023-06-23 17:38:46,105 INFO [train.py:996] (3/4) Epoch 9, batch 12750, loss[loss=0.2147, simple_loss=0.3043, pruned_loss=0.06259, over 21707.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3195, pruned_loss=0.08212, over 4291989.18 frames. ], batch size: 351, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:39:49,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1540422.0, ans=10.0 2023-06-23 17:39:49,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1540422.0, ans=0.0 2023-06-23 17:39:54,551 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:40:22,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1540482.0, ans=0.125 2023-06-23 17:40:26,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=22.5 2023-06-23 17:40:26,700 INFO [train.py:996] (3/4) Epoch 9, batch 12800, loss[loss=0.2495, simple_loss=0.3209, pruned_loss=0.08903, over 21784.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.317, pruned_loss=0.08191, over 4294201.30 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:40:51,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.701e+02 5.271e+02 6.331e+02 9.056e+02 1.664e+03, threshold=1.266e+03, percent-clipped=3.0 2023-06-23 17:41:37,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540722.0, ans=0.1 2023-06-23 17:41:41,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.96 vs. limit=6.0 2023-06-23 17:42:08,020 INFO [train.py:996] (3/4) Epoch 9, batch 12850, loss[loss=0.2186, simple_loss=0.3081, pruned_loss=0.06458, over 21777.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3196, pruned_loss=0.08285, over 4294526.50 frames. 
], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:42:22,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1540842.0, ans=0.2 2023-06-23 17:42:37,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-23 17:43:14,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1541022.0, ans=0.2 2023-06-23 17:43:42,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:43:48,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541082.0, ans=0.1 2023-06-23 17:43:54,683 INFO [train.py:996] (3/4) Epoch 9, batch 12900, loss[loss=0.2424, simple_loss=0.3357, pruned_loss=0.07454, over 21632.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3179, pruned_loss=0.08033, over 4276937.74 frames. ], batch size: 442, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:44:24,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.349e+02 7.787e+02 1.135e+03 3.186e+03, threshold=1.557e+03, percent-clipped=18.0 2023-06-23 17:44:44,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-23 17:44:46,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541262.0, ans=0.1 2023-06-23 17:45:10,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1541322.0, ans=0.125 2023-06-23 17:45:21,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1541382.0, ans=0.025 2023-06-23 17:45:32,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1541382.0, ans=0.125 2023-06-23 17:45:41,756 INFO [train.py:996] (3/4) Epoch 9, batch 12950, loss[loss=0.2579, simple_loss=0.337, pruned_loss=0.08941, over 21806.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3189, pruned_loss=0.07971, over 4269499.52 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:46:08,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1541502.0, ans=0.125 2023-06-23 17:46:29,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-23 17:47:02,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1541682.0, ans=0.125 2023-06-23 17:47:26,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1541742.0, ans=0.125 2023-06-23 17:47:27,845 INFO [train.py:996] (3/4) Epoch 9, batch 13000, loss[loss=0.1746, simple_loss=0.2431, pruned_loss=0.05307, over 21160.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3203, pruned_loss=0.08012, over 4265074.88 frames. 
], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:47:46,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.869e+02 8.632e+02 1.298e+03 2.714e+03, threshold=1.726e+03, percent-clipped=15.0 2023-06-23 17:48:40,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541982.0, ans=0.1 2023-06-23 17:48:49,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1541982.0, ans=0.125 2023-06-23 17:48:51,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1541982.0, ans=0.0 2023-06-23 17:49:01,579 INFO [train.py:996] (3/4) Epoch 9, batch 13050, loss[loss=0.2614, simple_loss=0.3235, pruned_loss=0.09963, over 21700.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3145, pruned_loss=0.07691, over 4268979.21 frames. ], batch size: 473, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:49:26,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1542102.0, ans=0.125 2023-06-23 17:50:03,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-23 17:50:17,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542222.0, ans=0.125 2023-06-23 17:50:45,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1542342.0, ans=0.125 2023-06-23 17:50:46,359 INFO [train.py:996] (3/4) Epoch 9, batch 13100, loss[loss=0.2422, simple_loss=0.3208, pruned_loss=0.08176, over 21334.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3156, pruned_loss=0.07728, over 4276747.03 frames. ], batch size: 159, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:50:47,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-23 17:50:50,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1542342.0, ans=0.0 2023-06-23 17:51:06,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 5.747e+02 7.827e+02 1.039e+03 1.771e+03, threshold=1.565e+03, percent-clipped=1.0 2023-06-23 17:51:58,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1542522.0, ans=0.09899494936611666 2023-06-23 17:52:27,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1542642.0, ans=0.0 2023-06-23 17:52:28,862 INFO [train.py:996] (3/4) Epoch 9, batch 13150, loss[loss=0.2126, simple_loss=0.2893, pruned_loss=0.06792, over 21869.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3182, pruned_loss=0.07961, over 4276741.10 frames. 
], batch size: 317, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:52:34,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542642.0, ans=0.125 2023-06-23 17:52:44,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1542702.0, ans=0.0 2023-06-23 17:54:10,295 INFO [train.py:996] (3/4) Epoch 9, batch 13200, loss[loss=0.2408, simple_loss=0.312, pruned_loss=0.0848, over 21311.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3158, pruned_loss=0.07943, over 4274970.63 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:54:23,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542942.0, ans=0.125 2023-06-23 17:54:33,687 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.479e+02 5.951e+02 7.570e+02 1.042e+03 3.191e+03, threshold=1.514e+03, percent-clipped=13.0 2023-06-23 17:54:40,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-23 17:54:54,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1543062.0, ans=22.5 2023-06-23 17:54:57,121 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:54:57,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1543062.0, ans=0.2 2023-06-23 17:55:05,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1543062.0, ans=0.2 2023-06-23 17:55:34,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1543182.0, ans=0.125 2023-06-23 17:55:49,999 INFO [train.py:996] (3/4) Epoch 9, batch 13250, loss[loss=0.2373, simple_loss=0.3137, pruned_loss=0.0805, over 21695.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.317, pruned_loss=0.08235, over 4268667.69 frames. ], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:56:03,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1543242.0, ans=0.0 2023-06-23 17:56:10,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1543302.0, ans=0.125 2023-06-23 17:56:25,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1543302.0, ans=0.0 2023-06-23 17:56:25,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. 
limit=10.0 2023-06-23 17:56:36,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1543362.0, ans=0.1 2023-06-23 17:56:53,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1543362.0, ans=0.125 2023-06-23 17:57:17,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1543482.0, ans=0.125 2023-06-23 17:57:36,331 INFO [train.py:996] (3/4) Epoch 9, batch 13300, loss[loss=0.2431, simple_loss=0.3239, pruned_loss=0.08112, over 21638.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3208, pruned_loss=0.08273, over 4269612.65 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:58:08,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.720e+02 5.402e+02 7.318e+02 1.029e+03 1.964e+03, threshold=1.464e+03, percent-clipped=5.0 2023-06-23 17:58:15,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1543602.0, ans=0.125 2023-06-23 17:58:35,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-23 17:58:45,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-23 17:59:18,315 INFO [train.py:996] (3/4) Epoch 9, batch 13350, loss[loss=0.2517, simple_loss=0.3806, pruned_loss=0.06142, over 19732.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3255, pruned_loss=0.08557, over 4274115.66 frames. ], batch size: 702, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:59:20,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1543842.0, ans=0.125 2023-06-23 18:00:52,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1544082.0, ans=0.0 2023-06-23 18:00:57,190 INFO [train.py:996] (3/4) Epoch 9, batch 13400, loss[loss=0.2116, simple_loss=0.2909, pruned_loss=0.06615, over 21888.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3266, pruned_loss=0.08776, over 4279469.75 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:01:34,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1544202.0, ans=0.125 2023-06-23 18:01:35,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 6.088e+02 8.910e+02 1.105e+03 2.382e+03, threshold=1.782e+03, percent-clipped=11.0 2023-06-23 18:02:01,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1544322.0, ans=0.125 2023-06-23 18:02:19,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1544322.0, ans=0.1 2023-06-23 18:02:19,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-23 18:02:34,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. 
limit=22.5 2023-06-23 18:02:50,785 INFO [train.py:996] (3/4) Epoch 9, batch 13450, loss[loss=0.2195, simple_loss=0.2899, pruned_loss=0.07461, over 21759.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3267, pruned_loss=0.08954, over 4277315.78 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:03:07,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1544442.0, ans=0.125 2023-06-23 18:03:33,919 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:03:41,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1544562.0, ans=0.2 2023-06-23 18:03:57,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1544622.0, ans=0.09899494936611666 2023-06-23 18:04:30,757 INFO [train.py:996] (3/4) Epoch 9, batch 13500, loss[loss=0.2434, simple_loss=0.3262, pruned_loss=0.08036, over 21344.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3195, pruned_loss=0.08726, over 4273746.25 frames. ], batch size: 549, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:04:34,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544742.0, ans=0.1 2023-06-23 18:04:54,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.791e+02 5.275e+02 7.519e+02 1.324e+03 2.778e+03, threshold=1.504e+03, percent-clipped=14.0 2023-06-23 18:05:12,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-23 18:05:20,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-23 18:06:08,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1544982.0, ans=0.1 2023-06-23 18:06:13,517 INFO [train.py:996] (3/4) Epoch 9, batch 13550, loss[loss=0.2366, simple_loss=0.3423, pruned_loss=0.06549, over 21675.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3204, pruned_loss=0.08509, over 4277658.37 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:06:28,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1545102.0, ans=0.0 2023-06-23 18:06:48,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-23 18:07:03,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1545162.0, ans=0.2 2023-06-23 18:07:23,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. 
limit=15.0 2023-06-23 18:07:29,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1545222.0, ans=0.125 2023-06-23 18:07:33,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1545282.0, ans=0.2 2023-06-23 18:07:45,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1545282.0, ans=0.125 2023-06-23 18:07:55,218 INFO [train.py:996] (3/4) Epoch 9, batch 13600, loss[loss=0.2079, simple_loss=0.282, pruned_loss=0.06691, over 21786.00 frames. ], tot_loss[loss=0.246, simple_loss=0.322, pruned_loss=0.08499, over 4277934.46 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:08:18,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.790e+02 6.429e+02 9.165e+02 1.553e+03 3.162e+03, threshold=1.833e+03, percent-clipped=25.0 2023-06-23 18:08:42,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-23 18:09:30,310 INFO [train.py:996] (3/4) Epoch 9, batch 13650, loss[loss=0.2072, simple_loss=0.2686, pruned_loss=0.0729, over 21636.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.316, pruned_loss=0.08166, over 4281848.43 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:09:30,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1545642.0, ans=0.125 2023-06-23 18:10:20,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1545762.0, ans=0.125 2023-06-23 18:11:10,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1545882.0, ans=0.04949747468305833 2023-06-23 18:11:13,769 INFO [train.py:996] (3/4) Epoch 9, batch 13700, loss[loss=0.2123, simple_loss=0.2699, pruned_loss=0.07729, over 21248.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3107, pruned_loss=0.08006, over 4271460.03 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:11:41,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.658e+02 5.677e+02 7.972e+02 1.070e+03 2.613e+03, threshold=1.594e+03, percent-clipped=4.0 2023-06-23 18:12:09,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1546062.0, ans=0.035 2023-06-23 18:12:25,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1546122.0, ans=0.1 2023-06-23 18:12:26,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1546122.0, ans=0.2 2023-06-23 18:12:33,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. limit=10.0 2023-06-23 18:12:51,915 INFO [train.py:996] (3/4) Epoch 9, batch 13750, loss[loss=0.2295, simple_loss=0.2905, pruned_loss=0.08432, over 21545.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3085, pruned_loss=0.07963, over 4264962.56 frames. 
], batch size: 195, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:13:10,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1546242.0, ans=0.125 2023-06-23 18:13:18,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1546302.0, ans=0.1 2023-06-23 18:13:44,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1546362.0, ans=0.0 2023-06-23 18:14:16,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-23 18:14:35,710 INFO [train.py:996] (3/4) Epoch 9, batch 13800, loss[loss=0.2683, simple_loss=0.3632, pruned_loss=0.08671, over 21779.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3148, pruned_loss=0.07969, over 4260286.13 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:14:36,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-23 18:15:16,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.824e+02 9.603e+02 1.417e+03 3.093e+03, threshold=1.921e+03, percent-clipped=19.0 2023-06-23 18:16:10,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-23 18:16:23,571 INFO [train.py:996] (3/4) Epoch 9, batch 13850, loss[loss=0.2852, simple_loss=0.371, pruned_loss=0.09975, over 21733.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3214, pruned_loss=0.08131, over 4265206.25 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:16:25,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1546842.0, ans=0.125 2023-06-23 18:16:47,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546842.0, ans=0.0 2023-06-23 18:17:27,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1547022.0, ans=0.0 2023-06-23 18:17:30,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-23 18:18:14,857 INFO [train.py:996] (3/4) Epoch 9, batch 13900, loss[loss=0.2475, simple_loss=0.3148, pruned_loss=0.09012, over 21829.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3248, pruned_loss=0.08469, over 4264266.88 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:18:41,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.964e+02 6.028e+02 8.450e+02 1.187e+03 2.483e+03, threshold=1.690e+03, percent-clipped=4.0 2023-06-23 18:19:30,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-23 18:19:49,962 INFO [train.py:996] (3/4) Epoch 9, batch 13950, loss[loss=0.2321, simple_loss=0.3104, pruned_loss=0.07691, over 21775.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3243, pruned_loss=0.08581, over 4270691.22 frames. 
], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:20:30,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1547562.0, ans=0.125 2023-06-23 18:20:58,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1547622.0, ans=0.125 2023-06-23 18:21:29,623 INFO [train.py:996] (3/4) Epoch 9, batch 14000, loss[loss=0.1826, simple_loss=0.2555, pruned_loss=0.05481, over 21489.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3208, pruned_loss=0.08322, over 4269176.19 frames. ], batch size: 195, lr: 3.29e-03, grad_scale: 32.0 2023-06-23 18:21:56,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 5.065e+02 8.726e+02 1.240e+03 2.803e+03, threshold=1.745e+03, percent-clipped=8.0 2023-06-23 18:22:43,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1547982.0, ans=0.0 2023-06-23 18:22:52,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-23 18:23:01,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1548042.0, ans=0.0 2023-06-23 18:23:02,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1548042.0, ans=15.0 2023-06-23 18:23:03,101 INFO [train.py:996] (3/4) Epoch 9, batch 14050, loss[loss=0.2005, simple_loss=0.2642, pruned_loss=0.06844, over 21563.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3144, pruned_loss=0.07902, over 4275689.34 frames. ], batch size: 230, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:23:33,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1548102.0, ans=0.125 2023-06-23 18:23:59,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1548222.0, ans=0.1 2023-06-23 18:24:00,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1548222.0, ans=0.125 2023-06-23 18:24:02,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1548222.0, ans=0.125 2023-06-23 18:24:28,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-23 18:24:42,555 INFO [train.py:996] (3/4) Epoch 9, batch 14100, loss[loss=0.2293, simple_loss=0.2979, pruned_loss=0.08034, over 21226.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3101, pruned_loss=0.07848, over 4260203.73 frames. 
], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:25:01,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1548402.0, ans=0.0 2023-06-23 18:25:03,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1548402.0, ans=0.07 2023-06-23 18:25:10,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 6.298e+02 9.143e+02 1.408e+03 2.663e+03, threshold=1.829e+03, percent-clipped=10.0 2023-06-23 18:25:20,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1548462.0, ans=0.2 2023-06-23 18:25:39,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548522.0, ans=0.1 2023-06-23 18:25:48,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-23 18:26:10,598 INFO [train.py:996] (3/4) Epoch 9, batch 14150, loss[loss=0.2448, simple_loss=0.3288, pruned_loss=0.08044, over 21769.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.315, pruned_loss=0.08041, over 4242365.36 frames. ], batch size: 112, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:26:28,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1548642.0, ans=0.125 2023-06-23 18:26:46,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0 2023-06-23 18:27:15,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1548822.0, ans=0.125 2023-06-23 18:27:35,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1548882.0, ans=0.0 2023-06-23 18:27:42,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1548882.0, ans=0.125 2023-06-23 18:27:48,776 INFO [train.py:996] (3/4) Epoch 9, batch 14200, loss[loss=0.222, simple_loss=0.2868, pruned_loss=0.0786, over 21461.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3138, pruned_loss=0.07946, over 4252201.08 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:28:22,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.420e+02 7.650e+02 1.190e+03 2.098e+03, threshold=1.530e+03, percent-clipped=4.0 2023-06-23 18:28:36,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1549062.0, ans=0.0 2023-06-23 18:28:48,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1549062.0, ans=0.125 2023-06-23 18:28:54,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1549122.0, ans=0.07 2023-06-23 18:29:27,836 INFO [train.py:996] (3/4) Epoch 9, batch 14250, loss[loss=0.1946, simple_loss=0.2652, pruned_loss=0.06198, over 21661.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3076, pruned_loss=0.07912, over 4256343.21 frames. 
], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:29:33,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-23 18:29:45,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1549242.0, ans=0.125 2023-06-23 18:29:53,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1549302.0, ans=0.125 2023-06-23 18:30:17,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-23 18:30:26,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549362.0, ans=0.1 2023-06-23 18:30:27,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-23 18:30:53,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549482.0, ans=0.1 2023-06-23 18:31:09,752 INFO [train.py:996] (3/4) Epoch 9, batch 14300, loss[loss=0.3452, simple_loss=0.4504, pruned_loss=0.12, over 21247.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3129, pruned_loss=0.07995, over 4262786.85 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:31:49,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 4.681e+02 6.476e+02 1.240e+03 3.295e+03, threshold=1.295e+03, percent-clipped=18.0 2023-06-23 18:32:04,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1549662.0, ans=0.125 2023-06-23 18:32:10,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1549662.0, ans=0.0 2023-06-23 18:32:33,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-23 18:32:40,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1549782.0, ans=0.125 2023-06-23 18:32:40,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1549782.0, ans=0.025 2023-06-23 18:32:49,917 INFO [train.py:996] (3/4) Epoch 9, batch 14350, loss[loss=0.2132, simple_loss=0.279, pruned_loss=0.07375, over 21396.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3204, pruned_loss=0.08131, over 4266561.90 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:33:16,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. 
limit=10.0 2023-06-23 18:33:28,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1549902.0, ans=0.0 2023-06-23 18:33:36,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549962.0, ans=0.1 2023-06-23 18:33:43,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1549962.0, ans=0.0 2023-06-23 18:33:56,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1550022.0, ans=0.5 2023-06-23 18:33:58,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1550022.0, ans=0.1 2023-06-23 18:34:05,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1550022.0, ans=0.2 2023-06-23 18:34:34,655 INFO [train.py:996] (3/4) Epoch 9, batch 14400, loss[loss=0.21, simple_loss=0.2735, pruned_loss=0.07326, over 21680.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3167, pruned_loss=0.08124, over 4263983.65 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:35:09,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.844e+02 6.439e+02 1.111e+03 2.671e+03, threshold=1.288e+03, percent-clipped=19.0 2023-06-23 18:35:14,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1550262.0, ans=0.125 2023-06-23 18:35:58,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1550382.0, ans=0.0 2023-06-23 18:36:07,879 INFO [train.py:996] (3/4) Epoch 9, batch 14450, loss[loss=0.2579, simple_loss=0.321, pruned_loss=0.09746, over 20716.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3103, pruned_loss=0.0813, over 4265830.26 frames. ], batch size: 609, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:36:21,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-23 18:36:36,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1550502.0, ans=0.5 2023-06-23 18:36:58,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-23 18:37:12,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-23 18:37:36,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1550682.0, ans=0.125 2023-06-23 18:37:43,324 INFO [train.py:996] (3/4) Epoch 9, batch 14500, loss[loss=0.225, simple_loss=0.2973, pruned_loss=0.07638, over 21801.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3073, pruned_loss=0.0811, over 4267412.87 frames. 
], batch size: 118, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:38:11,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1550742.0, ans=0.125 2023-06-23 18:38:23,476 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 5.208e+02 6.817e+02 8.713e+02 1.535e+03, threshold=1.363e+03, percent-clipped=1.0 2023-06-23 18:38:47,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1550922.0, ans=0.125 2023-06-23 18:38:47,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1550922.0, ans=0.1 2023-06-23 18:39:02,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1550922.0, ans=0.125 2023-06-23 18:39:05,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1550982.0, ans=0.125 2023-06-23 18:39:29,508 INFO [train.py:996] (3/4) Epoch 9, batch 14550, loss[loss=0.3061, simple_loss=0.377, pruned_loss=0.1176, over 21565.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3107, pruned_loss=0.08183, over 4260927.01 frames. ], batch size: 414, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:40:28,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1551222.0, ans=0.125 2023-06-23 18:40:36,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1551222.0, ans=0.125 2023-06-23 18:40:45,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1551282.0, ans=0.2 2023-06-23 18:41:10,880 INFO [train.py:996] (3/4) Epoch 9, batch 14600, loss[loss=0.2502, simple_loss=0.3318, pruned_loss=0.08427, over 21877.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3171, pruned_loss=0.08485, over 4265274.37 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:41:42,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.404e+02 6.083e+02 8.730e+02 1.243e+03 2.471e+03, threshold=1.746e+03, percent-clipped=17.0 2023-06-23 18:42:11,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-23 18:42:24,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1551582.0, ans=0.5 2023-06-23 18:42:24,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1551582.0, ans=0.125 2023-06-23 18:42:28,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1551582.0, ans=0.0 2023-06-23 18:42:39,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1551642.0, ans=0.0 2023-06-23 18:42:45,939 INFO [train.py:996] (3/4) Epoch 9, batch 14650, loss[loss=0.2499, simple_loss=0.3351, pruned_loss=0.08237, over 21680.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3199, pruned_loss=0.08429, over 4272829.06 frames. 
], batch size: 441, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:43:18,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1551702.0, ans=0.0 2023-06-23 18:43:52,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1551822.0, ans=0.125 2023-06-23 18:44:05,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1551882.0, ans=0.0 2023-06-23 18:44:21,191 INFO [train.py:996] (3/4) Epoch 9, batch 14700, loss[loss=0.2181, simple_loss=0.2932, pruned_loss=0.07152, over 21323.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3118, pruned_loss=0.07795, over 4275393.79 frames. ], batch size: 131, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:44:59,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 5.196e+02 7.542e+02 1.109e+03 2.941e+03, threshold=1.508e+03, percent-clipped=7.0 2023-06-23 18:44:59,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1552002.0, ans=0.125 2023-06-23 18:45:29,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552122.0, ans=0.1 2023-06-23 18:45:36,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1552122.0, ans=0.0 2023-06-23 18:45:40,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1552182.0, ans=0.07 2023-06-23 18:46:08,853 INFO [train.py:996] (3/4) Epoch 9, batch 14750, loss[loss=0.2999, simple_loss=0.3761, pruned_loss=0.1118, over 21597.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3159, pruned_loss=0.08093, over 4266054.46 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:46:43,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1552302.0, ans=0.125 2023-06-23 18:46:45,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1552362.0, ans=0.0 2023-06-23 18:46:59,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 18:47:44,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-23 18:47:45,547 INFO [train.py:996] (3/4) Epoch 9, batch 14800, loss[loss=0.2335, simple_loss=0.3067, pruned_loss=0.08009, over 21613.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3275, pruned_loss=0.08651, over 4270002.32 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:47:57,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-06-23 18:48:16,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.500e+02 8.733e+02 1.311e+03 2.731e+03, threshold=1.747e+03, percent-clipped=18.0 2023-06-23 18:48:35,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1552662.0, ans=0.2 2023-06-23 18:49:22,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1552782.0, ans=0.0 2023-06-23 18:49:32,207 INFO [train.py:996] (3/4) Epoch 9, batch 14850, loss[loss=0.2125, simple_loss=0.2773, pruned_loss=0.0739, over 21247.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3204, pruned_loss=0.08508, over 4256440.46 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:49:47,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1552902.0, ans=0.125 2023-06-23 18:49:52,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1552902.0, ans=0.125 2023-06-23 18:50:22,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-23 18:51:11,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1553082.0, ans=0.125 2023-06-23 18:51:15,489 INFO [train.py:996] (3/4) Epoch 9, batch 14900, loss[loss=0.2543, simple_loss=0.3347, pruned_loss=0.08692, over 21941.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3227, pruned_loss=0.08655, over 4260176.87 frames. ], batch size: 372, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:51:15,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1553142.0, ans=0.2 2023-06-23 18:51:15,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553142.0, ans=0.1 2023-06-23 18:51:27,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1553142.0, ans=0.0 2023-06-23 18:51:54,858 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.568e+02 9.380e+02 1.428e+03 3.360e+03, threshold=1.876e+03, percent-clipped=13.0 2023-06-23 18:52:04,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1553262.0, ans=0.125 2023-06-23 18:52:08,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-23 18:52:47,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-23 18:52:55,903 INFO [train.py:996] (3/4) Epoch 9, batch 14950, loss[loss=0.2326, simple_loss=0.3144, pruned_loss=0.07543, over 21213.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3227, pruned_loss=0.0854, over 4271621.99 frames. 
], batch size: 176, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:53:47,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553562.0, ans=0.1 2023-06-23 18:53:57,544 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:53:57,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1553562.0, ans=0.0 2023-06-23 18:54:01,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1553622.0, ans=0.0 2023-06-23 18:54:24,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-23 18:54:33,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553682.0, ans=0.1 2023-06-23 18:54:37,607 INFO [train.py:996] (3/4) Epoch 9, batch 15000, loss[loss=0.2375, simple_loss=0.3033, pruned_loss=0.08587, over 21485.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3266, pruned_loss=0.08775, over 4265750.37 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:54:37,607 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 18:54:50,885 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7323, 1.7531, 2.6216, 2.7337], device='cuda:3') 2023-06-23 18:54:58,181 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2574, simple_loss=0.352, pruned_loss=0.08137, over 1796401.00 frames. 2023-06-23 18:54:58,182 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 18:55:16,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1553742.0, ans=0.0 2023-06-23 18:55:32,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.829e+02 9.207e+02 1.364e+03 3.991e+03, threshold=1.841e+03, percent-clipped=17.0 2023-06-23 18:55:34,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1553862.0, ans=0.2 2023-06-23 18:55:59,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1553922.0, ans=0.0 2023-06-23 18:56:36,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1553982.0, ans=0.125 2023-06-23 18:56:39,781 INFO [train.py:996] (3/4) Epoch 9, batch 15050, loss[loss=0.1945, simple_loss=0.2544, pruned_loss=0.06732, over 21814.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3255, pruned_loss=0.0881, over 4260607.02 frames. 
], batch size: 102, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:56:40,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1554042.0, ans=0.0 2023-06-23 18:57:04,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1554102.0, ans=0.0 2023-06-23 18:57:38,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1554222.0, ans=0.125 2023-06-23 18:58:11,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1554282.0, ans=0.125 2023-06-23 18:58:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1554282.0, ans=0.125 2023-06-23 18:58:20,502 INFO [train.py:996] (3/4) Epoch 9, batch 15100, loss[loss=0.2851, simple_loss=0.3501, pruned_loss=0.11, over 21287.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3273, pruned_loss=0.08756, over 4267881.26 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:58:36,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1554342.0, ans=0.1 2023-06-23 18:58:43,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1554402.0, ans=0.0 2023-06-23 18:58:59,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.498e+02 7.540e+02 1.313e+03 2.793e+03, threshold=1.508e+03, percent-clipped=8.0 2023-06-23 18:59:07,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-23 18:59:17,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1554522.0, ans=0.125 2023-06-23 18:59:18,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-23 18:59:39,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1554522.0, ans=0.0 2023-06-23 18:59:54,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1554582.0, ans=0.0 2023-06-23 19:00:04,649 INFO [train.py:996] (3/4) Epoch 9, batch 15150, loss[loss=0.2292, simple_loss=0.282, pruned_loss=0.08819, over 21202.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3226, pruned_loss=0.087, over 4269126.50 frames. 
], batch size: 159, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 19:00:07,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554642.0, ans=0.1 2023-06-23 19:00:28,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1554702.0, ans=0.5 2023-06-23 19:01:23,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1554882.0, ans=0.0 2023-06-23 19:01:28,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1554882.0, ans=0.0 2023-06-23 19:01:45,734 INFO [train.py:996] (3/4) Epoch 9, batch 15200, loss[loss=0.2109, simple_loss=0.2634, pruned_loss=0.07922, over 22019.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3152, pruned_loss=0.08368, over 4268095.50 frames. ], batch size: 103, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:02:19,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 6.669e+02 9.281e+02 1.408e+03 4.015e+03, threshold=1.856e+03, percent-clipped=19.0 2023-06-23 19:02:24,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1555062.0, ans=0.125 2023-06-23 19:02:26,390 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:03:00,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-23 19:03:27,220 INFO [train.py:996] (3/4) Epoch 9, batch 15250, loss[loss=0.2319, simple_loss=0.3016, pruned_loss=0.08115, over 21874.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3111, pruned_loss=0.08286, over 4274847.59 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:03:43,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1555242.0, ans=0.125 2023-06-23 19:03:46,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1555302.0, ans=0.1 2023-06-23 19:05:06,543 INFO [train.py:996] (3/4) Epoch 9, batch 15300, loss[loss=0.2523, simple_loss=0.317, pruned_loss=0.09381, over 21759.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.315, pruned_loss=0.08639, over 4274626.32 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:05:28,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1555602.0, ans=0.125 2023-06-23 19:05:41,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 5.913e+02 8.241e+02 1.222e+03 2.288e+03, threshold=1.648e+03, percent-clipped=6.0 2023-06-23 19:05:56,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1555662.0, ans=0.125 2023-06-23 19:06:29,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-23 19:06:52,449 INFO [train.py:996] (3/4) Epoch 9, batch 15350, loss[loss=0.2438, simple_loss=0.3223, pruned_loss=0.08261, over 21350.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3185, pruned_loss=0.08831, over 4278818.02 frames. 
], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:06:52,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1555842.0, ans=0.0 2023-06-23 19:07:41,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-23 19:07:45,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1556022.0, ans=0.0 2023-06-23 19:07:51,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1556022.0, ans=0.0 2023-06-23 19:08:15,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1556082.0, ans=0.125 2023-06-23 19:08:26,418 INFO [train.py:996] (3/4) Epoch 9, batch 15400, loss[loss=0.2334, simple_loss=0.3096, pruned_loss=0.07856, over 21517.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3204, pruned_loss=0.08654, over 4274452.26 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:08:54,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556202.0, ans=0.1 2023-06-23 19:08:58,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 6.075e+02 7.840e+02 1.049e+03 1.941e+03, threshold=1.568e+03, percent-clipped=4.0 2023-06-23 19:09:25,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1556322.0, ans=0.0 2023-06-23 19:09:44,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1556382.0, ans=0.0 2023-06-23 19:10:04,868 INFO [train.py:996] (3/4) Epoch 9, batch 15450, loss[loss=0.2244, simple_loss=0.2835, pruned_loss=0.08269, over 21607.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3168, pruned_loss=0.08577, over 4282673.89 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:10:12,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1556442.0, ans=0.0 2023-06-23 19:10:22,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-23 19:10:23,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1556442.0, ans=0.2 2023-06-23 19:11:45,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1556742.0, ans=0.0 2023-06-23 19:11:46,664 INFO [train.py:996] (3/4) Epoch 9, batch 15500, loss[loss=0.2735, simple_loss=0.3475, pruned_loss=0.09976, over 21329.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3196, pruned_loss=0.08514, over 4272550.38 frames. 
], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:12:14,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1556802.0, ans=0.2 2023-06-23 19:12:25,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1556802.0, ans=0.09899494936611666 2023-06-23 19:12:26,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.144e+02 7.122e+02 1.016e+03 2.468e+03, threshold=1.424e+03, percent-clipped=4.0 2023-06-23 19:12:27,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1556802.0, ans=0.125 2023-06-23 19:13:31,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1556982.0, ans=0.125 2023-06-23 19:13:33,993 INFO [train.py:996] (3/4) Epoch 9, batch 15550, loss[loss=0.272, simple_loss=0.3564, pruned_loss=0.09382, over 21267.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3211, pruned_loss=0.08328, over 4268943.76 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:13:55,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1557102.0, ans=0.0 2023-06-23 19:14:05,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1557102.0, ans=0.125 2023-06-23 19:14:18,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1557162.0, ans=0.0 2023-06-23 19:14:31,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1557162.0, ans=0.125 2023-06-23 19:14:31,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1557162.0, ans=0.0 2023-06-23 19:15:06,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1557282.0, ans=0.5 2023-06-23 19:15:14,760 INFO [train.py:996] (3/4) Epoch 9, batch 15600, loss[loss=0.2167, simple_loss=0.3036, pruned_loss=0.06488, over 21601.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3135, pruned_loss=0.08121, over 4262908.56 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:15:28,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1557342.0, ans=0.2 2023-06-23 19:15:34,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1557402.0, ans=0.0 2023-06-23 19:15:49,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 5.285e+02 6.916e+02 1.084e+03 2.169e+03, threshold=1.383e+03, percent-clipped=9.0 2023-06-23 19:16:25,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1557522.0, ans=0.125 2023-06-23 19:16:28,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557522.0, ans=0.1 2023-06-23 19:16:55,695 INFO [train.py:996] (3/4) Epoch 9, batch 15650, loss[loss=0.24, simple_loss=0.3009, pruned_loss=0.08952, over 15886.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3119, pruned_loss=0.08052, over 4253437.61 frames. 
], batch size: 67, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:17:00,635 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:17:04,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1557642.0, ans=0.125 2023-06-23 19:17:09,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557642.0, ans=0.1 2023-06-23 19:17:13,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1557702.0, ans=0.1 2023-06-23 19:17:25,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1557702.0, ans=0.125 2023-06-23 19:17:51,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1557762.0, ans=0.1 2023-06-23 19:18:22,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1557882.0, ans=0.0 2023-06-23 19:18:25,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1557882.0, ans=0.09899494936611666 2023-06-23 19:18:31,491 INFO [train.py:996] (3/4) Epoch 9, batch 15700, loss[loss=0.2252, simple_loss=0.2847, pruned_loss=0.0828, over 21182.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3079, pruned_loss=0.07963, over 4251620.99 frames. ], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:18:33,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1557942.0, ans=0.0 2023-06-23 19:18:53,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-23 19:19:07,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.566e+02 5.534e+02 7.672e+02 1.120e+03 2.103e+03, threshold=1.534e+03, percent-clipped=13.0 2023-06-23 19:19:26,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1558062.0, ans=0.0 2023-06-23 19:19:38,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1558122.0, ans=15.0 2023-06-23 19:19:50,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1558122.0, ans=0.125 2023-06-23 19:20:11,353 INFO [train.py:996] (3/4) Epoch 9, batch 15750, loss[loss=0.2409, simple_loss=0.2884, pruned_loss=0.09674, over 16370.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3034, pruned_loss=0.07947, over 4254760.50 frames. ], batch size: 66, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:21:49,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1558542.0, ans=0.125 2023-06-23 19:21:50,623 INFO [train.py:996] (3/4) Epoch 9, batch 15800, loss[loss=0.2049, simple_loss=0.2575, pruned_loss=0.07614, over 20789.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2988, pruned_loss=0.07941, over 4253889.12 frames. 
], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:22:12,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1558602.0, ans=0.1 2023-06-23 19:22:26,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.57 vs. limit=15.0 2023-06-23 19:22:26,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.286e+02 6.867e+02 8.896e+02 1.872e+03, threshold=1.373e+03, percent-clipped=1.0 2023-06-23 19:23:07,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1558722.0, ans=0.2 2023-06-23 19:23:26,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1558782.0, ans=0.1 2023-06-23 19:23:29,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1558842.0, ans=0.0 2023-06-23 19:23:30,819 INFO [train.py:996] (3/4) Epoch 9, batch 15850, loss[loss=0.2423, simple_loss=0.3119, pruned_loss=0.08633, over 21237.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3018, pruned_loss=0.08143, over 4258127.96 frames. ], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:23:54,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-23 19:24:54,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1559082.0, ans=0.125 2023-06-23 19:24:55,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-06-23 19:25:10,612 INFO [train.py:996] (3/4) Epoch 9, batch 15900, loss[loss=0.25, simple_loss=0.3252, pruned_loss=0.08742, over 21827.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2988, pruned_loss=0.08122, over 4268447.20 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:25:23,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1559142.0, ans=0.125 2023-06-23 19:25:43,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1559202.0, ans=0.125 2023-06-23 19:25:46,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.755e+02 5.068e+02 6.374e+02 9.133e+02 1.940e+03, threshold=1.275e+03, percent-clipped=6.0 2023-06-23 19:26:15,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1559322.0, ans=0.125 2023-06-23 19:26:49,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559382.0, ans=0.1 2023-06-23 19:26:50,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1559442.0, ans=0.125 2023-06-23 19:26:51,965 INFO [train.py:996] (3/4) Epoch 9, batch 15950, loss[loss=0.1805, simple_loss=0.2338, pruned_loss=0.06358, over 20773.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.299, pruned_loss=0.07836, over 4270124.80 frames. 
], batch size: 609, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:27:01,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559442.0, ans=0.1 2023-06-23 19:27:01,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1559442.0, ans=0.125 2023-06-23 19:27:21,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1559502.0, ans=0.125 2023-06-23 19:28:08,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1559622.0, ans=0.04949747468305833 2023-06-23 19:28:12,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-23 19:28:32,472 INFO [train.py:996] (3/4) Epoch 9, batch 16000, loss[loss=0.2302, simple_loss=0.324, pruned_loss=0.06824, over 21810.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3005, pruned_loss=0.07648, over 4261754.46 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:29:07,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.766e+02 5.425e+02 7.645e+02 1.261e+03 2.910e+03, threshold=1.529e+03, percent-clipped=25.0 2023-06-23 19:29:37,975 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:30:09,390 INFO [train.py:996] (3/4) Epoch 9, batch 16050, loss[loss=0.3041, simple_loss=0.3986, pruned_loss=0.1048, over 21697.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3038, pruned_loss=0.0748, over 4261461.95 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:30:20,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1560042.0, ans=0.0 2023-06-23 19:31:41,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1560282.0, ans=0.09899494936611666 2023-06-23 19:31:45,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-23 19:31:45,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1560282.0, ans=0.025 2023-06-23 19:31:48,824 INFO [train.py:996] (3/4) Epoch 9, batch 16100, loss[loss=0.3059, simple_loss=0.3776, pruned_loss=0.1171, over 21584.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3106, pruned_loss=0.07728, over 4261660.11 frames. 
], batch size: 507, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:31:50,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1560342.0, ans=0.2 2023-06-23 19:32:23,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1560402.0, ans=0.0 2023-06-23 19:32:24,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.839e+02 1.039e+03 1.501e+03 2.959e+03, threshold=2.078e+03, percent-clipped=23.0 2023-06-23 19:32:26,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1560462.0, ans=0.2 2023-06-23 19:32:41,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1560522.0, ans=0.0 2023-06-23 19:33:00,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1560522.0, ans=0.04949747468305833 2023-06-23 19:33:29,426 INFO [train.py:996] (3/4) Epoch 9, batch 16150, loss[loss=0.2193, simple_loss=0.2874, pruned_loss=0.07562, over 20216.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3112, pruned_loss=0.07884, over 4260523.31 frames. ], batch size: 703, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:33:32,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1560642.0, ans=0.0 2023-06-23 19:34:06,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1560762.0, ans=0.125 2023-06-23 19:34:42,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1560822.0, ans=0.125 2023-06-23 19:34:53,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-23 19:35:08,319 INFO [train.py:996] (3/4) Epoch 9, batch 16200, loss[loss=0.2683, simple_loss=0.3443, pruned_loss=0.09618, over 21837.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3148, pruned_loss=0.0799, over 4269427.70 frames. 
], batch size: 247, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:35:39,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1561002.0, ans=0.125 2023-06-23 19:35:45,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.932e+02 6.489e+02 9.592e+02 1.250e+03 2.736e+03, threshold=1.918e+03, percent-clipped=7.0 2023-06-23 19:35:56,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1561062.0, ans=0.0 2023-06-23 19:35:57,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1561062.0, ans=0.2 2023-06-23 19:36:19,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1561122.0, ans=0.125 2023-06-23 19:36:37,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1561182.0, ans=0.125 2023-06-23 19:36:45,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1561182.0, ans=0.125 2023-06-23 19:36:45,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561182.0, ans=0.125 2023-06-23 19:36:56,498 INFO [train.py:996] (3/4) Epoch 9, batch 16250, loss[loss=0.2311, simple_loss=0.3025, pruned_loss=0.07987, over 20067.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3154, pruned_loss=0.08117, over 4267416.09 frames. ], batch size: 702, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:37:01,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1561242.0, ans=0.2 2023-06-23 19:38:11,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-23 19:38:19,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1561482.0, ans=0.5 2023-06-23 19:38:36,171 INFO [train.py:996] (3/4) Epoch 9, batch 16300, loss[loss=0.2392, simple_loss=0.3016, pruned_loss=0.08844, over 21345.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3083, pruned_loss=0.07745, over 4267141.48 frames. 
], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:38:42,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1561542.0, ans=0.125 2023-06-23 19:38:45,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1561542.0, ans=0.125 2023-06-23 19:38:51,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1561602.0, ans=0.125 2023-06-23 19:39:18,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.893e+02 6.873e+02 9.809e+02 2.054e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 19:39:42,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561722.0, ans=0.125 2023-06-23 19:40:03,357 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:40:09,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1561782.0, ans=0.125 2023-06-23 19:40:15,707 INFO [train.py:996] (3/4) Epoch 9, batch 16350, loss[loss=0.2617, simple_loss=0.3392, pruned_loss=0.0921, over 20761.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3066, pruned_loss=0.07726, over 4265482.28 frames. ], batch size: 611, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:40:19,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-23 19:40:51,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1561902.0, ans=0.07 2023-06-23 19:40:57,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1561962.0, ans=0.09899494936611666 2023-06-23 19:41:16,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1562022.0, ans=0.125 2023-06-23 19:41:54,439 INFO [train.py:996] (3/4) Epoch 9, batch 16400, loss[loss=0.2516, simple_loss=0.3138, pruned_loss=0.09467, over 21432.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3096, pruned_loss=0.07828, over 4270386.26 frames. 
], batch size: 131, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:42:07,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1562142.0, ans=0.2 2023-06-23 19:42:10,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1562202.0, ans=0.07 2023-06-23 19:42:20,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1562202.0, ans=0.0 2023-06-23 19:42:25,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1562202.0, ans=0.125 2023-06-23 19:42:37,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 4.704e+02 7.326e+02 1.027e+03 2.811e+03, threshold=1.465e+03, percent-clipped=10.0 2023-06-23 19:42:38,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1562262.0, ans=0.0 2023-06-23 19:43:23,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1562382.0, ans=0.125 2023-06-23 19:43:34,518 INFO [train.py:996] (3/4) Epoch 9, batch 16450, loss[loss=0.2478, simple_loss=0.3127, pruned_loss=0.09145, over 21873.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3109, pruned_loss=0.07973, over 4278990.30 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:43:34,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562442.0, ans=0.1 2023-06-23 19:45:15,076 INFO [train.py:996] (3/4) Epoch 9, batch 16500, loss[loss=0.1836, simple_loss=0.2364, pruned_loss=0.06538, over 21878.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3072, pruned_loss=0.07948, over 4276016.28 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:45:20,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1562742.0, ans=0.2 2023-06-23 19:45:39,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1562802.0, ans=0.0 2023-06-23 19:46:03,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.770e+02 9.881e+02 1.345e+03 3.319e+03, threshold=1.976e+03, percent-clipped=17.0 2023-06-23 19:46:05,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1562862.0, ans=0.5 2023-06-23 19:46:07,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-23 19:46:15,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1562862.0, ans=0.125 2023-06-23 19:46:17,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1562922.0, ans=0.0 2023-06-23 19:46:17,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.10 vs. 
limit=6.0 2023-06-23 19:46:30,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1562922.0, ans=0.5 2023-06-23 19:46:56,208 INFO [train.py:996] (3/4) Epoch 9, batch 16550, loss[loss=0.2716, simple_loss=0.3565, pruned_loss=0.09333, over 21552.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3063, pruned_loss=0.07801, over 4270310.27 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:47:40,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1563102.0, ans=0.0 2023-06-23 19:47:51,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=10.0 2023-06-23 19:48:11,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-23 19:48:22,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-23 19:48:47,687 INFO [train.py:996] (3/4) Epoch 9, batch 16600, loss[loss=0.2247, simple_loss=0.3283, pruned_loss=0.06062, over 20872.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3148, pruned_loss=0.08093, over 4270505.93 frames. ], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:49:15,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1563402.0, ans=0.0 2023-06-23 19:49:26,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.870e+02 6.782e+02 8.603e+02 1.169e+03 2.865e+03, threshold=1.721e+03, percent-clipped=6.0 2023-06-23 19:50:28,624 INFO [train.py:996] (3/4) Epoch 9, batch 16650, loss[loss=0.3341, simple_loss=0.3928, pruned_loss=0.1377, over 21291.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3239, pruned_loss=0.08278, over 4267727.48 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:50:32,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1563642.0, ans=0.2 2023-06-23 19:50:41,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-23 19:50:47,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-23 19:51:03,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1563702.0, ans=0.125 2023-06-23 19:52:08,214 INFO [train.py:996] (3/4) Epoch 9, batch 16700, loss[loss=0.2778, simple_loss=0.3565, pruned_loss=0.0996, over 20692.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3259, pruned_loss=0.08468, over 4264771.80 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:52:40,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=15.0 2023-06-23 19:52:57,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.483e+02 5.925e+02 7.840e+02 1.052e+03 2.518e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 19:53:09,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564062.0, ans=0.1 2023-06-23 19:53:58,344 INFO [train.py:996] (3/4) Epoch 9, batch 16750, loss[loss=0.2727, simple_loss=0.3437, pruned_loss=0.1008, over 21596.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3278, pruned_loss=0.08706, over 4268940.69 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:54:06,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-23 19:54:37,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1564302.0, ans=0.125 2023-06-23 19:55:14,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-23 19:55:43,492 INFO [train.py:996] (3/4) Epoch 9, batch 16800, loss[loss=0.2256, simple_loss=0.2906, pruned_loss=0.08034, over 21445.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3336, pruned_loss=0.0866, over 4264804.42 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:56:08,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1564602.0, ans=0.2 2023-06-23 19:56:08,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1564602.0, ans=0.2 2023-06-23 19:56:11,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1564602.0, ans=0.2 2023-06-23 19:56:26,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.449e+02 1.121e+03 2.457e+03, threshold=1.690e+03, percent-clipped=14.0 2023-06-23 19:57:07,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1564782.0, ans=0.0 2023-06-23 19:57:23,018 INFO [train.py:996] (3/4) Epoch 9, batch 16850, loss[loss=0.2247, simple_loss=0.2884, pruned_loss=0.0805, over 21554.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3288, pruned_loss=0.08675, over 4265211.52 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:57:41,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1564842.0, ans=0.125 2023-06-23 19:58:10,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564962.0, ans=0.1 2023-06-23 19:58:52,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1565082.0, ans=0.2 2023-06-23 19:59:05,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-23 19:59:07,892 INFO [train.py:996] (3/4) Epoch 9, batch 16900, loss[loss=0.2582, simple_loss=0.3099, pruned_loss=0.1032, over 20258.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3234, pruned_loss=0.08509, over 4268399.44 frames. 
2023-06-23 19:59:15,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1565142.0, ans=0.125
2023-06-23 19:59:42,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1565262.0, ans=0.0
2023-06-23 19:59:45,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1565262.0, ans=0.125
2023-06-23 19:59:46,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 4.767e+02 6.726e+02 1.260e+03 2.714e+03, threshold=1.345e+03, percent-clipped=10.0
2023-06-23 20:00:04,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1565322.0, ans=0.0
2023-06-23 20:00:45,063 INFO [train.py:996] (3/4) Epoch 9, batch 16950, loss[loss=0.2003, simple_loss=0.2756, pruned_loss=0.0625, over 21811.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3172, pruned_loss=0.08432, over 4276084.36 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:00:47,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1565442.0, ans=0.0
2023-06-23 20:01:05,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0
2023-06-23 20:01:33,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1565562.0, ans=0.2
2023-06-23 20:01:44,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1565622.0, ans=0.0
2023-06-23 20:01:50,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1565622.0, ans=0.2
2023-06-23 20:01:56,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1565682.0, ans=0.125
2023-06-23 20:02:06,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1565682.0, ans=0.0
2023-06-23 20:02:23,827 INFO [train.py:996] (3/4) Epoch 9, batch 17000, loss[loss=0.2185, simple_loss=0.2899, pruned_loss=0.07358, over 21673.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3137, pruned_loss=0.08435, over 4279389.60 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:02:30,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1565742.0, ans=0.125
2023-06-23 20:02:34,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1565742.0, ans=10.0
2023-06-23 20:02:59,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=22.5
2023-06-23 20:03:04,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.656e+02 4.823e+02 5.834e+02 7.335e+02 1.533e+03, threshold=1.167e+03, percent-clipped=2.0
2023-06-23 20:03:08,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565862.0, ans=0.1
2023-06-23 20:03:17,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0
2023-06-23 20:03:47,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1565982.0, ans=0.125
2023-06-23 20:04:05,383 INFO [train.py:996] (3/4) Epoch 9, batch 17050, loss[loss=0.2949, simple_loss=0.3727, pruned_loss=0.1085, over 21409.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.32, pruned_loss=0.08617, over 4288771.96 frames. ], batch size: 548, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:04:12,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0
2023-06-23 20:04:17,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0
2023-06-23 20:04:35,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1566102.0, ans=0.0
2023-06-23 20:05:12,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1566222.0, ans=0.125
2023-06-23 20:05:44,738 INFO [train.py:996] (3/4) Epoch 9, batch 17100, loss[loss=0.2309, simple_loss=0.2977, pruned_loss=0.08201, over 21924.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3183, pruned_loss=0.08692, over 4295747.54 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:06:24,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 5.247e+02 7.699e+02 1.064e+03 2.324e+03, threshold=1.540e+03, percent-clipped=17.0
2023-06-23 20:06:31,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1566462.0, ans=0.0
2023-06-23 20:06:41,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0
2023-06-23 20:07:24,712 INFO [train.py:996] (3/4) Epoch 9, batch 17150, loss[loss=0.1912, simple_loss=0.2606, pruned_loss=0.06092, over 21466.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3141, pruned_loss=0.08572, over 4298117.21 frames. ], batch size: 194, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:07:26,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1566642.0, ans=0.0
2023-06-23 20:08:19,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1566762.0, ans=0.95
2023-06-23 20:08:36,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566822.0, ans=0.1
2023-06-23 20:09:05,725 INFO [train.py:996] (3/4) Epoch 9, batch 17200, loss[loss=0.2485, simple_loss=0.3214, pruned_loss=0.08781, over 21755.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3162, pruned_loss=0.0863, over 4292769.37 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 32.0
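Each scaling.py:962 record compares a measured whitening metric for a module's output against a scheduled limit and is only emitted when the metric is noteworthy relative to that limit. One metric with the right behaviour, and plausibly close to what is logged (an assumption, not confirmed by the log itself), is d * tr(C^2) / tr(C)^2 over the channel covariance C of d channels: it equals 1.0 when the covariance is proportional to the identity (perfectly "white" features) and grows as the eigenvalue spectrum becomes lopsided.

```python
# Sketch of a whitening metric: d * tr(C^2) / tr(C)^2 over the per-group
# channel covariance C. Equals 1.0 iff C is proportional to identity.
# Assumed to behave like the metric in the Whitening records; not exact code.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into groups as in the log
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)           # center per group
    cov = torch.matmul(x.transpose(1, 2), x) / n  # (num_groups, d, d)
    d = cov.shape[-1]
    tr_c = cov.diagonal(dim1=-2, dim2=-1).sum(-1)
    tr_c2 = (cov * cov).sum((-2, -1))             # tr(C @ C) for symmetric C
    return (d * tr_c2 / tr_c.clamp(min=1e-20) ** 2).mean().item()

x = torch.randn(1000, 256)
print(whitening_metric(x, num_groups=1))  # close to 1 for white Gaussian input
```

Under this reading, a record like "metric=11.59 vs. limit=15.0" says the module's activations are noticeably non-white but still inside the scheduled tolerance.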
2023-06-23 20:09:28,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1567002.0, ans=0.125
2023-06-23 20:09:43,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1567002.0, ans=0.125
2023-06-23 20:09:52,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1567062.0, ans=0.0
2023-06-23 20:09:53,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 8.105e+02 1.085e+03 1.487e+03 3.292e+03, threshold=2.169e+03, percent-clipped=22.0
2023-06-23 20:10:25,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1567122.0, ans=0.125
2023-06-23 20:10:42,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1567182.0, ans=0.125
2023-06-23 20:10:52,863 INFO [train.py:996] (3/4) Epoch 9, batch 17250, loss[loss=0.2712, simple_loss=0.3649, pruned_loss=0.08874, over 17335.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3185, pruned_loss=0.08732, over 4281443.69 frames. ], batch size: 60, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:11:13,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1567242.0, ans=0.125
2023-06-23 20:11:16,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1567302.0, ans=0.0
2023-06-23 20:12:03,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1567422.0, ans=0.125
2023-06-23 20:12:35,685 INFO [train.py:996] (3/4) Epoch 9, batch 17300, loss[loss=0.3073, simple_loss=0.3724, pruned_loss=0.1211, over 21433.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3274, pruned_loss=0.09105, over 4284751.12 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:12:36,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1567542.0, ans=0.0
2023-06-23 20:12:55,884 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:12:55,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1567602.0, ans=0.2
2023-06-23 20:13:10,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1567602.0, ans=0.125
2023-06-23 20:13:28,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 5.597e+02 7.560e+02 1.039e+03 2.489e+03, threshold=1.512e+03, percent-clipped=1.0
2023-06-23 20:14:04,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1567782.0, ans=0.1
2023-06-23 20:14:07,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.21 vs. limit=10.0
2023-06-23 20:14:22,058 INFO [train.py:996] (3/4) Epoch 9, batch 17350, loss[loss=0.2575, simple_loss=0.3501, pruned_loss=0.08243, over 21500.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3268, pruned_loss=0.09056, over 4278554.80 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:14:47,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1567902.0, ans=0.125
2023-06-23 20:15:30,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1568022.0, ans=0.125
2023-06-23 20:16:07,990 INFO [train.py:996] (3/4) Epoch 9, batch 17400, loss[loss=0.1439, simple_loss=0.1932, pruned_loss=0.04727, over 16692.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3228, pruned_loss=0.08682, over 4269497.37 frames. ], batch size: 60, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:16:17,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568142.0, ans=0.1
2023-06-23 20:16:54,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1568262.0, ans=0.125
2023-06-23 20:16:55,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 5.667e+02 9.192e+02 1.513e+03 3.310e+03, threshold=1.838e+03, percent-clipped=24.0
2023-06-23 20:17:14,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1568322.0, ans=0.2
2023-06-23 20:17:33,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568382.0, ans=0.125
2023-06-23 20:17:43,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1568382.0, ans=0.125
2023-06-23 20:17:49,315 INFO [train.py:996] (3/4) Epoch 9, batch 17450, loss[loss=0.1863, simple_loss=0.2853, pruned_loss=0.04361, over 21764.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3188, pruned_loss=0.0837, over 4271026.81 frames. ], batch size: 351, lr: 3.27e-03, grad_scale: 8.0
2023-06-23 20:17:59,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1568442.0, ans=0.0
2023-06-23 20:18:11,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568502.0, ans=0.1
2023-06-23 20:18:32,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1568562.0, ans=0.0
2023-06-23 20:18:58,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1568622.0, ans=0.0
2023-06-23 20:19:27,726 INFO [train.py:996] (3/4) Epoch 9, batch 17500, loss[loss=0.251, simple_loss=0.3256, pruned_loss=0.08826, over 21893.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3148, pruned_loss=0.08103, over 4271382.62 frames. ], batch size: 118, lr: 3.27e-03, grad_scale: 8.0
2023-06-23 20:19:48,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1568802.0, ans=0.125
2023-06-23 20:20:07,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1568862.0, ans=0.125
2023-06-23 20:20:19,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.530e+02 5.781e+02 8.305e+02 1.162e+03 2.249e+03, threshold=1.661e+03, percent-clipped=4.0
2023-06-23 20:20:26,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0
2023-06-23 20:21:00,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1568982.0, ans=0.125
2023-06-23 20:21:04,895 INFO [train.py:996] (3/4) Epoch 9, batch 17550, loss[loss=0.2285, simple_loss=0.3137, pruned_loss=0.07162, over 21803.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3158, pruned_loss=0.08023, over 4275060.33 frames. ], batch size: 124, lr: 3.27e-03, grad_scale: 8.0
2023-06-23 20:21:57,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1569162.0, ans=0.0
2023-06-23 20:22:00,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=22.5
2023-06-23 20:22:08,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5
2023-06-23 20:22:21,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1569222.0, ans=0.015
2023-06-23 20:22:29,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1569282.0, ans=0.0
2023-06-23 20:22:30,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1569282.0, ans=0.2
2023-06-23 20:22:43,258 INFO [train.py:996] (3/4) Epoch 9, batch 17600, loss[loss=0.2513, simple_loss=0.3246, pruned_loss=0.08903, over 21507.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3173, pruned_loss=0.08059, over 4279852.94 frames. ], batch size: 194, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:23:35,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1569462.0, ans=0.0
2023-06-23 20:23:38,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.641e+02 7.220e+02 1.088e+03 2.051e+03, threshold=1.444e+03, percent-clipped=1.0
2023-06-23 20:23:53,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1569522.0, ans=0.09899494936611666
2023-06-23 20:23:57,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1569522.0, ans=0.125
2023-06-23 20:24:25,244 INFO [train.py:996] (3/4) Epoch 9, batch 17650, loss[loss=0.264, simple_loss=0.3406, pruned_loss=0.09367, over 20813.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3153, pruned_loss=0.08068, over 4260626.15 frames. ], batch size: 609, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:24:40,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5
2023-06-23 20:25:34,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5
2023-06-23 20:26:06,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1569882.0, ans=0.2
2023-06-23 20:26:08,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1569942.0, ans=0.125
2023-06-23 20:26:14,533 INFO [train.py:996] (3/4) Epoch 9, batch 17700, loss[loss=0.2661, simple_loss=0.3436, pruned_loss=0.09432, over 21765.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3122, pruned_loss=0.07844, over 4264597.56 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:26:26,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. limit=6.0
2023-06-23 20:26:35,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1570002.0, ans=0.0
2023-06-23 20:27:01,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1570062.0, ans=0.125
2023-06-23 20:27:02,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 6.405e+02 1.162e+03 1.769e+03 3.070e+03, threshold=2.325e+03, percent-clipped=36.0
2023-06-23 20:27:51,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1570182.0, ans=0.125
2023-06-23 20:27:54,402 INFO [train.py:996] (3/4) Epoch 9, batch 17750, loss[loss=0.2417, simple_loss=0.3254, pruned_loss=0.07897, over 21763.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3181, pruned_loss=0.08101, over 4261910.27 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:28:11,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1570242.0, ans=0.125
2023-06-23 20:28:11,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0
2023-06-23 20:28:14,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1570302.0, ans=0.1
2023-06-23 20:28:53,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1570422.0, ans=0.05
2023-06-23 20:29:40,663 INFO [train.py:996] (3/4) Epoch 9, batch 17800, loss[loss=0.2022, simple_loss=0.2917, pruned_loss=0.0564, over 19826.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3164, pruned_loss=0.07921, over 4264212.99 frames. ], batch size: 702, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:29:44,256 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:29:47,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1570542.0, ans=0.0
2023-06-23 20:30:14,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1570602.0, ans=0.125
2023-06-23 20:30:24,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 7.073e+02 9.847e+02 1.535e+03 2.589e+03, threshold=1.969e+03, percent-clipped=1.0
2023-06-23 20:30:28,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570662.0, ans=0.1
2023-06-23 20:31:15,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1570842.0, ans=0.0
2023-06-23 20:31:17,166 INFO [train.py:996] (3/4) Epoch 9, batch 17850, loss[loss=0.2397, simple_loss=0.3145, pruned_loss=0.08243, over 21435.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3182, pruned_loss=0.08028, over 4270715.60 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:32:05,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0
2023-06-23 20:32:56,149 INFO [train.py:996] (3/4) Epoch 9, batch 17900, loss[loss=0.2278, simple_loss=0.3102, pruned_loss=0.07274, over 21261.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3226, pruned_loss=0.08285, over 4275003.41 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:33:38,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1571262.0, ans=0.0
2023-06-23 20:33:49,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 5.757e+02 7.357e+02 9.993e+02 2.226e+03, threshold=1.471e+03, percent-clipped=2.0
2023-06-23 20:34:41,338 INFO [train.py:996] (3/4) Epoch 9, batch 17950, loss[loss=0.2435, simple_loss=0.3347, pruned_loss=0.07611, over 21639.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3224, pruned_loss=0.07994, over 4267624.53 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 8.0
2023-06-23 20:35:31,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1571562.0, ans=0.0
2023-06-23 20:36:09,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1571682.0, ans=0.1
2023-06-23 20:36:18,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.95 vs. limit=6.0
2023-06-23 20:36:19,115 INFO [train.py:996] (3/4) Epoch 9, batch 18000, loss[loss=0.2154, simple_loss=0.2825, pruned_loss=0.07419, over 21539.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3155, pruned_loss=0.07907, over 4262710.28 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:36:19,116 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 20:36:36,002 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2626, simple_loss=0.3575, pruned_loss=0.08385, over 1796401.00 frames.
2023-06-23 20:36:36,003 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
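The train.py:1019/1028/1029 records above show the periodic validation pass that interrupts training, followed by the peak-GPU-memory counter. A minimal sketch of that bookkeeping follows; compute_loss and valid_dl are hypothetical stand-ins, while torch.cuda.max_memory_allocated() is the real PyTorch API behind the memory line.

```python
# Minimal sketch of the validation bookkeeping suggested by the
# train.py:1019/1028/1029 records. compute_loss / valid_dl are hypothetical
# stand-ins; torch.cuda.max_memory_allocated() is the real PyTorch call.
import logging
import torch

def compute_validation_loss(model, valid_dl, device):
    logging.info("Computing validation loss")
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch, device)  # hypothetical
            tot_loss += loss.item() * num_frames   # frames-weighted sum
            tot_frames += num_frames
    model.train()
    logging.info(
        f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames."
    )
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {mb}MB")
```

The constant 1796401.00-frame figure in every validation record indicates the same dev set is scored each time, so successive validation losses are directly comparable.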
2023-06-23 20:36:45,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0
2023-06-23 20:36:45,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1571742.0, ans=0.0
2023-06-23 20:36:52,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1571742.0, ans=0.125
2023-06-23 20:37:27,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1571862.0, ans=0.125
2023-06-23 20:37:28,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1571862.0, ans=0.0
2023-06-23 20:37:29,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.324e+02 4.978e+02 7.262e+02 1.032e+03 1.973e+03, threshold=1.452e+03, percent-clipped=7.0
2023-06-23 20:38:02,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1571982.0, ans=0.125
2023-06-23 20:38:20,217 INFO [train.py:996] (3/4) Epoch 9, batch 18050, loss[loss=0.2396, simple_loss=0.3067, pruned_loss=0.08626, over 21622.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3095, pruned_loss=0.07806, over 4266368.75 frames. ], batch size: 415, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:38:29,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1572042.0, ans=0.125
2023-06-23 20:38:54,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1572102.0, ans=0.125
2023-06-23 20:39:02,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1572162.0, ans=0.0
2023-06-23 20:39:54,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1572282.0, ans=0.0
2023-06-23 20:40:00,643 INFO [train.py:996] (3/4) Epoch 9, batch 18100, loss[loss=0.2507, simple_loss=0.3162, pruned_loss=0.09262, over 21775.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3146, pruned_loss=0.08104, over 4266354.88 frames. ], batch size: 102, lr: 3.27e-03, grad_scale: 16.0
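Each train.py:996 record pairs the current batch's loss[...] with tot_loss[...], a frames-weighted aggregate over recent batches (hence the slowly drifting multi-million frame counts). A sketch of that aggregation is below; it is an assumed reconstruction of the logging behaviour, not icefall's actual MetricsTracker.

```python
# Sketch of the frames-weighted loss aggregation behind the
# "tot_loss[..., over N frames. ]" fields. Assumed reconstruction, not
# icefall's exact MetricsTracker.
class LossTracker:
    def __init__(self, decay: float = 0.999):
        self.decay = decay        # older batches fade out slowly
        self.frames = 0.0
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, num_frames: float, **losses: float):
        self.frames = self.frames * self.decay + num_frames
        for k in self.sums:
            self.sums[k] = self.sums[k] * self.decay + losses[k] * num_frames

    def __str__(self) -> str:
        avg = {k: v / self.frames for k, v in self.sums.items()}
        return (f"tot_loss[loss={avg['loss']:.4g}, "
                f"simple_loss={avg['simple_loss']:.4g}, "
                f"pruned_loss={avg['pruned_loss']:.4g}, "
                f"over {self.frames:.2f} frames. ]")

t = LossTracker()
t.update(21552.0, loss=0.2716, simple_loss=0.3565, pruned_loss=0.09333)
print(t)
```

Because tot_loss is heavily smoothed, it moves far less between records than the per-batch loss[...] values, which is visible throughout this section.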
2023-06-23 20:40:25,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1572402.0, ans=0.0
2023-06-23 20:40:49,463 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:40:55,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.547e+02 7.703e+02 1.178e+03 2.128e+03, threshold=1.541e+03, percent-clipped=13.0
2023-06-23 20:41:00,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1572462.0, ans=0.125
2023-06-23 20:41:26,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1572582.0, ans=0.04949747468305833
2023-06-23 20:41:36,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1572582.0, ans=0.125
2023-06-23 20:41:38,948 INFO [train.py:996] (3/4) Epoch 9, batch 18150, loss[loss=0.2108, simple_loss=0.2817, pruned_loss=0.06994, over 21508.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3169, pruned_loss=0.08076, over 4273063.06 frames. ], batch size: 195, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:41:52,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0
2023-06-23 20:42:04,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1572702.0, ans=0.05
2023-06-23 20:43:15,632 INFO [train.py:996] (3/4) Epoch 9, batch 18200, loss[loss=0.2124, simple_loss=0.2743, pruned_loss=0.07531, over 21582.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3106, pruned_loss=0.08056, over 4264475.69 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:44:03,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.079e+02 5.748e+02 7.422e+02 1.064e+03 2.381e+03, threshold=1.484e+03, percent-clipped=9.0
2023-06-23 20:44:34,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1573182.0, ans=0.125
2023-06-23 20:44:51,680 INFO [train.py:996] (3/4) Epoch 9, batch 18250, loss[loss=0.2263, simple_loss=0.2935, pruned_loss=0.07949, over 21838.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3029, pruned_loss=0.07812, over 4256247.63 frames. ], batch size: 351, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:44:56,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1573242.0, ans=0.125
2023-06-23 20:45:15,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1573302.0, ans=0.0
2023-06-23 20:45:16,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1573302.0, ans=0.2
2023-06-23 20:45:50,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1573422.0, ans=0.125
2023-06-23 20:45:58,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1573422.0, ans=0.0
2023-06-23 20:46:30,592 INFO [train.py:996] (3/4) Epoch 9, batch 18300, loss[loss=0.1807, simple_loss=0.2561, pruned_loss=0.05269, over 21298.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3014, pruned_loss=0.07779, over 4258377.33 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:46:30,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1573542.0, ans=0.125
2023-06-23 20:46:40,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0
2023-06-23 20:46:53,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0
2023-06-23 20:47:14,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 5.362e+02 7.234e+02 9.653e+02 2.224e+03, threshold=1.447e+03, percent-clipped=7.0
2023-06-23 20:47:24,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1573662.0, ans=0.0
2023-06-23 20:47:58,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1573782.0, ans=0.0
2023-06-23 20:48:09,450 INFO [train.py:996] (3/4) Epoch 9, batch 18350, loss[loss=0.1824, simple_loss=0.2592, pruned_loss=0.05278, over 15889.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.307, pruned_loss=0.07725, over 4257142.00 frames. ], batch size: 61, lr: 3.27e-03, grad_scale: 16.0
2023-06-23 20:48:39,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573902.0, ans=0.125
2023-06-23 20:49:10,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1574022.0, ans=0.2
2023-06-23 20:49:37,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574082.0, ans=0.1
2023-06-23 20:49:48,126 INFO [train.py:996] (3/4) Epoch 9, batch 18400, loss[loss=0.2294, simple_loss=0.3042, pruned_loss=0.07734, over 21196.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3043, pruned_loss=0.07641, over 4252788.54 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 20:50:02,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1574202.0, ans=0.0
2023-06-23 20:50:09,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574202.0, ans=0.1
2023-06-23 20:50:37,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1574262.0, ans=0.125
2023-06-23 20:50:38,667 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.618e+02 5.691e+02 8.490e+02 1.304e+03 3.377e+03, threshold=1.698e+03, percent-clipped=15.0
2023-06-23 20:51:05,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0
2023-06-23 20:51:21,365 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:51:24,262 INFO [train.py:996] (3/4) Epoch 9, batch 18450, loss[loss=0.2985, simple_loss=0.4361, pruned_loss=0.08044, over 19717.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3042, pruned_loss=0.07278, over 4253090.24 frames. ], batch size: 702, lr: 3.27e-03, grad_scale: 32.0
2023-06-23 20:51:40,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1574502.0, ans=0.125
2023-06-23 20:52:00,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0
2023-06-23 20:52:33,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1574622.0, ans=0.125
2023-06-23 20:52:39,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1574622.0, ans=0.04949747468305833
2023-06-23 20:53:03,417 INFO [train.py:996] (3/4) Epoch 9, batch 18500, loss[loss=0.2326, simple_loss=0.287, pruned_loss=0.08907, over 21335.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2992, pruned_loss=0.07109, over 4252918.67 frames. ], batch size: 144, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 20:53:34,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1574802.0, ans=0.125
2023-06-23 20:53:37,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0
2023-06-23 20:53:45,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0
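The grad_scale field in the train.py:996 records drifts between 8.0, 16.0 and 32.0 across this section. That is the signature of dynamic fp16 loss scaling: the scale is doubled after a long run of overflow-free steps and halved whenever gradients overflow. A sketch using PyTorch's real torch.cuda.amp API follows; the surrounding model, optimizer and batch objects are stand-ins.

```python
# Sketch of the fp16 dynamic loss scaling that the fluctuating grad_scale
# values (8.0 -> 16.0 -> 32.0 -> ...) point to, using PyTorch's real
# torch.cuda.amp API. model/optimizer/criterion/batch are stand-ins.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=1.0, growth_interval=2000)

def train_step(model, optimizer, criterion, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # fp16 forward pass
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales grads; skips step on inf/nan
    scaler.update()                  # grows scale slowly, halves on overflow
    return loss.detach(), scaler.get_scale()  # get_scale() ~ logged grad_scale
```

A scale that keeps recovering after each halving, as here, indicates training is numerically healthy; a scale collapsing toward zero would signal persistent overflows.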
2023-06-23 20:53:57,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.113e+02 7.633e+02 1.101e+03 4.944e+03, threshold=1.527e+03, percent-clipped=5.0
2023-06-23 20:54:10,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1574922.0, ans=0.0
2023-06-23 20:54:12,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1574922.0, ans=0.0
2023-06-23 20:54:27,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1574982.0, ans=0.125
2023-06-23 20:54:42,148 INFO [train.py:996] (3/4) Epoch 9, batch 18550, loss[loss=0.1883, simple_loss=0.2556, pruned_loss=0.0605, over 21206.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2949, pruned_loss=0.07054, over 4246688.98 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 20:54:54,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0
2023-06-23 20:55:39,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1575162.0, ans=0.0
2023-06-23 20:55:41,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0
2023-06-23 20:55:47,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1575222.0, ans=0.125
2023-06-23 20:55:50,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1575222.0, ans=0.0
2023-06-23 20:56:21,021 INFO [train.py:996] (3/4) Epoch 9, batch 18600, loss[loss=0.2081, simple_loss=0.2851, pruned_loss=0.06557, over 21551.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2931, pruned_loss=0.07152, over 4233402.25 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 20:57:16,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.217e+02 6.797e+02 9.080e+02 2.355e+03, threshold=1.359e+03, percent-clipped=3.0
2023-06-23 20:57:20,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1575462.0, ans=0.1
2023-06-23 20:57:59,908 INFO [train.py:996] (3/4) Epoch 9, batch 18650, loss[loss=0.2501, simple_loss=0.3336, pruned_loss=0.08334, over 21782.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2933, pruned_loss=0.0723, over 4240184.82 frames. ], batch size: 391, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 20:59:10,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1575822.0, ans=0.125
2023-06-23 20:59:33,506 INFO [train.py:996] (3/4) Epoch 9, batch 18700, loss[loss=0.2656, simple_loss=0.3075, pruned_loss=0.1118, over 21599.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2912, pruned_loss=0.07297, over 4246842.11 frames. ], batch size: 508, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:00:03,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1576002.0, ans=0.2
2023-06-23 21:00:27,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0
2023-06-23 21:00:27,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.925e+02 8.021e+02 1.131e+03 2.066e+03, threshold=1.604e+03, percent-clipped=15.0
2023-06-23 21:00:42,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1576122.0, ans=0.125
2023-06-23 21:00:48,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0
2023-06-23 21:00:53,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576182.0, ans=0.1
2023-06-23 21:00:53,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1576182.0, ans=0.0
2023-06-23 21:01:10,839 INFO [train.py:996] (3/4) Epoch 9, batch 18750, loss[loss=0.2694, simple_loss=0.34, pruned_loss=0.09944, over 21816.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2939, pruned_loss=0.0759, over 4265656.07 frames. ], batch size: 118, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:01:42,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1576302.0, ans=0.125
2023-06-23 21:02:44,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1576482.0, ans=0.125
2023-06-23 21:02:50,395 INFO [train.py:996] (3/4) Epoch 9, batch 18800, loss[loss=0.2088, simple_loss=0.2996, pruned_loss=0.05901, over 21618.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3019, pruned_loss=0.07849, over 4271805.48 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:03:00,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1576542.0, ans=0.2
2023-06-23 21:03:15,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0
2023-06-23 21:03:19,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1576602.0, ans=0.125
2023-06-23 21:03:48,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.728e+02 5.591e+02 7.847e+02 1.340e+03 4.014e+03, threshold=1.569e+03, percent-clipped=18.0
2023-06-23 21:03:51,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1576722.0, ans=0.125
2023-06-23 21:03:54,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=22.5
2023-06-23 21:04:05,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1576722.0, ans=0.2
2023-06-23 21:04:20,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1576782.0, ans=0.125
2023-06-23 21:04:24,330 INFO [train.py:996] (3/4) Epoch 9, batch 18850, loss[loss=0.2416, simple_loss=0.3003, pruned_loss=0.09151, over 21847.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2987, pruned_loss=0.07349, over 4276586.73 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:04:41,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1576842.0, ans=0.5
2023-06-23 21:04:50,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1576902.0, ans=0.0
2023-06-23 21:05:13,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1576962.0, ans=0.125
2023-06-23 21:05:23,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0
2023-06-23 21:05:30,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1577022.0, ans=0.125
2023-06-23 21:05:43,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1577022.0, ans=0.125
2023-06-23 21:05:44,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1577082.0, ans=0.2
2023-06-23 21:05:51,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1577082.0, ans=0.0
2023-06-23 21:06:01,632 INFO [train.py:996] (3/4) Epoch 9, batch 18900, loss[loss=0.2169, simple_loss=0.2833, pruned_loss=0.07523, over 21612.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2947, pruned_loss=0.0733, over 4280921.03 frames. ], batch size: 415, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:07:02,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.570e+02 8.138e+02 1.105e+03 2.529e+03, threshold=1.628e+03, percent-clipped=6.0
2023-06-23 21:07:40,568 INFO [train.py:996] (3/4) Epoch 9, batch 18950, loss[loss=0.1928, simple_loss=0.2504, pruned_loss=0.06764, over 21164.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2929, pruned_loss=0.07528, over 4282131.74 frames. ], batch size: 608, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:08:01,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1577502.0, ans=0.0
2023-06-23 21:09:24,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1577742.0, ans=0.125
2023-06-23 21:09:25,297 INFO [train.py:996] (3/4) Epoch 9, batch 19000, loss[loss=0.2625, simple_loss=0.3363, pruned_loss=0.09436, over 21837.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3036, pruned_loss=0.07842, over 4276736.84 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:10:22,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.803e+02 7.303e+02 9.676e+02 2.097e+03, threshold=1.461e+03, percent-clipped=8.0
2023-06-23 21:10:33,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1577922.0, ans=0.125
2023-06-23 21:11:04,883 INFO [train.py:996] (3/4) Epoch 9, batch 19050, loss[loss=0.2414, simple_loss=0.3046, pruned_loss=0.08911, over 21334.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3088, pruned_loss=0.08147, over 4270762.96 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:12:01,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578162.0, ans=0.125
2023-06-23 21:12:07,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1578222.0, ans=0.125
2023-06-23 21:12:09,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5
2023-06-23 21:12:21,799 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 21:12:26,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1578282.0, ans=0.125
2023-06-23 21:12:44,220 INFO [train.py:996] (3/4) Epoch 9, batch 19100, loss[loss=0.249, simple_loss=0.3047, pruned_loss=0.09662, over 21328.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3075, pruned_loss=0.08253, over 4277534.68 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:13:20,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1578402.0, ans=0.0
2023-06-23 21:13:20,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1578402.0, ans=0.125
2023-06-23 21:13:38,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.144e+02 5.907e+02 9.492e+02 1.356e+03 2.303e+03, threshold=1.898e+03, percent-clipped=18.0
2023-06-23 21:13:55,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1578522.0, ans=0.0
2023-06-23 21:13:57,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0
2023-06-23 21:14:26,127 INFO [train.py:996] (3/4) Epoch 9, batch 19150, loss[loss=0.3259, simple_loss=0.4155, pruned_loss=0.1181, over 21531.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3098, pruned_loss=0.08335, over 4276726.23 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:15:26,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1578822.0, ans=0.0
2023-06-23 21:16:07,488 INFO [train.py:996] (3/4) Epoch 9, batch 19200, loss[loss=0.2292, simple_loss=0.3331, pruned_loss=0.06263, over 21601.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.32, pruned_loss=0.08414, over 4272338.60 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:16:26,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1578942.0, ans=0.1
2023-06-23 21:16:32,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579002.0, ans=0.1
2023-06-23 21:16:55,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1579062.0, ans=0.0
2023-06-23 21:17:01,509 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 6.099e+02 9.205e+02 1.363e+03 2.424e+03, threshold=1.841e+03, percent-clipped=8.0
2023-06-23 21:17:24,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1579122.0, ans=0.05
2023-06-23 21:17:47,695 INFO [train.py:996] (3/4) Epoch 9, batch 19250, loss[loss=0.2266, simple_loss=0.3117, pruned_loss=0.07078, over 21784.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3193, pruned_loss=0.07883, over 4270191.26 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:17:54,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1579242.0, ans=0.0
2023-06-23 21:18:12,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1579302.0, ans=0.07
2023-06-23 21:18:23,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1579302.0, ans=0.0
2023-06-23 21:18:34,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579362.0, ans=0.1
2023-06-23 21:18:52,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1579422.0, ans=0.1
2023-06-23 21:19:19,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1579482.0, ans=0.125
2023-06-23 21:19:24,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1579482.0, ans=0.025
2023-06-23 21:19:27,356 INFO [train.py:996] (3/4) Epoch 9, batch 19300, loss[loss=0.2443, simple_loss=0.3183, pruned_loss=0.08517, over 21550.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3168, pruned_loss=0.07903, over 4276494.51 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:19:38,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1579542.0, ans=0.1
2023-06-23 21:19:39,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0
2023-06-23 21:19:52,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0
2023-06-23 21:20:01,378 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 21:20:17,166 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 21:20:23,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.858e+02 7.673e+02 1.130e+03 2.664e+03, threshold=1.535e+03, percent-clipped=6.0
2023-06-23 21:20:23,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1579662.0, ans=0.125
2023-06-23 21:20:41,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=12.0
2023-06-23 21:21:18,257 INFO [train.py:996] (3/4) Epoch 9, batch 19350, loss[loss=0.1685, simple_loss=0.2443, pruned_loss=0.04638, over 21307.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3124, pruned_loss=0.07478, over 4273281.12 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:21:38,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1579902.0, ans=0.1
2023-06-23 21:21:46,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1579902.0, ans=0.125
2023-06-23 21:21:58,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1579962.0, ans=0.2
2023-06-23 21:22:46,727 INFO [train.py:996] (3/4) Epoch 9, batch 19400, loss[loss=0.2391, simple_loss=0.3239, pruned_loss=0.0772, over 21595.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3086, pruned_loss=0.07355, over 4274197.04 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:23:18,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0
2023-06-23 21:23:41,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.598e+02 7.737e+02 1.076e+03 2.272e+03, threshold=1.547e+03, percent-clipped=6.0
2023-06-23 21:23:42,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5
2023-06-23 21:23:54,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1580322.0, ans=0.05
2023-06-23 21:24:36,303 INFO [train.py:996] (3/4) Epoch 9, batch 19450, loss[loss=0.2364, simple_loss=0.2904, pruned_loss=0.09121, over 21477.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3064, pruned_loss=0.07582, over 4284038.80 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:25:34,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1580622.0, ans=0.0
2023-06-23 21:25:36,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1580622.0, ans=0.125
2023-06-23 21:25:49,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1580682.0, ans=0.2
2023-06-23 21:26:17,319 INFO [train.py:996] (3/4) Epoch 9, batch 19500, loss[loss=0.1929, simple_loss=0.261, pruned_loss=0.06241, over 21400.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3023, pruned_loss=0.07695, over 4279825.46 frames. ], batch size: 160, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:26:32,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1580802.0, ans=0.035
2023-06-23 21:26:49,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1580862.0, ans=0.0
2023-06-23 21:27:07,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.161e+02 5.798e+02 8.156e+02 1.303e+03 2.400e+03, threshold=1.631e+03, percent-clipped=12.0
2023-06-23 21:27:08,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580862.0, ans=0.1
2023-06-23 21:27:20,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1580922.0, ans=0.2
2023-06-23 21:27:57,886 INFO [train.py:996] (3/4) Epoch 9, batch 19550, loss[loss=0.2129, simple_loss=0.3077, pruned_loss=0.0591, over 21531.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2984, pruned_loss=0.07566, over 4277743.37 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0
2023-06-23 21:28:13,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1581102.0, ans=0.0
2023-06-23 21:28:13,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0
2023-06-23 21:28:20,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1581102.0, ans=0.125
2023-06-23 21:28:30,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1581162.0, ans=0.125
2023-06-23 21:28:48,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1581222.0, ans=0.2
2023-06-23 21:29:09,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=15.0
2023-06-23 21:29:37,012 INFO [train.py:996] (3/4) Epoch 9, batch 19600, loss[loss=0.2692, simple_loss=0.334, pruned_loss=0.1022, over 21836.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3002, pruned_loss=0.07631, over 4279666.58 frames. ], batch size: 112, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:29:53,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1581402.0, ans=0.0
2023-06-23 21:30:00,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0
2023-06-23 21:30:23,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1581462.0, ans=0.125
2023-06-23 21:30:25,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.042e+02 7.862e+02 1.200e+03 2.695e+03, threshold=1.572e+03, percent-clipped=11.0
2023-06-23 21:31:11,426 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 21:31:15,785 INFO [train.py:996] (3/4) Epoch 9, batch 19650, loss[loss=0.2318, simple_loss=0.3033, pruned_loss=0.08014, over 21630.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3053, pruned_loss=0.07976, over 4287550.59 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:31:20,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581642.0, ans=0.1
2023-06-23 21:32:22,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.62 vs. limit=6.0
2023-06-23 21:32:59,036 INFO [train.py:996] (3/4) Epoch 9, batch 19700, loss[loss=0.2258, simple_loss=0.3213, pruned_loss=0.06514, over 21677.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3094, pruned_loss=0.08118, over 4289067.19 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0
2023-06-23 21:33:12,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581942.0, ans=0.1
2023-06-23 21:33:39,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1582062.0, ans=0.0
2023-06-23 21:33:50,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.28 vs. limit=15.0
2023-06-23 21:34:05,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 5.626e+02 7.962e+02 1.128e+03 2.480e+03, threshold=1.592e+03, percent-clipped=10.0
2023-06-23 21:34:26,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1582182.0, ans=0.0
2023-06-23 21:34:43,665 INFO [train.py:996] (3/4) Epoch 9, batch 19750, loss[loss=0.2785, simple_loss=0.3759, pruned_loss=0.09055, over 21775.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3211, pruned_loss=0.08358, over 4286627.28 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0
], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:35:13,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1582302.0, ans=0.125 2023-06-23 21:35:29,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1582362.0, ans=0.0 2023-06-23 21:35:43,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1582362.0, ans=0.0 2023-06-23 21:35:53,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-23 21:36:23,313 INFO [train.py:996] (3/4) Epoch 9, batch 19800, loss[loss=0.2255, simple_loss=0.2867, pruned_loss=0.08216, over 21240.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3184, pruned_loss=0.08301, over 4290258.49 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:36:28,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1582542.0, ans=0.0 2023-06-23 21:36:55,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1582602.0, ans=0.125 2023-06-23 21:37:05,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1582662.0, ans=0.0 2023-06-23 21:37:06,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1582662.0, ans=0.125 2023-06-23 21:37:18,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1582662.0, ans=0.125 2023-06-23 21:37:26,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.106e+02 6.339e+02 1.006e+03 1.413e+03 2.674e+03, threshold=2.011e+03, percent-clipped=18.0 2023-06-23 21:37:34,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1582722.0, ans=0.5 2023-06-23 21:37:39,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1582722.0, ans=0.125 2023-06-23 21:37:52,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1582782.0, ans=0.125 2023-06-23 21:37:54,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 21:38:05,339 INFO [train.py:996] (3/4) Epoch 9, batch 19850, loss[loss=0.1842, simple_loss=0.2632, pruned_loss=0.05266, over 21345.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3117, pruned_loss=0.07807, over 4286941.16 frames. 
], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:38:29,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1582902.0, ans=0.05 2023-06-23 21:38:36,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582902.0, ans=0.1 2023-06-23 21:38:40,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582902.0, ans=0.1 2023-06-23 21:38:41,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1582902.0, ans=0.125 2023-06-23 21:39:00,170 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:39:00,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1582962.0, ans=0.125 2023-06-23 21:39:41,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1583082.0, ans=0.09899494936611666 2023-06-23 21:39:43,942 INFO [train.py:996] (3/4) Epoch 9, batch 19900, loss[loss=0.2131, simple_loss=0.2813, pruned_loss=0.07243, over 21596.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3101, pruned_loss=0.07517, over 4287797.62 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:40:25,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1583202.0, ans=0.125 2023-06-23 21:40:30,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583262.0, ans=0.1 2023-06-23 21:40:45,973 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.026e+02 6.181e+02 9.309e+02 2.570e+03, threshold=1.236e+03, percent-clipped=2.0 2023-06-23 21:40:46,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583322.0, ans=0.1 2023-06-23 21:41:04,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583322.0, ans=0.1 2023-06-23 21:41:28,255 INFO [train.py:996] (3/4) Epoch 9, batch 19950, loss[loss=0.1975, simple_loss=0.2795, pruned_loss=0.05774, over 21730.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.304, pruned_loss=0.07445, over 4281412.45 frames. ], batch size: 316, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:42:06,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1583502.0, ans=0.125 2023-06-23 21:43:01,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583682.0, ans=0.1 2023-06-23 21:43:07,068 INFO [train.py:996] (3/4) Epoch 9, batch 20000, loss[loss=0.2219, simple_loss=0.2919, pruned_loss=0.07591, over 21885.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3051, pruned_loss=0.0758, over 4275819.58 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:43:56,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=15.0 2023-06-23 21:44:03,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.657e+02 5.210e+02 7.571e+02 1.068e+03 2.474e+03, threshold=1.514e+03, percent-clipped=20.0 2023-06-23 21:44:46,969 INFO [train.py:996] (3/4) Epoch 9, batch 20050, loss[loss=0.2134, simple_loss=0.2949, pruned_loss=0.06597, over 21810.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3073, pruned_loss=0.07797, over 4287562.86 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:44:48,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1584042.0, ans=0.125 2023-06-23 21:44:55,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-23 21:45:48,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584222.0, ans=0.125 2023-06-23 21:46:08,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1584282.0, ans=0.125 2023-06-23 21:46:28,200 INFO [train.py:996] (3/4) Epoch 9, batch 20100, loss[loss=0.2408, simple_loss=0.3408, pruned_loss=0.07041, over 21813.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.309, pruned_loss=0.07983, over 4282856.97 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:46:32,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1584342.0, ans=0.2 2023-06-23 21:46:58,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1584402.0, ans=0.125 2023-06-23 21:47:00,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1584402.0, ans=0.125 2023-06-23 21:47:32,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.946e+02 5.376e+02 7.013e+02 1.127e+03 1.999e+03, threshold=1.403e+03, percent-clipped=12.0 2023-06-23 21:48:07,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1584582.0, ans=0.125 2023-06-23 21:48:18,543 INFO [train.py:996] (3/4) Epoch 9, batch 20150, loss[loss=0.3007, simple_loss=0.3675, pruned_loss=0.1169, over 21696.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.318, pruned_loss=0.0832, over 4283956.41 frames. ], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:48:26,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-23 21:48:29,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584642.0, ans=0.1 2023-06-23 21:48:32,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1584642.0, ans=0.125 2023-06-23 21:48:36,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-23 21:48:55,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1584762.0, ans=0.125 2023-06-23 21:49:50,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1584882.0, ans=0.125 2023-06-23 21:50:01,802 INFO [train.py:996] (3/4) Epoch 9, batch 20200, loss[loss=0.2857, simple_loss=0.3861, pruned_loss=0.0926, over 21673.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.325, pruned_loss=0.08568, over 4282780.98 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:50:28,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1585002.0, ans=10.0 2023-06-23 21:50:36,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1585002.0, ans=0.1 2023-06-23 21:50:59,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 7.487e+02 1.027e+03 1.466e+03 2.661e+03, threshold=2.055e+03, percent-clipped=25.0 2023-06-23 21:51:20,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-23 21:51:34,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1585182.0, ans=0.125 2023-06-23 21:51:42,573 INFO [train.py:996] (3/4) Epoch 9, batch 20250, loss[loss=0.205, simple_loss=0.3024, pruned_loss=0.05381, over 21827.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3249, pruned_loss=0.0842, over 4276566.70 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:51:49,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1585242.0, ans=0.125 2023-06-23 21:52:56,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1585422.0, ans=0.2 2023-06-23 21:52:56,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1585422.0, ans=0.125 2023-06-23 21:52:58,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-23 21:53:08,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1585482.0, ans=0.1 2023-06-23 21:53:08,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.35 vs. limit=15.0 2023-06-23 21:53:19,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1585482.0, ans=0.125 2023-06-23 21:53:20,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1585542.0, ans=0.0 2023-06-23 21:53:22,025 INFO [train.py:996] (3/4) Epoch 9, batch 20300, loss[loss=0.1952, simple_loss=0.2714, pruned_loss=0.05948, over 21897.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3215, pruned_loss=0.08181, over 4259485.31 frames. 
], batch size: 98, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:53:27,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1585542.0, ans=0.5 2023-06-23 21:54:05,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1585662.0, ans=0.0 2023-06-23 21:54:28,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 5.917e+02 9.257e+02 1.332e+03 3.110e+03, threshold=1.851e+03, percent-clipped=6.0 2023-06-23 21:54:59,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1585842.0, ans=0.2 2023-06-23 21:55:00,616 INFO [train.py:996] (3/4) Epoch 9, batch 20350, loss[loss=0.2613, simple_loss=0.3386, pruned_loss=0.092, over 21363.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3211, pruned_loss=0.08182, over 4266103.02 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:55:19,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1585842.0, ans=0.0 2023-06-23 21:55:31,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-23 21:55:34,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-23 21:55:36,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-23 21:56:21,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1586022.0, ans=0.2 2023-06-23 21:56:24,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1586082.0, ans=0.0 2023-06-23 21:56:40,517 INFO [train.py:996] (3/4) Epoch 9, batch 20400, loss[loss=0.245, simple_loss=0.3236, pruned_loss=0.08314, over 21937.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.324, pruned_loss=0.08415, over 4262484.02 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:57:37,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1586262.0, ans=0.0 2023-06-23 21:57:42,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 5.797e+02 8.307e+02 1.162e+03 2.401e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 21:57:48,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1586322.0, ans=0.125 2023-06-23 21:58:10,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1586382.0, ans=0.125 2023-06-23 21:58:15,247 INFO [train.py:996] (3/4) Epoch 9, batch 20450, loss[loss=0.2642, simple_loss=0.3205, pruned_loss=0.104, over 21580.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3253, pruned_loss=0.08713, over 4265958.33 frames. 
], batch size: 507, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:58:38,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1586502.0, ans=0.035 2023-06-23 21:58:43,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586502.0, ans=0.1 2023-06-23 21:59:33,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1586682.0, ans=0.125 2023-06-23 21:59:46,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1586682.0, ans=0.125 2023-06-23 21:59:49,253 INFO [train.py:996] (3/4) Epoch 9, batch 20500, loss[loss=0.2228, simple_loss=0.2857, pruned_loss=0.07992, over 21681.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3211, pruned_loss=0.08717, over 4261611.90 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:00:24,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1586862.0, ans=0.125 2023-06-23 22:00:58,936 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 5.832e+02 8.906e+02 1.259e+03 2.508e+03, threshold=1.781e+03, percent-clipped=16.0 2023-06-23 22:01:03,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1586922.0, ans=0.0 2023-06-23 22:01:20,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1586982.0, ans=0.125 2023-06-23 22:01:29,766 INFO [train.py:996] (3/4) Epoch 9, batch 20550, loss[loss=0.2332, simple_loss=0.3027, pruned_loss=0.08184, over 21263.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3133, pruned_loss=0.08523, over 4255959.95 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:01:37,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587042.0, ans=0.1 2023-06-23 22:02:41,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-23 22:03:03,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-23 22:03:04,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1587282.0, ans=0.125 2023-06-23 22:03:08,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=12.0 2023-06-23 22:03:10,665 INFO [train.py:996] (3/4) Epoch 9, batch 20600, loss[loss=0.2431, simple_loss=0.3394, pruned_loss=0.07338, over 20713.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3161, pruned_loss=0.08345, over 4250790.52 frames. 
], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:03:11,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1587342.0, ans=0.0 2023-06-23 22:03:11,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1587342.0, ans=0.0 2023-06-23 22:04:20,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.756e+02 4.749e+02 5.700e+02 8.605e+02 1.495e+03, threshold=1.140e+03, percent-clipped=0.0 2023-06-23 22:04:26,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1587522.0, ans=0.125 2023-06-23 22:04:34,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-23 22:04:51,864 INFO [train.py:996] (3/4) Epoch 9, batch 20650, loss[loss=0.2165, simple_loss=0.2851, pruned_loss=0.07398, over 21266.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3115, pruned_loss=0.08341, over 4254750.81 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:05:12,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1587702.0, ans=0.2 2023-06-23 22:05:19,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587702.0, ans=0.1 2023-06-23 22:05:32,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1587762.0, ans=0.2 2023-06-23 22:05:41,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1587762.0, ans=0.2 2023-06-23 22:05:45,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1587762.0, ans=0.2 2023-06-23 22:05:45,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587762.0, ans=0.1 2023-06-23 22:06:11,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1587822.0, ans=0.0 2023-06-23 22:06:19,495 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:06:32,141 INFO [train.py:996] (3/4) Epoch 9, batch 20700, loss[loss=0.2241, simple_loss=0.3048, pruned_loss=0.07173, over 21650.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3046, pruned_loss=0.08028, over 4258749.92 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:07:44,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.397e+02 7.207e+02 1.144e+03 2.870e+03, threshold=1.441e+03, percent-clipped=25.0 2023-06-23 22:07:48,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-23 22:08:01,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588182.0, ans=0.125 2023-06-23 22:08:20,553 INFO [train.py:996] (3/4) Epoch 9, batch 20750, loss[loss=0.4047, simple_loss=0.4694, pruned_loss=0.17, over 21425.00 frames. 
], tot_loss[loss=0.2353, simple_loss=0.3098, pruned_loss=0.08044, over 4260053.53 frames. ], batch size: 507, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:08:37,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-23 22:09:20,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1588362.0, ans=0.0 2023-06-23 22:09:33,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0 2023-06-23 22:09:41,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1588482.0, ans=0.125 2023-06-23 22:09:44,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1588482.0, ans=0.125 2023-06-23 22:09:56,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1588482.0, ans=0.2 2023-06-23 22:10:00,928 INFO [train.py:996] (3/4) Epoch 9, batch 20800, loss[loss=0.2048, simple_loss=0.2797, pruned_loss=0.06498, over 21564.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.309, pruned_loss=0.08038, over 4259400.41 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:10:27,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-23 22:10:54,708 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:10:55,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-23 22:11:01,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1588662.0, ans=0.2 2023-06-23 22:11:06,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.156e+02 7.961e+02 1.123e+03 3.663e+03, threshold=1.592e+03, percent-clipped=17.0 2023-06-23 22:11:08,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1588722.0, ans=0.2 2023-06-23 22:11:40,263 INFO [train.py:996] (3/4) Epoch 9, batch 20850, loss[loss=0.1994, simple_loss=0.2692, pruned_loss=0.06484, over 21453.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3011, pruned_loss=0.07816, over 4258608.28 frames. 
], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:11:45,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1588842.0, ans=0.125 2023-06-23 22:12:23,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1588902.0, ans=0.125 2023-06-23 22:12:48,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1589022.0, ans=0.125 2023-06-23 22:13:15,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1589082.0, ans=0.09899494936611666 2023-06-23 22:13:18,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589142.0, ans=0.1 2023-06-23 22:13:19,716 INFO [train.py:996] (3/4) Epoch 9, batch 20900, loss[loss=0.2347, simple_loss=0.3022, pruned_loss=0.08363, over 21538.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3038, pruned_loss=0.08005, over 4256785.28 frames. ], batch size: 195, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:13:21,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1589142.0, ans=0.125 2023-06-23 22:14:23,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.693e+02 5.793e+02 9.224e+02 1.777e+03 3.715e+03, threshold=1.845e+03, percent-clipped=30.0 2023-06-23 22:14:51,901 INFO [train.py:996] (3/4) Epoch 9, batch 20950, loss[loss=0.2901, simple_loss=0.4154, pruned_loss=0.08242, over 19701.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3003, pruned_loss=0.07619, over 4254251.92 frames. ], batch size: 702, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:15:27,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1589502.0, ans=0.125 2023-06-23 22:16:11,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-23 22:16:25,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1589682.0, ans=0.125 2023-06-23 22:16:28,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1589742.0, ans=0.125 2023-06-23 22:16:29,640 INFO [train.py:996] (3/4) Epoch 9, batch 21000, loss[loss=0.2442, simple_loss=0.311, pruned_loss=0.08871, over 21640.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3024, pruned_loss=0.07677, over 4247096.23 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:16:29,641 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 22:16:50,151 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2633, simple_loss=0.3613, pruned_loss=0.0826, over 1796401.00 frames. 
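The [train.py:1019]/[train.py:1028] records just above, and the [train.py:1029] memory report that follows, are the periodic validation pass: every valid_interval batches training pauses, the dev loader is swept once without gradients, the loss is normalised by the number of frames, and peak GPU memory is reported. A minimal sketch of that pattern, assuming a compute_loss helper that returns (summed loss, frame count) for one batch; both names are placeholders rather than the recipe's actual API:

    import torch

    def validate(model, dev_loader, compute_loss, device, epoch):
        # Mirrors the "Computing validation loss" -> "validation: loss=..."
        # -> "Maximum memory allocated" sequence in the log.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch, device)
                tot_loss += loss.item()
                tot_frames += num_frames
        model.train()
        print(f"Epoch {epoch}, validation: loss={tot_loss / tot_frames:.4f}, "
              f"over {tot_frames:.2f} frames.")
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")

Because the dev set is fixed (1796401.00 frames in every validation record), the per-frame validation losses are directly comparable across epochs.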
2023-06-23 22:16:50,152 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 22:17:13,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1589802.0, ans=0.125 2023-06-23 22:17:35,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1589862.0, ans=0.125 2023-06-23 22:17:51,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 5.898e+02 8.113e+02 1.195e+03 2.501e+03, threshold=1.623e+03, percent-clipped=8.0 2023-06-23 22:17:57,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1589922.0, ans=0.125 2023-06-23 22:18:30,028 INFO [train.py:996] (3/4) Epoch 9, batch 21050, loss[loss=0.1886, simple_loss=0.287, pruned_loss=0.04509, over 19882.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3006, pruned_loss=0.07712, over 4254090.98 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:19:11,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1590162.0, ans=0.04949747468305833 2023-06-23 22:19:18,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590162.0, ans=0.125 2023-06-23 22:19:24,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1590162.0, ans=0.0 2023-06-23 22:19:37,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590222.0, ans=0.125 2023-06-23 22:20:08,617 INFO [train.py:996] (3/4) Epoch 9, batch 21100, loss[loss=0.2191, simple_loss=0.2788, pruned_loss=0.07973, over 21605.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2961, pruned_loss=0.0762, over 4255582.65 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:20:09,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1590342.0, ans=0.0 2023-06-23 22:20:40,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-23 22:21:11,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.136e+02 6.651e+02 8.328e+02 1.901e+03, threshold=1.330e+03, percent-clipped=2.0 2023-06-23 22:21:42,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-23 22:21:48,084 INFO [train.py:996] (3/4) Epoch 9, batch 21150, loss[loss=0.2108, simple_loss=0.2683, pruned_loss=0.07671, over 21240.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2921, pruned_loss=0.07642, over 4257739.28 frames. ], batch size: 144, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:22:34,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590762.0, ans=0.1 2023-06-23 22:23:02,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=15.0 2023-06-23 22:23:21,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1590882.0, ans=0.0 2023-06-23 22:23:26,830 INFO [train.py:996] (3/4) Epoch 9, batch 21200, loss[loss=0.1855, simple_loss=0.2624, pruned_loss=0.05436, over 21707.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.288, pruned_loss=0.07547, over 4255662.15 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:23:40,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1590942.0, ans=0.04949747468305833 2023-06-23 22:24:23,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 22:24:29,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.749e+02 4.861e+02 6.796e+02 9.543e+02 2.010e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 22:24:41,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1591182.0, ans=0.125 2023-06-23 22:24:53,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1591182.0, ans=0.125 2023-06-23 22:25:03,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1591182.0, ans=0.125 2023-06-23 22:25:05,765 INFO [train.py:996] (3/4) Epoch 9, batch 21250, loss[loss=0.2807, simple_loss=0.3468, pruned_loss=0.1073, over 21599.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2868, pruned_loss=0.07559, over 4255556.01 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:26:18,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-23 22:26:27,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1591482.0, ans=0.0 2023-06-23 22:26:33,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1591482.0, ans=0.0 2023-06-23 22:26:41,379 INFO [train.py:996] (3/4) Epoch 9, batch 21300, loss[loss=0.2423, simple_loss=0.3124, pruned_loss=0.08611, over 21607.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2942, pruned_loss=0.07819, over 4255919.07 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:27:12,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. 
limit=15.0 2023-06-23 22:27:48,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1591722.0, ans=0.0 2023-06-23 22:27:49,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 6.806e+02 9.811e+02 1.401e+03 3.569e+03, threshold=1.962e+03, percent-clipped=29.0 2023-06-23 22:27:59,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591782.0, ans=0.1 2023-06-23 22:28:04,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1591782.0, ans=0.0 2023-06-23 22:28:04,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1591782.0, ans=0.125 2023-06-23 22:28:24,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-23 22:28:25,292 INFO [train.py:996] (3/4) Epoch 9, batch 21350, loss[loss=0.2199, simple_loss=0.3135, pruned_loss=0.06319, over 21794.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2977, pruned_loss=0.07834, over 4255707.07 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:28:52,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-23 22:30:04,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1592082.0, ans=0.125 2023-06-23 22:30:10,696 INFO [train.py:996] (3/4) Epoch 9, batch 21400, loss[loss=0.2203, simple_loss=0.2899, pruned_loss=0.07534, over 21718.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3005, pruned_loss=0.07773, over 4266954.84 frames. ], batch size: 112, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:30:42,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592202.0, ans=0.1 2023-06-23 22:30:50,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1592262.0, ans=0.0 2023-06-23 22:31:00,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1592262.0, ans=0.0 2023-06-23 22:31:08,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.270e+02 6.886e+02 1.009e+03 2.109e+03, threshold=1.377e+03, percent-clipped=2.0 2023-06-23 22:31:50,206 INFO [train.py:996] (3/4) Epoch 9, batch 21450, loss[loss=0.2188, simple_loss=0.2885, pruned_loss=0.07453, over 21500.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3044, pruned_loss=0.07967, over 4266657.89 frames. ], batch size: 194, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:32:01,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1592442.0, ans=0.2 2023-06-23 22:32:22,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-23 22:33:27,834 INFO [train.py:996] (3/4) Epoch 9, batch 21500, loss[loss=0.2086, simple_loss=0.2674, pruned_loss=0.07491, over 21541.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3036, pruned_loss=0.08073, over 4253107.46 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0
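The [optim.py:471] records in this stretch summarise gradient-norm clipping: the five numbers are the min/25%/median/75%/max quantiles of recently observed gradient norms, the threshold equals Clipping_scale times the median quantile (2.0 x 9.811e+02 = 1.962e+03 in the record above), and percent-clipped is the share of recent steps whose norm exceeded it. A small sketch of that bookkeeping, assuming a plain rolling window and whole-model clipping; the real optimizer folds this into its update step, and the window length here is arbitrary:

    import torch
    from collections import deque

    class GradNormClipperSketch:
        def __init__(self, clipping_scale=2.0, window=128):
            self.scale = clipping_scale
            self.norms = deque(maxlen=window)     # recent grad norms
            self.clipped = deque(maxlen=window)   # recent clip decisions

        def step(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms.append(norm)
            t = torch.tensor(list(self.norms))
            q = [t.quantile(x).item() for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.scale * q[2]         # Clipping_scale * median
            clip = norm > threshold
            self.clipped.append(clip)
            if clip:
                for p in params:
                    p.grad.mul_(threshold / norm)  # rescale to the threshold
            pct = 100.0 * sum(self.clipped) / len(self.clipped)
            print(f"Clipping_scale={self.scale}, grad-norm quartiles "
                  + " ".join(f"{v:.3e}" for v in q)
                  + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}")

A rising percent-clipped (29.0 above, against single digits in nearby records) means the recent norm distribution has grown a heavier tail relative to its median.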
2023-06-23 22:34:12,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1592862.0, ans=0.125 2023-06-23 22:34:29,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.721e+02 7.470e+02 9.927e+02 1.833e+03, threshold=1.494e+03, percent-clipped=12.0 2023-06-23 22:34:31,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592922.0, ans=0.1 2023-06-23 22:35:06,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-23 22:35:06,808 INFO [train.py:996] (3/4) Epoch 9, batch 21550, loss[loss=0.2037, simple_loss=0.2727, pruned_loss=0.06732, over 21835.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2966, pruned_loss=0.07803, over 4258608.68 frames. ], batch size: 98, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:36:29,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1593282.0, ans=0.125 2023-06-23 22:36:47,526 INFO [train.py:996] (3/4) Epoch 9, batch 21600, loss[loss=0.1823, simple_loss=0.2503, pruned_loss=0.05708, over 21472.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2958, pruned_loss=0.07699, over 4257183.45 frames. ], batch size: 212, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:36:52,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1593342.0, ans=0.0 2023-06-23 22:37:09,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593402.0, ans=0.1 2023-06-23 22:38:04,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.659e+02 6.338e+02 9.920e+02 1.459e+03 3.157e+03, threshold=1.984e+03, percent-clipped=22.0 2023-06-23 22:38:09,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=15.0 2023-06-23 22:38:28,836 INFO [train.py:996] (3/4) Epoch 9, batch 21650, loss[loss=0.2198, simple_loss=0.3121, pruned_loss=0.06373, over 21582.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2992, pruned_loss=0.07475, over 4256993.14 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:38:41,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1593642.0, ans=0.125 2023-06-23 22:38:50,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-23 22:38:56,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-23 22:39:03,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs.
limit=15.0 2023-06-23 22:39:03,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1593762.0, ans=0.125 2023-06-23 22:39:14,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1593762.0, ans=0.125 2023-06-23 22:40:05,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-23 22:40:06,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1593942.0, ans=0.2 2023-06-23 22:40:07,976 INFO [train.py:996] (3/4) Epoch 9, batch 21700, loss[loss=0.2564, simple_loss=0.308, pruned_loss=0.1024, over 21302.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.299, pruned_loss=0.07291, over 4257175.40 frames. ], batch size: 507, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:41:10,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.189e+02 8.394e+02 1.254e+03 2.013e+03, threshold=1.679e+03, percent-clipped=1.0 2023-06-23 22:41:18,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1594122.0, ans=0.5 2023-06-23 22:41:30,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1594182.0, ans=0.125 2023-06-23 22:41:31,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1594182.0, ans=0.0 2023-06-23 22:41:34,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1594182.0, ans=0.125 2023-06-23 22:41:39,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1594182.0, ans=0.0 2023-06-23 22:41:45,166 INFO [train.py:996] (3/4) Epoch 9, batch 21750, loss[loss=0.2027, simple_loss=0.257, pruned_loss=0.07419, over 21208.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2942, pruned_loss=0.07242, over 4250290.43 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:41:49,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-06-23 22:41:58,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1594242.0, ans=0.1 2023-06-23 22:42:38,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1594362.0, ans=0.015 2023-06-23 22:42:53,099 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:42:56,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1594422.0, ans=0.125 2023-06-23 22:43:25,515 INFO [train.py:996] (3/4) Epoch 9, batch 21800, loss[loss=0.2443, simple_loss=0.2901, pruned_loss=0.09926, over 21401.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.294, pruned_loss=0.07422, over 4251668.99 frames. 
], batch size: 509, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:44:32,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1594722.0, ans=0.95 2023-06-23 22:44:34,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.869e+02 5.122e+02 6.776e+02 1.046e+03 2.535e+03, threshold=1.355e+03, percent-clipped=3.0 2023-06-23 22:44:49,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1594782.0, ans=0.0 2023-06-23 22:45:05,000 INFO [train.py:996] (3/4) Epoch 9, batch 21850, loss[loss=0.3167, simple_loss=0.3671, pruned_loss=0.1332, over 21692.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2986, pruned_loss=0.07485, over 4255176.80 frames. ], batch size: 507, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:45:05,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1594842.0, ans=0.2 2023-06-23 22:46:05,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1595022.0, ans=0.125 2023-06-23 22:46:05,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1595022.0, ans=0.0 2023-06-23 22:46:14,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1595022.0, ans=0.1 2023-06-23 22:46:36,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-23 22:46:42,694 INFO [train.py:996] (3/4) Epoch 9, batch 21900, loss[loss=0.2338, simple_loss=0.3073, pruned_loss=0.08016, over 21692.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2982, pruned_loss=0.07535, over 4262381.18 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:47:00,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1595202.0, ans=0.125 2023-06-23 22:47:02,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1595202.0, ans=0.2 2023-06-23 22:47:49,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.776e+02 5.560e+02 7.973e+02 1.226e+03 2.341e+03, threshold=1.595e+03, percent-clipped=19.0 2023-06-23 22:47:55,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1595322.0, ans=0.125 2023-06-23 22:48:20,423 INFO [train.py:996] (3/4) Epoch 9, batch 21950, loss[loss=0.2967, simple_loss=0.3983, pruned_loss=0.09751, over 19733.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2935, pruned_loss=0.07461, over 4246572.45 frames. ], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:49:02,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.10 vs. 
limit=10.0 2023-06-23 22:49:03,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1595562.0, ans=0.2 2023-06-23 22:49:19,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1595562.0, ans=0.125 2023-06-23 22:49:59,982 INFO [train.py:996] (3/4) Epoch 9, batch 22000, loss[loss=0.2297, simple_loss=0.299, pruned_loss=0.08021, over 21875.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2874, pruned_loss=0.07173, over 4255438.28 frames. ], batch size: 107, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:50:14,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-23 22:51:14,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 5.153e+02 7.605e+02 1.162e+03 2.837e+03, threshold=1.521e+03, percent-clipped=11.0 2023-06-23 22:51:40,190 INFO [train.py:996] (3/4) Epoch 9, batch 22050, loss[loss=0.1954, simple_loss=0.2769, pruned_loss=0.05695, over 16348.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2916, pruned_loss=0.0737, over 4253550.30 frames. ], batch size: 61, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:51:40,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1596042.0, ans=0.125 2023-06-23 22:52:15,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-23 22:52:54,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1596222.0, ans=0.0 2023-06-23 22:52:55,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-23 22:53:13,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1596282.0, ans=0.125 2023-06-23 22:53:13,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1596282.0, ans=0.125 2023-06-23 22:53:19,861 INFO [train.py:996] (3/4) Epoch 9, batch 22100, loss[loss=0.2396, simple_loss=0.3053, pruned_loss=0.08697, over 21813.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3034, pruned_loss=0.07864, over 4246360.78 frames. ], batch size: 247, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:53:56,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1596402.0, ans=0.125 2023-06-23 22:54:23,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1596522.0, ans=0.0 2023-06-23 22:54:34,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 6.584e+02 8.540e+02 1.234e+03 2.755e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-23 22:54:55,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-23 22:54:57,964 INFO [train.py:996] (3/4) Epoch 9, batch 22150, loss[loss=0.2332, simple_loss=0.3067, pruned_loss=0.07987, over 21902.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.306, pruned_loss=0.08046, over 4260763.99 frames. 
], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:55:01,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596642.0, ans=0.1 2023-06-23 22:55:41,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1596762.0, ans=0.125 2023-06-23 22:55:49,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1596762.0, ans=0.125 2023-06-23 22:56:12,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1596822.0, ans=0.125 2023-06-23 22:56:30,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1596882.0, ans=0.125 2023-06-23 22:56:37,887 INFO [train.py:996] (3/4) Epoch 9, batch 22200, loss[loss=0.2296, simple_loss=0.3164, pruned_loss=0.07141, over 21633.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3077, pruned_loss=0.082, over 4276482.57 frames. ], batch size: 230, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:56:45,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-23 22:57:26,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597062.0, ans=0.1 2023-06-23 22:57:45,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597122.0, ans=0.1 2023-06-23 22:57:54,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.796e+02 5.424e+02 7.068e+02 9.828e+02 2.083e+03, threshold=1.414e+03, percent-clipped=7.0 2023-06-23 22:57:56,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1597122.0, ans=0.0 2023-06-23 22:58:16,262 INFO [train.py:996] (3/4) Epoch 9, batch 22250, loss[loss=0.294, simple_loss=0.3582, pruned_loss=0.1149, over 21370.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3153, pruned_loss=0.08384, over 4280257.41 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:58:21,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597242.0, ans=0.1 2023-06-23 22:59:07,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1597362.0, ans=0.125 2023-06-23 22:59:54,579 INFO [train.py:996] (3/4) Epoch 9, batch 22300, loss[loss=0.2342, simple_loss=0.2967, pruned_loss=0.08583, over 21311.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3165, pruned_loss=0.0855, over 4285715.03 frames. 
], batch size: 159, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:00:24,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597602.0, ans=0.1 2023-06-23 23:00:30,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1597602.0, ans=0.0 2023-06-23 23:01:10,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.946e+02 8.093e+02 1.234e+03 3.372e+03, threshold=1.619e+03, percent-clipped=19.0 2023-06-23 23:01:33,316 INFO [train.py:996] (3/4) Epoch 9, batch 22350, loss[loss=0.2621, simple_loss=0.3252, pruned_loss=0.09945, over 21699.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3155, pruned_loss=0.08619, over 4286023.07 frames. ], batch size: 508, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:02:23,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1597962.0, ans=0.2 2023-06-23 23:02:27,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0 2023-06-23 23:02:45,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1598022.0, ans=0.2 2023-06-23 23:03:03,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-23 23:03:03,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1598082.0, ans=0.125 2023-06-23 23:03:22,837 INFO [train.py:996] (3/4) Epoch 9, batch 22400, loss[loss=0.2396, simple_loss=0.3123, pruned_loss=0.08343, over 21669.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3112, pruned_loss=0.08232, over 4289701.92 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:03:57,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1598202.0, ans=0.0 2023-06-23 23:04:18,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1598322.0, ans=0.2 2023-06-23 23:04:18,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1598322.0, ans=0.0 2023-06-23 23:04:24,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-23 23:04:29,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.562e+02 4.983e+02 6.853e+02 9.645e+02 2.077e+03, threshold=1.371e+03, percent-clipped=3.0 2023-06-23 23:04:40,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1598382.0, ans=0.0 2023-06-23 23:04:57,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598382.0, ans=0.1 2023-06-23 23:05:00,745 INFO [train.py:996] (3/4) Epoch 9, batch 22450, loss[loss=0.2301, simple_loss=0.2858, pruned_loss=0.0872, over 21813.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3047, pruned_loss=0.08182, over 4282470.86 frames. 
], batch size: 98, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:05:34,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1598502.0, ans=0.2 2023-06-23 23:05:57,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-23 23:06:09,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-23 23:06:39,879 INFO [train.py:996] (3/4) Epoch 9, batch 22500, loss[loss=0.2353, simple_loss=0.3109, pruned_loss=0.07987, over 21320.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3011, pruned_loss=0.08152, over 4285699.07 frames. ], batch size: 194, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:07:24,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1598862.0, ans=0.0 2023-06-23 23:07:26,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1598862.0, ans=0.0 2023-06-23 23:07:47,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.895e+02 7.835e+02 1.248e+03 2.629e+03, threshold=1.567e+03, percent-clipped=21.0 2023-06-23 23:08:16,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1598982.0, ans=0.0 2023-06-23 23:08:18,957 INFO [train.py:996] (3/4) Epoch 9, batch 22550, loss[loss=0.2723, simple_loss=0.3505, pruned_loss=0.09705, over 21785.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3061, pruned_loss=0.0827, over 4285330.81 frames. ], batch size: 414, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:08:32,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1599042.0, ans=0.0 2023-06-23 23:08:49,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599102.0, ans=0.1 2023-06-23 23:08:58,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599102.0, ans=0.1 2023-06-23 23:09:39,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1599222.0, ans=0.2 2023-06-23 23:09:47,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1599282.0, ans=0.125 2023-06-23 23:10:06,612 INFO [train.py:996] (3/4) Epoch 9, batch 22600, loss[loss=0.2453, simple_loss=0.3289, pruned_loss=0.08085, over 21737.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3078, pruned_loss=0.08212, over 4290798.87 frames. 
], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:10:24,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1599402.0, ans=0.1 2023-06-23 23:11:13,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.856e+02 1.098e+03 1.547e+03 4.006e+03, threshold=2.196e+03, percent-clipped=25.0 2023-06-23 23:11:17,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1599522.0, ans=0.125 2023-06-23 23:11:45,419 INFO [train.py:996] (3/4) Epoch 9, batch 22650, loss[loss=0.2152, simple_loss=0.2851, pruned_loss=0.07269, over 21741.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3052, pruned_loss=0.08206, over 4262074.92 frames. ], batch size: 112, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:12:17,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-23 23:12:42,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1599822.0, ans=0.1 2023-06-23 23:13:18,776 INFO [train.py:996] (3/4) Epoch 9, batch 22700, loss[loss=0.2159, simple_loss=0.2717, pruned_loss=0.08007, over 21628.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.2985, pruned_loss=0.08165, over 4263555.11 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:13:22,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1599942.0, ans=0.0 2023-06-23 23:13:35,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1600002.0, ans=0.125 2023-06-23 23:13:35,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1600002.0, ans=0.0 2023-06-23 23:13:38,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1600002.0, ans=0.0 2023-06-23 23:13:50,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1600062.0, ans=0.2 2023-06-23 23:14:00,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1600062.0, ans=0.035 2023-06-23 23:14:26,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 5.778e+02 8.238e+02 1.243e+03 2.659e+03, threshold=1.648e+03, percent-clipped=2.0 2023-06-23 23:14:31,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600182.0, ans=0.1 2023-06-23 23:14:58,259 INFO [train.py:996] (3/4) Epoch 9, batch 22750, loss[loss=0.2546, simple_loss=0.3233, pruned_loss=0.093, over 21758.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.2997, pruned_loss=0.08268, over 4264844.61 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:15:31,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1600362.0, ans=0.035 2023-06-23 23:16:05,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1600422.0, ans=0.2 2023-06-23 23:16:37,195 INFO [train.py:996] (3/4) Epoch 9, batch 22800, loss[loss=0.2249, simple_loss=0.2964, pruned_loss=0.07673, over 21877.00 frames. 
], tot_loss[loss=0.238, simple_loss=0.3061, pruned_loss=0.08501, over 4272782.10 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:16:59,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1600602.0, ans=0.125 2023-06-23 23:17:45,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 5.839e+02 8.746e+02 1.348e+03 2.535e+03, threshold=1.749e+03, percent-clipped=13.0 2023-06-23 23:18:02,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1600782.0, ans=15.0 2023-06-23 23:18:15,367 INFO [train.py:996] (3/4) Epoch 9, batch 22850, loss[loss=0.2191, simple_loss=0.2784, pruned_loss=0.07985, over 21501.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3027, pruned_loss=0.0845, over 4274050.93 frames. ], batch size: 230, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:18:17,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1600842.0, ans=0.0 2023-06-23 23:18:23,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1600842.0, ans=0.2 2023-06-23 23:19:24,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1601082.0, ans=0.125 2023-06-23 23:19:49,658 INFO [train.py:996] (3/4) Epoch 9, batch 22900, loss[loss=0.2234, simple_loss=0.3071, pruned_loss=0.06985, over 21248.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3033, pruned_loss=0.08368, over 4258309.33 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:19:55,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601142.0, ans=0.1 2023-06-23 23:20:22,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1601262.0, ans=0.2 2023-06-23 23:20:32,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1601262.0, ans=0.1 2023-06-23 23:21:02,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 7.597e+02 1.119e+03 1.552e+03 2.740e+03, threshold=2.237e+03, percent-clipped=15.0 2023-06-23 23:21:23,736 INFO [train.py:996] (3/4) Epoch 9, batch 22950, loss[loss=0.1951, simple_loss=0.254, pruned_loss=0.06808, over 20339.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3145, pruned_loss=0.0812, over 4262018.02 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:21:30,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1601442.0, ans=0.04949747468305833 2023-06-23 23:21:31,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1601442.0, ans=0.0 2023-06-23 23:21:35,661 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:22:03,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1601562.0, ans=0.125 2023-06-23 23:22:11,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=15.0 2023-06-23 23:22:15,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1601562.0, ans=0.125 2023-06-23 23:22:47,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1601682.0, ans=0.125 2023-06-23 23:23:02,606 INFO [train.py:996] (3/4) Epoch 9, batch 23000, loss[loss=0.2395, simple_loss=0.3107, pruned_loss=0.0842, over 21850.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3158, pruned_loss=0.07952, over 4272076.20 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:23:10,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1601742.0, ans=0.2 2023-06-23 23:23:31,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1601802.0, ans=0.2 2023-06-23 23:23:55,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1601862.0, ans=0.2 2023-06-23 23:24:17,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.434e+02 6.875e+02 9.682e+02 1.732e+03, threshold=1.375e+03, percent-clipped=0.0 2023-06-23 23:24:38,050 INFO [train.py:996] (3/4) Epoch 9, batch 23050, loss[loss=0.3038, simple_loss=0.3603, pruned_loss=0.1236, over 21474.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3171, pruned_loss=0.08181, over 4278153.46 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:25:06,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1602102.0, ans=0.125 2023-06-23 23:25:19,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602162.0, ans=0.1 2023-06-23 23:25:36,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1602162.0, ans=0.0 2023-06-23 23:25:47,720 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:26:13,052 INFO [train.py:996] (3/4) Epoch 9, batch 23100, loss[loss=0.195, simple_loss=0.2552, pruned_loss=0.06741, over 20721.00 frames. ], tot_loss[loss=0.239, simple_loss=0.313, pruned_loss=0.08251, over 4275231.08 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:27:09,073 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:27:13,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1602462.0, ans=0.2 2023-06-23 23:27:30,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.933e+02 6.265e+02 7.988e+02 9.890e+02 1.959e+03, threshold=1.598e+03, percent-clipped=10.0 2023-06-23 23:27:51,457 INFO [train.py:996] (3/4) Epoch 9, batch 23150, loss[loss=0.2318, simple_loss=0.2857, pruned_loss=0.08899, over 21340.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3074, pruned_loss=0.08182, over 4276431.52 frames. 
], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:28:52,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1602762.0, ans=0.125 2023-06-23 23:29:16,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-23 23:29:29,514 INFO [train.py:996] (3/4) Epoch 9, batch 23200, loss[loss=0.2533, simple_loss=0.3242, pruned_loss=0.09124, over 20148.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3071, pruned_loss=0.08273, over 4285328.99 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:30:17,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1603062.0, ans=0.125 2023-06-23 23:30:29,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1603122.0, ans=0.125 2023-06-23 23:30:46,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.576e+02 6.977e+02 1.069e+03 2.508e+03, threshold=1.395e+03, percent-clipped=7.0 2023-06-23 23:31:07,205 INFO [train.py:996] (3/4) Epoch 9, batch 23250, loss[loss=0.2527, simple_loss=0.3124, pruned_loss=0.09653, over 21625.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3058, pruned_loss=0.08342, over 4293956.75 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:31:52,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-23 23:32:37,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1603482.0, ans=0.125 2023-06-23 23:32:52,328 INFO [train.py:996] (3/4) Epoch 9, batch 23300, loss[loss=0.2607, simple_loss=0.3594, pruned_loss=0.08097, over 21428.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3133, pruned_loss=0.08511, over 4299790.77 frames. ], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:33:20,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603602.0, ans=0.1 2023-06-23 23:33:38,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1603662.0, ans=0.95 2023-06-23 23:33:47,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1603662.0, ans=0.125 2023-06-23 23:33:47,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1603662.0, ans=0.0 2023-06-23 23:34:00,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.84 vs. limit=10.0 2023-06-23 23:34:08,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.193e+02 5.673e+02 7.444e+02 1.083e+03 2.210e+03, threshold=1.489e+03, percent-clipped=13.0 2023-06-23 23:34:22,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1603782.0, ans=0.125 2023-06-23 23:34:37,329 INFO [train.py:996] (3/4) Epoch 9, batch 23350, loss[loss=0.1562, simple_loss=0.2353, pruned_loss=0.03857, over 21196.00 frames. 
], tot_loss[loss=0.2444, simple_loss=0.3194, pruned_loss=0.0847, over 4299500.95 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:35:25,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1603962.0, ans=0.125 2023-06-23 23:35:59,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1604082.0, ans=0.0 2023-06-23 23:36:12,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1604082.0, ans=0.0 2023-06-23 23:36:15,592 INFO [train.py:996] (3/4) Epoch 9, batch 23400, loss[loss=0.2153, simple_loss=0.2838, pruned_loss=0.07338, over 21440.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.312, pruned_loss=0.08031, over 4299470.23 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:36:20,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604142.0, ans=0.1 2023-06-23 23:36:25,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1604142.0, ans=0.0 2023-06-23 23:37:04,846 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:37:32,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.577e+02 6.077e+02 8.548e+02 1.177e+03 1.985e+03, threshold=1.710e+03, percent-clipped=13.0 2023-06-23 23:37:55,564 INFO [train.py:996] (3/4) Epoch 9, batch 23450, loss[loss=0.337, simple_loss=0.3802, pruned_loss=0.1469, over 21303.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3128, pruned_loss=0.08313, over 4300528.51 frames. ], batch size: 507, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:38:18,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1604442.0, ans=0.125 2023-06-23 23:38:53,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-23 23:38:57,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1604622.0, ans=0.125 2023-06-23 23:39:21,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604682.0, ans=0.1 2023-06-23 23:39:33,084 INFO [train.py:996] (3/4) Epoch 9, batch 23500, loss[loss=0.2542, simple_loss=0.3123, pruned_loss=0.098, over 21621.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3134, pruned_loss=0.08451, over 4293120.87 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:39:56,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=15.0 2023-06-23 23:40:48,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 5.607e+02 7.001e+02 9.691e+02 1.810e+03, threshold=1.400e+03, percent-clipped=1.0 2023-06-23 23:41:04,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1604982.0, ans=0.125 2023-06-23 23:41:05,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1604982.0, ans=0.0 2023-06-23 23:41:11,078 INFO [train.py:996] (3/4) Epoch 9, batch 23550, loss[loss=0.2169, simple_loss=0.2849, pruned_loss=0.07443, over 21302.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3078, pruned_loss=0.08388, over 4290257.90 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:41:38,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:41:39,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:42:36,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1605282.0, ans=0.025 2023-06-23 23:42:36,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1605282.0, ans=0.025 2023-06-23 23:42:36,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605282.0, ans=0.125 2023-06-23 23:42:45,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1605282.0, ans=0.125 2023-06-23 23:42:54,884 INFO [train.py:996] (3/4) Epoch 9, batch 23600, loss[loss=0.2366, simple_loss=0.3135, pruned_loss=0.07981, over 21987.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3089, pruned_loss=0.08405, over 4288438.45 frames. ], batch size: 317, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:43:35,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1605462.0, ans=0.125 2023-06-23 23:44:15,248 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.548e+02 8.580e+02 1.181e+03 2.336e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 23:44:21,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-23 23:44:28,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.43 vs. limit=22.5 2023-06-23 23:44:43,184 INFO [train.py:996] (3/4) Epoch 9, batch 23650, loss[loss=0.208, simple_loss=0.288, pruned_loss=0.06398, over 21728.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3093, pruned_loss=0.08195, over 4295270.98 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:44:49,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. 
limit=15.0 2023-06-23 23:45:22,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1605762.0, ans=0.07 2023-06-23 23:45:27,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1605762.0, ans=0.05 2023-06-23 23:45:59,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1605822.0, ans=0.0 2023-06-23 23:46:12,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1605882.0, ans=0.07 2023-06-23 23:46:20,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-23 23:46:23,160 INFO [train.py:996] (3/4) Epoch 9, batch 23700, loss[loss=0.2702, simple_loss=0.3418, pruned_loss=0.09935, over 21396.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3121, pruned_loss=0.08185, over 4295644.05 frames. ], batch size: 507, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:46:28,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1605942.0, ans=0.0 2023-06-23 23:46:33,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1605942.0, ans=0.5 2023-06-23 23:47:00,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1606062.0, ans=0.0 2023-06-23 23:47:14,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1606062.0, ans=0.125 2023-06-23 23:47:25,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1606062.0, ans=0.125 2023-06-23 23:47:49,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 6.365e+02 8.254e+02 1.222e+03 2.661e+03, threshold=1.651e+03, percent-clipped=9.0 2023-06-23 23:48:05,077 INFO [train.py:996] (3/4) Epoch 9, batch 23750, loss[loss=0.216, simple_loss=0.3201, pruned_loss=0.05595, over 21668.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3162, pruned_loss=0.08313, over 4291895.80 frames. 
], batch size: 441, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:48:13,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1606242.0, ans=0.0 2023-06-23 23:48:47,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1606362.0, ans=0.95 2023-06-23 23:49:05,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1606362.0, ans=0.0 2023-06-23 23:49:18,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1606422.0, ans=0.125 2023-06-23 23:49:23,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606422.0, ans=0.125 2023-06-23 23:49:34,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1606482.0, ans=0.2 2023-06-23 23:49:41,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1606482.0, ans=0.0 2023-06-23 23:49:45,910 INFO [train.py:996] (3/4) Epoch 9, batch 23800, loss[loss=0.2875, simple_loss=0.38, pruned_loss=0.09747, over 21626.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3132, pruned_loss=0.08049, over 4284887.79 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:49:53,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606542.0, ans=0.1 2023-06-23 23:50:37,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1606662.0, ans=0.0 2023-06-23 23:50:39,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1606662.0, ans=0.0 2023-06-23 23:50:39,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1606662.0, ans=0.0 2023-06-23 23:50:52,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1606662.0, ans=0.035 2023-06-23 23:51:11,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.511e+02 9.622e+02 1.496e+03 3.900e+03, threshold=1.924e+03, percent-clipped=16.0 2023-06-23 23:51:15,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606782.0, ans=0.1 2023-06-23 23:51:26,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606782.0, ans=0.0 2023-06-23 23:51:33,175 INFO [train.py:996] (3/4) Epoch 9, batch 23850, loss[loss=0.2532, simple_loss=0.3228, pruned_loss=0.09174, over 21491.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3224, pruned_loss=0.08279, over 4277705.81 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:51:58,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1606902.0, ans=0.125 2023-06-23 23:52:40,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. 
limit=15.0 2023-06-23 23:52:42,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1607022.0, ans=0.125 2023-06-23 23:53:14,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1607082.0, ans=0.04949747468305833 2023-06-23 23:53:17,357 INFO [train.py:996] (3/4) Epoch 9, batch 23900, loss[loss=0.2341, simple_loss=0.3101, pruned_loss=0.07911, over 21773.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3292, pruned_loss=0.08538, over 4278646.06 frames. ], batch size: 124, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:53:17,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607142.0, ans=0.1 2023-06-23 23:54:30,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.004e+02 6.120e+02 8.437e+02 1.170e+03 2.663e+03, threshold=1.687e+03, percent-clipped=5.0 2023-06-23 23:54:56,276 INFO [train.py:996] (3/4) Epoch 9, batch 23950, loss[loss=0.2722, simple_loss=0.3371, pruned_loss=0.1037, over 21449.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3227, pruned_loss=0.08487, over 4273586.32 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:54:58,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=12.0 2023-06-23 23:55:14,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607442.0, ans=0.1 2023-06-23 23:56:12,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1607622.0, ans=0.125 2023-06-23 23:56:15,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1607682.0, ans=0.125 2023-06-23 23:56:32,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607682.0, ans=0.1 2023-06-23 23:56:39,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1607742.0, ans=0.125 2023-06-23 23:56:40,012 INFO [train.py:996] (3/4) Epoch 9, batch 24000, loss[loss=0.335, simple_loss=0.3818, pruned_loss=0.1441, over 21437.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3229, pruned_loss=0.08754, over 4263807.00 frames. ], batch size: 510, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:56:40,013 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 23:57:00,114 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2698, simple_loss=0.3635, pruned_loss=0.08806, over 1796401.00 frames. 2023-06-23 23:57:00,115 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-23 23:57:18,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1607802.0, ans=0.025 2023-06-23 23:57:19,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.85 vs. 
limit=10.0 2023-06-23 23:57:22,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1607802.0, ans=0.125 2023-06-23 23:57:42,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1607862.0, ans=0.125 2023-06-23 23:57:43,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1607862.0, ans=0.0 2023-06-23 23:58:19,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 5.711e+02 7.408e+02 1.023e+03 1.952e+03, threshold=1.482e+03, percent-clipped=3.0 2023-06-23 23:58:37,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1607982.0, ans=0.125 2023-06-23 23:58:41,976 INFO [train.py:996] (3/4) Epoch 9, batch 24050, loss[loss=0.2442, simple_loss=0.3308, pruned_loss=0.07879, over 21605.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3237, pruned_loss=0.08751, over 4271752.85 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:59:17,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1608162.0, ans=0.125 2023-06-23 23:59:27,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.30 vs. limit=10.0 2023-06-23 23:59:30,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1608162.0, ans=0.035 2023-06-24 00:00:21,936 INFO [train.py:996] (3/4) Epoch 9, batch 24100, loss[loss=0.2666, simple_loss=0.3628, pruned_loss=0.08522, over 20785.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3245, pruned_loss=0.08641, over 4277097.25 frames. ], batch size: 607, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:01:39,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 6.221e+02 8.690e+02 1.208e+03 2.210e+03, threshold=1.738e+03, percent-clipped=15.0 2023-06-24 00:01:43,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1608582.0, ans=0.2 2023-06-24 00:02:00,852 INFO [train.py:996] (3/4) Epoch 9, batch 24150, loss[loss=0.3101, simple_loss=0.374, pruned_loss=0.1231, over 21855.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3247, pruned_loss=0.08859, over 4284689.89 frames. ], batch size: 107, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:02:25,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1608702.0, ans=0.0 2023-06-24 00:03:37,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1608882.0, ans=0.125 2023-06-24 00:03:41,741 INFO [train.py:996] (3/4) Epoch 9, batch 24200, loss[loss=0.2157, simple_loss=0.3019, pruned_loss=0.06477, over 21627.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3265, pruned_loss=0.08949, over 4287772.69 frames. 
], batch size: 230, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:03:55,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1608942.0, ans=0.125 2023-06-24 00:04:16,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1609002.0, ans=0.0 2023-06-24 00:04:26,089 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:04:26,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609062.0, ans=0.1 2023-06-24 00:05:07,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 7.380e+02 9.978e+02 1.387e+03 2.651e+03, threshold=1.996e+03, percent-clipped=13.0 2023-06-24 00:05:14,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1609182.0, ans=0.035 2023-06-24 00:05:22,616 INFO [train.py:996] (3/4) Epoch 9, batch 24250, loss[loss=0.1835, simple_loss=0.2829, pruned_loss=0.04204, over 21672.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3215, pruned_loss=0.0819, over 4289200.70 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:05:54,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. limit=5.0 2023-06-24 00:06:35,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1609422.0, ans=0.0 2023-06-24 00:06:40,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1609422.0, ans=0.2 2023-06-24 00:07:02,860 INFO [train.py:996] (3/4) Epoch 9, batch 24300, loss[loss=0.1811, simple_loss=0.2572, pruned_loss=0.05255, over 21826.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3152, pruned_loss=0.0768, over 4284218.87 frames. ], batch size: 107, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:08:28,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.714e+02 6.136e+02 8.324e+02 1.263e+03 3.161e+03, threshold=1.665e+03, percent-clipped=12.0 2023-06-24 00:08:52,373 INFO [train.py:996] (3/4) Epoch 9, batch 24350, loss[loss=0.2574, simple_loss=0.3274, pruned_loss=0.09374, over 21019.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3109, pruned_loss=0.07648, over 4286498.48 frames. ], batch size: 607, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:09:55,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1610022.0, ans=0.2 2023-06-24 00:10:34,907 INFO [train.py:996] (3/4) Epoch 9, batch 24400, loss[loss=0.2726, simple_loss=0.3454, pruned_loss=0.09992, over 21443.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3161, pruned_loss=0.0808, over 4288795.34 frames. 
], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:10:48,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1610142.0, ans=0.0 2023-06-24 00:11:09,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1610202.0, ans=0.0 2023-06-24 00:11:56,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 5.410e+02 6.876e+02 9.194e+02 2.686e+03, threshold=1.375e+03, percent-clipped=10.0 2023-06-24 00:12:11,140 INFO [train.py:996] (3/4) Epoch 9, batch 24450, loss[loss=0.2473, simple_loss=0.3408, pruned_loss=0.07687, over 21898.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.32, pruned_loss=0.08267, over 4286749.56 frames. ], batch size: 372, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:12:13,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1610442.0, ans=0.125 2023-06-24 00:12:53,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1610562.0, ans=0.125 2023-06-24 00:13:01,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1610562.0, ans=0.0 2023-06-24 00:13:38,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1610682.0, ans=0.0 2023-06-24 00:13:41,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1610682.0, ans=0.1 2023-06-24 00:13:51,169 INFO [train.py:996] (3/4) Epoch 9, batch 24500, loss[loss=0.2105, simple_loss=0.2935, pruned_loss=0.06378, over 21461.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3185, pruned_loss=0.08189, over 4290113.81 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:13:51,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610742.0, ans=0.1 2023-06-24 00:13:53,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1610742.0, ans=0.0 2023-06-24 00:14:02,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1610742.0, ans=10.0 2023-06-24 00:14:29,741 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:15:15,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.883e+02 4.941e+02 6.307e+02 8.711e+02 3.165e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-24 00:15:35,233 INFO [train.py:996] (3/4) Epoch 9, batch 24550, loss[loss=0.2721, simple_loss=0.3432, pruned_loss=0.1006, over 21924.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3211, pruned_loss=0.08437, over 4292674.54 frames. 
], batch size: 372, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:16:04,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611102.0, ans=0.1 2023-06-24 00:16:31,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1611222.0, ans=0.125 2023-06-24 00:16:42,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1611222.0, ans=0.125 2023-06-24 00:17:13,011 INFO [train.py:996] (3/4) Epoch 9, batch 24600, loss[loss=0.2632, simple_loss=0.3229, pruned_loss=0.1018, over 21474.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3151, pruned_loss=0.08386, over 4292768.83 frames. ], batch size: 509, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:17:24,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1611342.0, ans=0.0 2023-06-24 00:17:26,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1611342.0, ans=0.125 2023-06-24 00:17:52,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-24 00:17:56,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1611462.0, ans=0.1 2023-06-24 00:18:00,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1611462.0, ans=0.125 2023-06-24 00:18:33,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.043e+02 5.417e+02 8.330e+02 1.065e+03 1.781e+03, threshold=1.666e+03, percent-clipped=13.0 2023-06-24 00:18:38,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1611582.0, ans=0.125 2023-06-24 00:18:52,526 INFO [train.py:996] (3/4) Epoch 9, batch 24650, loss[loss=0.1824, simple_loss=0.251, pruned_loss=0.05689, over 15698.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3057, pruned_loss=0.08208, over 4287070.85 frames. ], batch size: 64, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:19:07,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-24 00:19:16,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1611702.0, ans=0.1 2023-06-24 00:19:19,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1611702.0, ans=0.125 2023-06-24 00:19:31,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-24 00:20:02,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1611822.0, ans=0.04949747468305833 2023-06-24 00:20:32,018 INFO [train.py:996] (3/4) Epoch 9, batch 24700, loss[loss=0.2115, simple_loss=0.2804, pruned_loss=0.07129, over 21339.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3061, pruned_loss=0.08137, over 4284215.69 frames. 
], batch size: 211, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:20:45,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-24 00:20:48,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611942.0, ans=0.1 2023-06-24 00:21:21,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1612062.0, ans=0.0 2023-06-24 00:21:40,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1612122.0, ans=0.0 2023-06-24 00:21:53,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.774e+02 7.890e+02 1.274e+03 2.911e+03, threshold=1.578e+03, percent-clipped=10.0 2023-06-24 00:22:03,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612182.0, ans=0.1 2023-06-24 00:22:10,833 INFO [train.py:996] (3/4) Epoch 9, batch 24750, loss[loss=0.193, simple_loss=0.2662, pruned_loss=0.05985, over 16300.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3005, pruned_loss=0.07915, over 4257658.55 frames. ], batch size: 67, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:22:14,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1612242.0, ans=0.5 2023-06-24 00:22:22,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1612242.0, ans=0.2 2023-06-24 00:22:28,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612242.0, ans=0.125 2023-06-24 00:22:33,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1612302.0, ans=0.125 2023-06-24 00:23:05,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1612362.0, ans=0.0 2023-06-24 00:23:29,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612482.0, ans=0.1 2023-06-24 00:23:38,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612482.0, ans=0.1 2023-06-24 00:23:44,076 INFO [train.py:996] (3/4) Epoch 9, batch 24800, loss[loss=0.2499, simple_loss=0.2991, pruned_loss=0.1004, over 21700.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2969, pruned_loss=0.07878, over 4256929.03 frames. ], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:23:51,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1612542.0, ans=0.125 2023-06-24 00:23:57,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1612542.0, ans=0.125 2023-06-24 00:24:17,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. 
limit=10.0 2023-06-24 00:24:27,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1612662.0, ans=0.125 2023-06-24 00:24:41,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1612662.0, ans=0.125 2023-06-24 00:24:42,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612662.0, ans=0.125 2023-06-24 00:25:07,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.000e+02 9.294e+02 1.511e+03 3.142e+03, threshold=1.859e+03, percent-clipped=19.0 2023-06-24 00:25:12,531 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:25:12,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1612782.0, ans=0.0 2023-06-24 00:25:17,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1612782.0, ans=15.0 2023-06-24 00:25:22,845 INFO [train.py:996] (3/4) Epoch 9, batch 24850, loss[loss=0.208, simple_loss=0.2682, pruned_loss=0.07388, over 21429.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2985, pruned_loss=0.0808, over 4256585.87 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:25:24,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1612842.0, ans=0.125 2023-06-24 00:25:26,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.82 vs. limit=6.0 2023-06-24 00:25:36,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1612842.0, ans=0.125 2023-06-24 00:25:46,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-24 00:25:47,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1612902.0, ans=0.0 2023-06-24 00:27:06,909 INFO [train.py:996] (3/4) Epoch 9, batch 24900, loss[loss=0.2934, simple_loss=0.3589, pruned_loss=0.114, over 21599.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3005, pruned_loss=0.08045, over 4264838.28 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:28:23,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1613322.0, ans=0.05 2023-06-24 00:28:33,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1613382.0, ans=0.2 2023-06-24 00:28:36,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.969e+02 8.739e+02 1.291e+03 2.372e+03, threshold=1.748e+03, percent-clipped=6.0 2023-06-24 00:28:47,993 INFO [train.py:996] (3/4) Epoch 9, batch 24950, loss[loss=0.3004, simple_loss=0.36, pruned_loss=0.1204, over 21391.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3091, pruned_loss=0.08532, over 4267459.45 frames. 
], batch size: 159, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:29:09,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1613502.0, ans=0.0 2023-06-24 00:29:23,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=8.0 2023-06-24 00:29:44,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613562.0, ans=0.1 2023-06-24 00:30:12,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1613682.0, ans=10.0 2023-06-24 00:30:29,791 INFO [train.py:996] (3/4) Epoch 9, batch 25000, loss[loss=0.1898, simple_loss=0.2401, pruned_loss=0.06976, over 20323.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3146, pruned_loss=0.08606, over 4267589.37 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:31:40,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-24 00:31:55,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1613982.0, ans=0.125 2023-06-24 00:31:57,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 6.297e+02 8.541e+02 1.164e+03 2.225e+03, threshold=1.708e+03, percent-clipped=6.0 2023-06-24 00:32:08,772 INFO [train.py:996] (3/4) Epoch 9, batch 25050, loss[loss=0.2108, simple_loss=0.2667, pruned_loss=0.07745, over 21218.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3075, pruned_loss=0.08438, over 4267706.91 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:32:15,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-24 00:32:50,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1614162.0, ans=0.0 2023-06-24 00:33:31,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1614222.0, ans=0.125 2023-06-24 00:33:50,136 INFO [train.py:996] (3/4) Epoch 9, batch 25100, loss[loss=0.1885, simple_loss=0.2521, pruned_loss=0.06249, over 21636.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3019, pruned_loss=0.08258, over 4272005.94 frames. ], batch size: 247, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:33:56,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1614342.0, ans=0.125 2023-06-24 00:34:21,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1614402.0, ans=0.1 2023-06-24 00:34:32,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1614462.0, ans=0.125 2023-06-24 00:34:32,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=12.0 2023-06-24 00:34:33,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1614462.0, ans=0.0 2023-06-24 00:34:35,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1614462.0, ans=0.09899494936611666 2023-06-24 00:34:40,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1614462.0, ans=0.0 2023-06-24 00:34:48,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1614462.0, ans=0.125 2023-06-24 00:34:52,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0 2023-06-24 00:35:10,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1614582.0, ans=0.125 2023-06-24 00:35:16,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.962e+02 8.437e+02 1.206e+03 2.426e+03, threshold=1.687e+03, percent-clipped=3.0 2023-06-24 00:35:26,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1614642.0, ans=0.125 2023-06-24 00:35:27,816 INFO [train.py:996] (3/4) Epoch 9, batch 25150, loss[loss=0.1916, simple_loss=0.2746, pruned_loss=0.05429, over 15986.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3053, pruned_loss=0.08035, over 4264875.30 frames. ], batch size: 61, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:35:32,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1614642.0, ans=0.95 2023-06-24 00:35:35,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1614642.0, ans=0.1 2023-06-24 00:35:45,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1614702.0, ans=0.1 2023-06-24 00:36:05,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1614702.0, ans=0.125 2023-06-24 00:36:12,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1614762.0, ans=0.035 2023-06-24 00:36:24,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1614822.0, ans=0.1 2023-06-24 00:36:57,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1614882.0, ans=0.125 2023-06-24 00:37:08,150 INFO [train.py:996] (3/4) Epoch 9, batch 25200, loss[loss=0.2037, simple_loss=0.2977, pruned_loss=0.0549, over 20880.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3047, pruned_loss=0.07791, over 4260689.03 frames. 
], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:38:26,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615122.0, ans=0.125 2023-06-24 00:38:35,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 5.246e+02 7.409e+02 1.372e+03 3.913e+03, threshold=1.482e+03, percent-clipped=20.0 2023-06-24 00:38:42,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1615182.0, ans=0.125 2023-06-24 00:38:46,395 INFO [train.py:996] (3/4) Epoch 9, batch 25250, loss[loss=0.2278, simple_loss=0.3001, pruned_loss=0.07771, over 21444.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3022, pruned_loss=0.07602, over 4259899.40 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:39:25,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615362.0, ans=0.1 2023-06-24 00:39:25,420 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:40:24,959 INFO [train.py:996] (3/4) Epoch 9, batch 25300, loss[loss=0.1726, simple_loss=0.2402, pruned_loss=0.05252, over 17266.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2991, pruned_loss=0.07555, over 4253371.53 frames. ], batch size: 62, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:40:30,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1615542.0, ans=0.125 2023-06-24 00:40:43,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-24 00:40:51,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1615602.0, ans=0.125 2023-06-24 00:41:10,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1615662.0, ans=0.0 2023-06-24 00:41:20,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1615662.0, ans=0.0 2023-06-24 00:41:28,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1615722.0, ans=0.125 2023-06-24 00:41:28,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615722.0, ans=0.1 2023-06-24 00:41:37,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1615722.0, ans=0.2 2023-06-24 00:41:55,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.408e+02 6.402e+02 8.209e+02 1.215e+03 2.497e+03, threshold=1.642e+03, percent-clipped=20.0 2023-06-24 00:42:04,753 INFO [train.py:996] (3/4) Epoch 9, batch 25350, loss[loss=0.1828, simple_loss=0.2732, pruned_loss=0.04625, over 21704.00 frames. ], tot_loss[loss=0.227, simple_loss=0.302, pruned_loss=0.076, over 4245648.25 frames. 
], batch size: 332, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:42:52,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1615962.0, ans=0.0 2023-06-24 00:42:52,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-24 00:43:14,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1616022.0, ans=0.0 2023-06-24 00:43:14,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1616022.0, ans=0.125 2023-06-24 00:43:44,790 INFO [train.py:996] (3/4) Epoch 9, batch 25400, loss[loss=0.2258, simple_loss=0.293, pruned_loss=0.07934, over 21592.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2978, pruned_loss=0.07456, over 4253512.59 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:44:01,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-24 00:44:02,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1616202.0, ans=0.0 2023-06-24 00:44:34,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-24 00:45:17,974 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 5.867e+02 9.461e+02 1.414e+03 2.497e+03, threshold=1.892e+03, percent-clipped=14.0 2023-06-24 00:45:27,888 INFO [train.py:996] (3/4) Epoch 9, batch 25450, loss[loss=0.224, simple_loss=0.3239, pruned_loss=0.06209, over 21675.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2989, pruned_loss=0.07684, over 4259109.81 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:45:40,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1616442.0, ans=0.125 2023-06-24 00:46:30,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1616622.0, ans=0.0 2023-06-24 00:46:56,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1616682.0, ans=0.5 2023-06-24 00:47:04,815 INFO [train.py:996] (3/4) Epoch 9, batch 25500, loss[loss=0.2038, simple_loss=0.2979, pruned_loss=0.05488, over 21749.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2992, pruned_loss=0.07406, over 4268506.77 frames. ], batch size: 332, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:47:27,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616802.0, ans=0.1 2023-06-24 00:47:38,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1616802.0, ans=0.125 2023-06-24 00:47:51,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1616862.0, ans=0.2 2023-06-24 00:47:52,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. 
limit=15.0 2023-06-24 00:47:53,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1616862.0, ans=0.125 2023-06-24 00:48:04,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1616862.0, ans=0.2 2023-06-24 00:48:04,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5 2023-06-24 00:48:39,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.898e+02 7.192e+02 1.024e+03 1.607e+03, threshold=1.438e+03, percent-clipped=0.0 2023-06-24 00:48:41,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1616982.0, ans=0.125 2023-06-24 00:48:49,462 INFO [train.py:996] (3/4) Epoch 9, batch 25550, loss[loss=0.2861, simple_loss=0.3799, pruned_loss=0.09614, over 21525.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3078, pruned_loss=0.0755, over 4270347.54 frames. ], batch size: 507, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:48:54,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1617042.0, ans=0.2 2023-06-24 00:49:12,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1617042.0, ans=0.125 2023-06-24 00:50:34,394 INFO [train.py:996] (3/4) Epoch 9, batch 25600, loss[loss=0.3364, simple_loss=0.3872, pruned_loss=0.1428, over 21378.00 frames. ], tot_loss[loss=0.231, simple_loss=0.311, pruned_loss=0.07552, over 4270841.13 frames. ], batch size: 507, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:50:52,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1617342.0, ans=0.125 2023-06-24 00:51:34,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1617462.0, ans=0.0 2023-06-24 00:51:35,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-24 00:51:45,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-24 00:51:59,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 7.503e+02 1.087e+03 1.475e+03 2.223e+03, threshold=2.175e+03, percent-clipped=27.0 2023-06-24 00:52:13,778 INFO [train.py:996] (3/4) Epoch 9, batch 25650, loss[loss=0.2506, simple_loss=0.3335, pruned_loss=0.08384, over 19837.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3108, pruned_loss=0.07799, over 4268830.93 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:52:50,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1617702.0, ans=0.1 2023-06-24 00:52:54,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1617702.0, ans=0.04949747468305833 2023-06-24 00:53:54,158 INFO [train.py:996] (3/4) Epoch 9, batch 25700, loss[loss=0.2393, simple_loss=0.3145, pruned_loss=0.08208, over 21201.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.308, pruned_loss=0.07936, over 4270657.09 frames. 
], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:54:32,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1618002.0, ans=0.125 2023-06-24 00:54:36,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1618002.0, ans=0.125 2023-06-24 00:54:42,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618062.0, ans=0.1 2023-06-24 00:54:45,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1618062.0, ans=0.0 2023-06-24 00:55:05,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1618122.0, ans=0.125 2023-06-24 00:55:10,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1618122.0, ans=0.2 2023-06-24 00:55:24,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.568e+02 8.876e+02 1.245e+03 3.057e+03, threshold=1.775e+03, percent-clipped=5.0 2023-06-24 00:55:36,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1618182.0, ans=0.125 2023-06-24 00:55:39,597 INFO [train.py:996] (3/4) Epoch 9, batch 25750, loss[loss=0.296, simple_loss=0.3591, pruned_loss=0.1165, over 21259.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3127, pruned_loss=0.08216, over 4267281.92 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:55:47,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1618242.0, ans=12.0 2023-06-24 00:55:47,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=12.0 2023-06-24 00:55:52,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-24 00:56:00,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1618242.0, ans=0.125 2023-06-24 00:56:32,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1618362.0, ans=0.2 2023-06-24 00:56:50,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1618422.0, ans=0.0 2023-06-24 00:57:00,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618422.0, ans=0.1 2023-06-24 00:57:33,087 INFO [train.py:996] (3/4) Epoch 9, batch 25800, loss[loss=0.301, simple_loss=0.3694, pruned_loss=0.1163, over 21344.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3259, pruned_loss=0.08683, over 4262312.13 frames. 
], batch size: 507, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:57:49,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1618602.0, ans=0.0 2023-06-24 00:58:02,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1618602.0, ans=0.0 2023-06-24 00:58:56,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-24 00:59:05,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 6.884e+02 9.421e+02 1.458e+03 3.090e+03, threshold=1.884e+03, percent-clipped=11.0 2023-06-24 00:59:14,578 INFO [train.py:996] (3/4) Epoch 9, batch 25850, loss[loss=0.2363, simple_loss=0.3163, pruned_loss=0.07809, over 21819.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3265, pruned_loss=0.0853, over 4268057.95 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:59:20,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618842.0, ans=0.1 2023-06-24 00:59:25,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-24 00:59:29,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1618902.0, ans=0.125 2023-06-24 01:00:30,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-24 01:00:56,715 INFO [train.py:996] (3/4) Epoch 9, batch 25900, loss[loss=0.2415, simple_loss=0.3331, pruned_loss=0.07493, over 19956.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3276, pruned_loss=0.08617, over 4267582.95 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:01:23,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-24 01:01:42,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-24 01:02:13,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1619322.0, ans=0.125 2023-06-24 01:02:27,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.264e+02 6.970e+02 9.529e+02 1.427e+03 2.797e+03, threshold=1.906e+03, percent-clipped=4.0 2023-06-24 01:02:37,403 INFO [train.py:996] (3/4) Epoch 9, batch 25950, loss[loss=0.2704, simple_loss=0.3436, pruned_loss=0.09859, over 21550.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3344, pruned_loss=0.08904, over 4274702.50 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:02:51,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1619442.0, ans=0.125 2023-06-24 01:03:52,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=15.0 2023-06-24 01:03:57,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1619622.0, ans=0.0 2023-06-24 01:04:21,685 INFO [train.py:996] (3/4) Epoch 9, batch 26000, loss[loss=0.2852, simple_loss=0.3608, pruned_loss=0.1049, over 21820.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3335, pruned_loss=0.08789, over 4272285.08 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:05:07,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1619862.0, ans=0.1 2023-06-24 01:05:11,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1619862.0, ans=22.5 2023-06-24 01:05:30,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1619922.0, ans=0.125 2023-06-24 01:05:49,115 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.394e+02 5.949e+02 7.869e+02 1.155e+03 1.920e+03, threshold=1.574e+03, percent-clipped=1.0 2023-06-24 01:06:01,895 INFO [train.py:996] (3/4) Epoch 9, batch 26050, loss[loss=0.2109, simple_loss=0.2743, pruned_loss=0.07373, over 21071.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3324, pruned_loss=0.08882, over 4274673.07 frames. ], batch size: 607, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:06:53,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.50 vs. limit=22.5 2023-06-24 01:06:53,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1620162.0, ans=0.0 2023-06-24 01:07:04,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1620222.0, ans=0.0 2023-06-24 01:07:05,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1620222.0, ans=0.035 2023-06-24 01:07:42,006 INFO [train.py:996] (3/4) Epoch 9, batch 26100, loss[loss=0.2147, simple_loss=0.287, pruned_loss=0.07121, over 21842.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.327, pruned_loss=0.08859, over 4280643.11 frames. 
], batch size: 124, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:07:42,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1620342.0, ans=0.125 2023-06-24 01:08:21,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1620402.0, ans=0.125 2023-06-24 01:08:28,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1620462.0, ans=0.125 2023-06-24 01:08:39,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1620462.0, ans=0.125 2023-06-24 01:08:49,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620522.0, ans=0.1 2023-06-24 01:09:04,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1620582.0, ans=0.0 2023-06-24 01:09:14,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.600e+02 7.578e+02 1.132e+03 2.519e+03, threshold=1.516e+03, percent-clipped=12.0 2023-06-24 01:09:27,680 INFO [train.py:996] (3/4) Epoch 9, batch 26150, loss[loss=0.2807, simple_loss=0.3535, pruned_loss=0.1039, over 21827.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3238, pruned_loss=0.08818, over 4290967.87 frames. ], batch size: 118, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:09:48,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1620642.0, ans=0.0 2023-06-24 01:09:58,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-24 01:10:53,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1620882.0, ans=0.125 2023-06-24 01:11:08,822 INFO [train.py:996] (3/4) Epoch 9, batch 26200, loss[loss=0.149, simple_loss=0.219, pruned_loss=0.03944, over 17353.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3248, pruned_loss=0.08628, over 4287770.95 frames. ], batch size: 60, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:12:19,307 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:12:40,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.187e+02 6.061e+02 7.931e+02 1.081e+03 1.881e+03, threshold=1.586e+03, percent-clipped=8.0 2023-06-24 01:12:48,750 INFO [train.py:996] (3/4) Epoch 9, batch 26250, loss[loss=0.2217, simple_loss=0.2953, pruned_loss=0.07405, over 21480.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3268, pruned_loss=0.08543, over 4282824.34 frames. ], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:12:56,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1621242.0, ans=0.04949747468305833 2023-06-24 01:13:16,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
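limit=22.5

The Whitening entries compare a whiteness statistic of a module's activations against a limit (often itself a scheduled value): a corrective gradient is applied only while the metric exceeds the limit, so an entry such as metric=2.35 vs. limit=6.0 records a check that passed without triggering the penalty. The sketch below shows one way to define such a metric, normalized so that perfectly white features, whose per-group covariance is a multiple of the identity, score exactly 1.0; it follows the general idea behind scaling.py's Whiten module but is an illustration, not a copy of it.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        """x: (num_frames, num_channels). Returns a value >= 1.0 that equals
        1.0 only when each group's covariance is a multiple of the identity."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        cpg = num_channels // num_groups  # channels per group
        x = x.reshape(num_frames, num_groups, cpg).permute(1, 0, 2)
        x = x - x.mean(dim=1, keepdim=True)
        covar = torch.matmul(x.transpose(1, 2), x) / num_frames  # (groups, cpg, cpg)
        mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
        # (covar ** 2).sum() is the sum of squared covariance eigenvalues;
        # the ratio is 1.0 when they are all equal and grows as they spread.
        return float((covar ** 2).sum() / (mean_diag ** 2 * num_groups * cpg))

    x = torch.randn(1000, 256)   # roughly white -> metric near 1 (plus noise)
    print(whitening_metric(x, 1))
    x[:, 0] *= 20.0              # one dominant direction -> metric explodes
    print(whitening_metric(x, 1))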
2023-06-24 01:13:57,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1621422.0, ans=0.125 2023-06-24 01:14:27,380 INFO [train.py:996] (3/4) Epoch 9, batch 26300, loss[loss=0.238, simple_loss=0.3046, pruned_loss=0.08572, over 21309.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3245, pruned_loss=0.08572, over 4285793.16 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:15:50,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-24 01:16:00,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 5.501e+02 7.809e+02 1.118e+03 2.350e+03, threshold=1.562e+03, percent-clipped=10.0 2023-06-24 01:16:08,896 INFO [train.py:996] (3/4) Epoch 9, batch 26350, loss[loss=0.2725, simple_loss=0.3391, pruned_loss=0.103, over 21627.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3232, pruned_loss=0.08651, over 4288821.25 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:17:42,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1622082.0, ans=0.125 2023-06-24 01:17:50,273 INFO [train.py:996] (3/4) Epoch 9, batch 26400, loss[loss=0.2378, simple_loss=0.2853, pruned_loss=0.09519, over 21571.00 frames. ], tot_loss[loss=0.246, simple_loss=0.318, pruned_loss=0.08694, over 4285180.08 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:17:58,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622142.0, ans=0.1 2023-06-24 01:18:36,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1622262.0, ans=0.0 2023-06-24 01:19:14,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-24 01:19:17,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1622382.0, ans=0.2 2023-06-24 01:19:27,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 6.613e+02 9.874e+02 1.376e+03 2.693e+03, threshold=1.975e+03, percent-clipped=17.0 2023-06-24 01:19:33,649 INFO [train.py:996] (3/4) Epoch 9, batch 26450, loss[loss=0.234, simple_loss=0.3017, pruned_loss=0.08308, over 21163.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.316, pruned_loss=0.08616, over 4287006.71 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:19:37,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1622442.0, ans=0.0 2023-06-24 01:19:51,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1622502.0, ans=0.04949747468305833 2023-06-24 01:20:04,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs.
limit=15.0 2023-06-24 01:20:07,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622502.0, ans=0.1 2023-06-24 01:20:30,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1622562.0, ans=0.2 2023-06-24 01:21:08,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1622682.0, ans=0.5 2023-06-24 01:21:16,728 INFO [train.py:996] (3/4) Epoch 9, batch 26500, loss[loss=0.2459, simple_loss=0.3169, pruned_loss=0.08743, over 21848.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3184, pruned_loss=0.08464, over 4277212.25 frames. ], batch size: 282, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:21:47,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1622802.0, ans=0.125 2023-06-24 01:21:58,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622802.0, ans=0.1 2023-06-24 01:22:38,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-24 01:22:58,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1622982.0, ans=0.2 2023-06-24 01:22:59,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 8.000e+02 1.204e+03 2.180e+03 3.765e+03, threshold=2.409e+03, percent-clipped=29.0 2023-06-24 01:23:05,407 INFO [train.py:996] (3/4) Epoch 9, batch 26550, loss[loss=0.2227, simple_loss=0.3215, pruned_loss=0.06195, over 21564.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3154, pruned_loss=0.08167, over 4277523.20 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:23:38,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1623102.0, ans=0.125 2023-06-24 01:23:55,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1623162.0, ans=0.0 2023-06-24 01:23:57,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-24 01:24:25,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-24 01:24:55,364 INFO [train.py:996] (3/4) Epoch 9, batch 26600, loss[loss=0.2083, simple_loss=0.3003, pruned_loss=0.0582, over 21639.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3159, pruned_loss=0.07894, over 4270954.37 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:24:57,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1623342.0, ans=0.125 2023-06-24 01:26:33,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.096e+02 6.459e+02 8.360e+02 2.532e+03, threshold=1.292e+03, percent-clipped=3.0 2023-06-24 01:26:38,310 INFO [train.py:996] (3/4) Epoch 9, batch 26650, loss[loss=0.168, simple_loss=0.246, pruned_loss=0.04494, over 21264.00 frames. 
], tot_loss[loss=0.2315, simple_loss=0.3085, pruned_loss=0.07719, over 4271232.54 frames. ], batch size: 160, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:26:44,609 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:26:49,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1623642.0, ans=0.025 2023-06-24 01:26:58,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1623702.0, ans=0.09899494936611666 2023-06-24 01:27:29,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1623822.0, ans=0.05 2023-06-24 01:28:05,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-24 01:28:17,067 INFO [train.py:996] (3/4) Epoch 9, batch 26700, loss[loss=0.1908, simple_loss=0.2666, pruned_loss=0.05749, over 21815.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3009, pruned_loss=0.07407, over 4264769.22 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:28:40,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1624002.0, ans=0.125 2023-06-24 01:28:47,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1624002.0, ans=0.125 2023-06-24 01:29:02,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1624062.0, ans=0.125 2023-06-24 01:29:48,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1624182.0, ans=0.2 2023-06-24 01:29:58,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.138e+02 5.000e+02 7.039e+02 1.019e+03 2.446e+03, threshold=1.408e+03, percent-clipped=9.0 2023-06-24 01:30:03,512 INFO [train.py:996] (3/4) Epoch 9, batch 26750, loss[loss=0.1943, simple_loss=0.2947, pruned_loss=0.04698, over 21702.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3021, pruned_loss=0.07399, over 4274342.66 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 8.0 2023-06-24 01:30:19,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 01:30:32,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1624302.0, ans=0.0 2023-06-24 01:30:53,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-24 01:31:18,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.66 vs. limit=15.0 2023-06-24 01:31:40,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1624482.0, ans=0.125 2023-06-24 01:31:44,729 INFO [train.py:996] (3/4) Epoch 9, batch 26800, loss[loss=0.302, simple_loss=0.3673, pruned_loss=0.1184, over 21426.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3097, pruned_loss=0.07854, over 4281131.00 frames. 
], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:32:16,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1624602.0, ans=0.1 2023-06-24 01:32:34,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.11 vs. limit=15.0 2023-06-24 01:33:19,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.341e+02 8.835e+02 1.201e+03 2.832e+03, threshold=1.767e+03, percent-clipped=16.0 2023-06-24 01:33:24,492 INFO [train.py:996] (3/4) Epoch 9, batch 26850, loss[loss=0.2235, simple_loss=0.2814, pruned_loss=0.08278, over 20084.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3104, pruned_loss=0.08077, over 4275562.39 frames. ], batch size: 703, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:33:34,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1624842.0, ans=0.125 2023-06-24 01:33:41,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1624842.0, ans=0.125 2023-06-24 01:33:42,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1624842.0, ans=0.0 2023-06-24 01:33:52,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1624902.0, ans=0.0 2023-06-24 01:33:55,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1624902.0, ans=0.1 2023-06-24 01:34:07,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1624962.0, ans=0.2 2023-06-24 01:34:26,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1624962.0, ans=0.125 2023-06-24 01:34:38,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1625022.0, ans=0.125 2023-06-24 01:34:40,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1625022.0, ans=0.125 2023-06-24 01:34:41,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625022.0, ans=0.1 2023-06-24 01:34:44,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-24 01:35:07,308 INFO [train.py:996] (3/4) Epoch 9, batch 26900, loss[loss=0.208, simple_loss=0.2699, pruned_loss=0.07303, over 21756.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3027, pruned_loss=0.07954, over 4267628.92 frames. ], batch size: 300, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:35:33,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. 
limit=15.0 2023-06-24 01:35:36,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1625202.0, ans=0.2 2023-06-24 01:35:55,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625262.0, ans=0.125 2023-06-24 01:36:19,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1625322.0, ans=0.0 2023-06-24 01:36:22,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1625322.0, ans=0.0 2023-06-24 01:36:37,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.702e+02 7.755e+02 1.168e+03 3.900e+03, threshold=1.551e+03, percent-clipped=4.0 2023-06-24 01:36:41,752 INFO [train.py:996] (3/4) Epoch 9, batch 26950, loss[loss=0.2695, simple_loss=0.3592, pruned_loss=0.08991, over 21769.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3027, pruned_loss=0.07963, over 4271088.88 frames. ], batch size: 282, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:37:59,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1625622.0, ans=0.125 2023-06-24 01:38:23,717 INFO [train.py:996] (3/4) Epoch 9, batch 27000, loss[loss=0.2042, simple_loss=0.2924, pruned_loss=0.05805, over 21686.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3035, pruned_loss=0.07776, over 4264173.19 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:23,718 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 01:38:43,005 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2397, simple_loss=0.3375, pruned_loss=0.07102, over 1796401.00 frames. 2023-06-24 01:38:43,006 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 01:38:48,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625742.0, ans=0.125 2023-06-24 01:39:26,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1625802.0, ans=10.0 2023-06-24 01:39:27,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-24 01:39:56,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625922.0, ans=0.1 2023-06-24 01:39:57,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1625922.0, ans=0.0 2023-06-24 01:40:02,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1625982.0, ans=0.125 2023-06-24 01:40:18,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.629e+02 9.070e+02 1.221e+03 2.568e+03, threshold=1.814e+03, percent-clipped=16.0 2023-06-24 01:40:23,207 INFO [train.py:996] (3/4) Epoch 9, batch 27050, loss[loss=0.2222, simple_loss=0.3115, pruned_loss=0.06645, over 21861.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3062, pruned_loss=0.07488, over 4266708.85 frames. 
], batch size: 351, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:41:04,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1626102.0, ans=0.125 2023-06-24 01:41:20,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1626162.0, ans=0.125 2023-06-24 01:41:25,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1626222.0, ans=0.2 2023-06-24 01:42:04,825 INFO [train.py:996] (3/4) Epoch 9, batch 27100, loss[loss=0.2268, simple_loss=0.3248, pruned_loss=0.06442, over 21608.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3086, pruned_loss=0.07665, over 4277900.68 frames. ], batch size: 230, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:42:23,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1626342.0, ans=0.125 2023-06-24 01:42:26,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1626402.0, ans=0.2 2023-06-24 01:42:54,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1626462.0, ans=0.125 2023-06-24 01:42:55,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-24 01:43:33,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-24 01:43:41,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-24 01:43:42,164 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.346e+02 6.138e+02 8.870e+02 1.301e+03 2.299e+03, threshold=1.774e+03, percent-clipped=4.0 2023-06-24 01:43:47,576 INFO [train.py:996] (3/4) Epoch 9, batch 27150, loss[loss=0.2707, simple_loss=0.357, pruned_loss=0.09225, over 21620.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3232, pruned_loss=0.08163, over 4276794.66 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:44:43,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-24 01:45:09,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1626882.0, ans=0.0 2023-06-24 01:45:17,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1626882.0, ans=0.125 2023-06-24 01:45:32,203 INFO [train.py:996] (3/4) Epoch 9, batch 27200, loss[loss=0.2698, simple_loss=0.3477, pruned_loss=0.09597, over 21434.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3305, pruned_loss=0.08402, over 4279743.86 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:45:45,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-06-24 01:45:53,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1626942.0, ans=0.0 2023-06-24 01:45:59,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1627002.0, ans=0.2 2023-06-24 01:46:32,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-24 01:46:44,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-06-24 01:47:17,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.892e+02 9.906e+02 1.306e+03 3.118e+03, threshold=1.981e+03, percent-clipped=15.0 2023-06-24 01:47:22,709 INFO [train.py:996] (3/4) Epoch 9, batch 27250, loss[loss=0.2938, simple_loss=0.3541, pruned_loss=0.1168, over 21408.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3324, pruned_loss=0.08782, over 4280110.87 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:48:00,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1627362.0, ans=0.125 2023-06-24 01:48:49,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627482.0, ans=0.1 2023-06-24 01:49:03,878 INFO [train.py:996] (3/4) Epoch 9, batch 27300, loss[loss=0.2251, simple_loss=0.3165, pruned_loss=0.0669, over 21791.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3342, pruned_loss=0.08869, over 4281465.96 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:49:07,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1627542.0, ans=0.125 2023-06-24 01:50:24,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1627722.0, ans=0.125 2023-06-24 01:50:37,311 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:50:40,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 5.255e+02 6.671e+02 8.856e+02 1.687e+03, threshold=1.334e+03, percent-clipped=0.0 2023-06-24 01:50:43,350 INFO [train.py:996] (3/4) Epoch 9, batch 27350, loss[loss=0.2682, simple_loss=0.3403, pruned_loss=0.09802, over 21625.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3357, pruned_loss=0.08904, over 4276653.04 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:51:13,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1627902.0, ans=0.0 2023-06-24 01:51:13,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1627902.0, ans=0.2 2023-06-24 01:52:04,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1628082.0, ans=0.2 2023-06-24 01:52:09,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1628082.0, ans=0.125 2023-06-24 01:52:21,407 INFO [train.py:996] (3/4) Epoch 9, batch 27400, loss[loss=0.2075, simple_loss=0.2718, pruned_loss=0.07156, over 21609.00 frames. 
], tot_loss[loss=0.2529, simple_loss=0.3294, pruned_loss=0.0882, over 4283724.72 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:52:23,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-24 01:52:24,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1628142.0, ans=0.0 2023-06-24 01:53:00,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1628262.0, ans=0.0 2023-06-24 01:53:57,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1628382.0, ans=0.0 2023-06-24 01:53:58,829 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.992e+02 5.150e+02 6.408e+02 1.007e+03 1.892e+03, threshold=1.282e+03, percent-clipped=7.0 2023-06-24 01:54:01,965 INFO [train.py:996] (3/4) Epoch 9, batch 27450, loss[loss=0.2519, simple_loss=0.3331, pruned_loss=0.08534, over 21885.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3232, pruned_loss=0.08628, over 4282381.54 frames. ], batch size: 372, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:54:16,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1628442.0, ans=0.04949747468305833 2023-06-24 01:54:16,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-24 01:54:24,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1628502.0, ans=0.125 2023-06-24 01:54:25,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1628502.0, ans=0.5 2023-06-24 01:54:54,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-24 01:55:30,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1628682.0, ans=0.1 2023-06-24 01:55:39,261 INFO [train.py:996] (3/4) Epoch 9, batch 27500, loss[loss=0.2386, simple_loss=0.298, pruned_loss=0.08961, over 21843.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.322, pruned_loss=0.08682, over 4283912.00 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:55:52,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1628742.0, ans=0.09899494936611666 2023-06-24 01:56:36,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1628862.0, ans=0.125 2023-06-24 01:56:38,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1628862.0, ans=0.125 2023-06-24 01:57:05,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.44 vs. 
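limit=15.0

The optim.py entries summarize the recent gradient-norm distribution: the five numbers are the minimum, the three quartiles, and the maximum; threshold is the clipping threshold currently in force; and percent-clipped is the share of recent steps whose gradients were clipped. In these entries the threshold equals Clipping_scale times the logged median (for example 2.0 x 6.606e+02 = 1.321e+03 just below), which points to a clipping rule driven by a running median rather than a fixed constant. The sketch below implements that rule under stated assumptions; the class and method names, and the window size, are illustrative, not the actual icefall API.

    from collections import deque
    import numpy as np

    class GradNormClipper:
        """Clip gradients at clipping_scale times the median of a window of
        recent gradient norms (inferred from the log; window size assumed)."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.num_steps = 0
            self.num_clipped = 0

        def __call__(self, grad_norm: float) -> float:
            """Returns the factor (<= 1.0) to scale gradients by this step."""
            self.norms.append(grad_norm)
            self.num_steps += 1
            threshold = self.clipping_scale * float(np.median(self.norms))
            if grad_norm > threshold:
                self.num_clipped += 1
                return threshold / grad_norm
            return 1.0

        def stats(self):
            """Quartile summary and clip rate, as reported in the log lines."""
            q = np.percentile(np.asarray(self.norms), [0, 25, 50, 75, 100])
            pct = 100.0 * self.num_clipped / max(1, self.num_steps)
            return q, pct

    clipper = GradNormClipper(clipping_scale=2.0)
    for norm in (400.0, 600.0, 850.0, 1200.0, 2200.0):
        factor = clipper(norm)  # multiply gradients by this before stepping
    print(clipper.stats())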
2023-06-24 01:57:11,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.736e+02 5.167e+02 6.606e+02 9.536e+02 1.970e+03, threshold=1.321e+03, percent-clipped=8.0 2023-06-24 01:57:11,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1628982.0, ans=0.125 2023-06-24 01:57:17,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629042.0, ans=0.1 2023-06-24 01:57:18,524 INFO [train.py:996] (3/4) Epoch 9, batch 27550, loss[loss=0.3102, simple_loss=0.4085, pruned_loss=0.106, over 19951.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3173, pruned_loss=0.08429, over 4279950.03 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:57:34,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-24 01:57:42,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1629102.0, ans=0.125 2023-06-24 01:58:12,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-24 01:58:22,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1629222.0, ans=0.0 2023-06-24 01:58:32,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1629222.0, ans=0.0 2023-06-24 01:58:36,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1629282.0, ans=0.07 2023-06-24 01:58:56,721 INFO [train.py:996] (3/4) Epoch 9, batch 27600, loss[loss=0.2591, simple_loss=0.3311, pruned_loss=0.09353, over 19983.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3101, pruned_loss=0.08332, over 4280652.66 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:59:16,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1629402.0, ans=0.2 2023-06-24 01:59:28,695 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:00:01,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1629522.0, ans=0.5 2023-06-24 02:00:02,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1629522.0, ans=0.125 2023-06-24 02:00:08,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1629522.0, ans=0.0 2023-06-24 02:00:12,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1629522.0, ans=0.0 2023-06-24 02:00:21,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs.
limit=10.0 2023-06-24 02:00:27,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 6.454e+02 1.076e+03 1.748e+03 4.788e+03, threshold=2.152e+03, percent-clipped=40.0 2023-06-24 02:00:31,082 INFO [train.py:996] (3/4) Epoch 9, batch 27650, loss[loss=0.2351, simple_loss=0.2957, pruned_loss=0.08721, over 21616.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3051, pruned_loss=0.08277, over 4268805.07 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:00:42,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1629642.0, ans=0.125 2023-06-24 02:00:46,820 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:00:50,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1629702.0, ans=0.125 2023-06-24 02:01:31,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1629762.0, ans=0.125 2023-06-24 02:01:44,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1629822.0, ans=22.5 2023-06-24 02:02:14,150 INFO [train.py:996] (3/4) Epoch 9, batch 27700, loss[loss=0.2769, simple_loss=0.3706, pruned_loss=0.0916, over 20841.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3058, pruned_loss=0.08096, over 4274330.41 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:02:26,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-24 02:02:41,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1630002.0, ans=0.0 2023-06-24 02:02:51,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1630002.0, ans=0.125 2023-06-24 02:03:09,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1630062.0, ans=0.125 2023-06-24 02:03:53,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.018e+02 9.406e+02 1.396e+03 2.807e+03, threshold=1.881e+03, percent-clipped=3.0 2023-06-24 02:03:55,109 INFO [train.py:996] (3/4) Epoch 9, batch 27750, loss[loss=0.2626, simple_loss=0.3348, pruned_loss=0.09521, over 21749.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3082, pruned_loss=0.08042, over 4271561.92 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:04:24,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630302.0, ans=0.0 2023-06-24 02:04:46,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1630362.0, ans=0.2 2023-06-24 02:04:54,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1630362.0, ans=0.0 2023-06-24 02:05:06,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1630422.0, ans=0.125 2023-06-24 02:05:28,014 INFO [train.py:996] (3/4) Epoch 9, batch 27800, loss[loss=0.2697, simple_loss=0.3395, pruned_loss=0.09993, over 21852.00 frames. 
], tot_loss[loss=0.2345, simple_loss=0.3074, pruned_loss=0.08075, over 4274242.97 frames. ], batch size: 107, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:05:45,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1630542.0, ans=0.125 2023-06-24 02:05:59,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-24 02:06:30,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1630662.0, ans=0.2 2023-06-24 02:06:34,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-24 02:07:10,669 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.786e+02 6.303e+02 8.930e+02 1.307e+03 2.305e+03, threshold=1.786e+03, percent-clipped=6.0 2023-06-24 02:07:12,451 INFO [train.py:996] (3/4) Epoch 9, batch 27850, loss[loss=0.2693, simple_loss=0.3451, pruned_loss=0.09678, over 21740.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3062, pruned_loss=0.08163, over 4282713.57 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:07:26,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1630842.0, ans=0.125 2023-06-24 02:07:38,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-24 02:07:49,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1630902.0, ans=0.02 2023-06-24 02:08:49,364 INFO [train.py:996] (3/4) Epoch 9, batch 27900, loss[loss=0.208, simple_loss=0.2957, pruned_loss=0.06013, over 21676.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3155, pruned_loss=0.08249, over 4285855.96 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:09:04,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1631142.0, ans=0.125 2023-06-24 02:09:04,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-24 02:09:29,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-24 02:09:30,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-24 02:09:36,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1631262.0, ans=0.125 2023-06-24 02:10:44,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 5.823e+02 7.958e+02 1.181e+03 2.526e+03, threshold=1.592e+03, percent-clipped=8.0 2023-06-24 02:10:45,591 INFO [train.py:996] (3/4) Epoch 9, batch 27950, loss[loss=0.1683, simple_loss=0.2526, pruned_loss=0.04199, over 21493.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3156, pruned_loss=0.07917, over 4282838.93 frames. 
2023-06-24 02:11:08,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1631502.0, ans=0.0
2023-06-24 02:11:35,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1631562.0, ans=0.125
2023-06-24 02:11:37,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1631622.0, ans=0.0
2023-06-24 02:12:17,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1631682.0, ans=0.0
2023-06-24 02:12:20,022 INFO [train.py:996] (3/4) Epoch 9, batch 28000, loss[loss=0.2363, simple_loss=0.3026, pruned_loss=0.08505, over 21218.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3123, pruned_loss=0.07657, over 4288086.33 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 32.0
2023-06-24 02:12:20,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1631742.0, ans=0.1
2023-06-24 02:12:47,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1631802.0, ans=0.0
2023-06-24 02:12:59,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5
2023-06-24 02:13:32,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1631982.0, ans=0.0
2023-06-24 02:14:06,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.113e+02 6.367e+02 8.563e+02 1.195e+03 2.843e+03, threshold=1.713e+03, percent-clipped=10.0
2023-06-24 02:14:06,400 INFO [train.py:996] (3/4) Epoch 9, batch 28050, loss[loss=0.2399, simple_loss=0.3318, pruned_loss=0.07407, over 21713.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3106, pruned_loss=0.07805, over 4288639.74 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:14:13,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1632042.0, ans=0.125
2023-06-24 02:15:45,596 INFO [train.py:996] (3/4) Epoch 9, batch 28100, loss[loss=0.2175, simple_loss=0.3174, pruned_loss=0.05877, over 20833.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3082, pruned_loss=0.07819, over 4288710.41 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0
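The [scaling.py:182] lines record ScheduledFloat values: module hyperparameters (dropout rates, balancer probabilities, skip rates, whitening limits) that follow a schedule keyed on the global batch count, logged as ans=... . A self-contained sketch of such a piecewise-linear schedule; the breakpoints below are illustrative, not the recipe's actual schedules:

class ScheduledFloat:
    # A float that is a piecewise-linear function of the global batch count.
    def __init__(self, *points, default=0.0):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)
        self.default = default
        self.batch_count = None  # set by the training loop

    def __float__(self):
        if self.batch_count is None or not self.points:
            return self.default
        x = self.batch_count
        if x <= self.points[0][0]:
            return self.points[0][1]
        if x >= self.points[-1][0]:
            return self.points[-1][1]
        for (x0, y0), (x1, y1) in zip(self.points, self.points[1:]):
            if x0 <= x <= x1:
                # linear interpolation between the two nearest breakpoints
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return self.default

drop = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
drop.batch_count = 1631502.0   # far past the schedule end
print(float(drop))             # -> 0.1, the final scheduled value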
2023-06-24 02:15:50,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1632342.0, ans=0.04949747468305833
2023-06-24 02:16:05,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1632402.0, ans=0.125
2023-06-24 02:16:07,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1632402.0, ans=0.5
2023-06-24 02:16:32,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1632462.0, ans=0.0
2023-06-24 02:16:54,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1632522.0, ans=0.0
2023-06-24 02:17:22,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.058e+02 5.561e+02 8.006e+02 1.245e+03 3.714e+03, threshold=1.601e+03, percent-clipped=14.0
2023-06-24 02:17:22,348 INFO [train.py:996] (3/4) Epoch 9, batch 28150, loss[loss=0.1799, simple_loss=0.246, pruned_loss=0.05694, over 21460.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3021, pruned_loss=0.07759, over 4290673.16 frames. ], batch size: 212, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:17:37,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632702.0, ans=0.1
2023-06-24 02:18:19,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1632822.0, ans=0.025
2023-06-24 02:18:42,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1632882.0, ans=0.125
2023-06-24 02:18:56,302 INFO [train.py:996] (3/4) Epoch 9, batch 28200, loss[loss=0.2668, simple_loss=0.3351, pruned_loss=0.09931, over 21777.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3013, pruned_loss=0.07931, over 4275180.91 frames. ], batch size: 124, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:19:11,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1633002.0, ans=0.04949747468305833
2023-06-24 02:19:16,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1633002.0, ans=0.1
2023-06-24 02:19:44,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1633062.0, ans=0.0
2023-06-24 02:19:52,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1633122.0, ans=0.125
2023-06-24 02:20:12,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1633122.0, ans=0.0
2023-06-24 02:20:35,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 7.070e+02 1.041e+03 1.545e+03 2.791e+03, threshold=2.082e+03, percent-clipped=22.0
2023-06-24 02:20:35,776 INFO [train.py:996] (3/4) Epoch 9, batch 28250, loss[loss=0.2149, simple_loss=0.2792, pruned_loss=0.0753, over 21642.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3041, pruned_loss=0.08088, over 4264240.56 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:20:52,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1633302.0, ans=0.0
2023-06-24 02:22:17,407 INFO [train.py:996] (3/4) Epoch 9, batch 28300, loss[loss=0.2604, simple_loss=0.3438, pruned_loss=0.08845, over 21509.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3027, pruned_loss=0.08035, over 4261379.40 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:22:47,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0
2023-06-24 02:23:53,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0
2023-06-24 02:23:56,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.879e+02 9.596e+02 1.261e+03 3.601e+03, threshold=1.919e+03, percent-clipped=6.0
2023-06-24 02:23:56,813 INFO [train.py:996] (3/4) Epoch 9, batch 28350, loss[loss=0.1853, simple_loss=0.253, pruned_loss=0.05878, over 21843.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2975, pruned_loss=0.07413, over 4258678.44 frames. ], batch size: 98, lr: 3.21e-03, grad_scale: 16.0
2023-06-24 02:24:04,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1633842.0, ans=0.0
2023-06-24 02:24:29,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1633902.0, ans=0.1
2023-06-24 02:24:54,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1633962.0, ans=0.125
2023-06-24 02:25:12,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1634022.0, ans=0.125
2023-06-24 02:25:17,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0
2023-06-24 02:25:40,914 INFO [train.py:996] (3/4) Epoch 9, batch 28400, loss[loss=0.1965, simple_loss=0.3314, pruned_loss=0.0308, over 20795.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2966, pruned_loss=0.07354, over 4259052.11 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 32.0
2023-06-24 02:25:50,663 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 02:25:53,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1634142.0, ans=0.0
2023-06-24 02:27:03,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0
2023-06-24 02:27:14,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1634382.0, ans=0.2
2023-06-24 02:27:18,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 5.958e+02 8.832e+02 1.277e+03 2.222e+03, threshold=1.766e+03, percent-clipped=4.0
2023-06-24 02:27:18,275 INFO [train.py:996] (3/4) Epoch 9, batch 28450, loss[loss=0.2923, simple_loss=0.3617, pruned_loss=0.1115, over 21556.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3011, pruned_loss=0.07755, over 4272426.56 frames. ], batch size: 414, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:27:27,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1634442.0, ans=0.0
2023-06-24 02:27:40,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634502.0, ans=0.1
2023-06-24 02:27:44,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1634502.0, ans=0.2
2023-06-24 02:28:33,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.41 vs. limit=10.0
2023-06-24 02:28:55,994 INFO [train.py:996] (3/4) Epoch 9, batch 28500, loss[loss=0.2696, simple_loss=0.331, pruned_loss=0.1041, over 21337.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.304, pruned_loss=0.08051, over 4279261.31 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:29:15,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1634802.0, ans=0.1
2023-06-24 02:29:28,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1634802.0, ans=0.2
2023-06-24 02:29:51,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1634862.0, ans=0.09899494936611666
2023-06-24 02:30:29,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1634982.0, ans=0.0
2023-06-24 02:30:40,763 INFO [train.py:996] (3/4) Epoch 9, batch 28550, loss[loss=0.2234, simple_loss=0.2986, pruned_loss=0.0741, over 20753.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3117, pruned_loss=0.08236, over 4284755.03 frames. ], batch size: 607, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:30:42,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.957e+02 5.998e+02 7.738e+02 1.217e+03 2.112e+03, threshold=1.548e+03, percent-clipped=6.0
2023-06-24 02:30:59,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1635102.0, ans=0.0
2023-06-24 02:31:02,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1635102.0, ans=0.125
2023-06-24 02:31:11,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1635102.0, ans=0.125
2023-06-24 02:31:15,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1635102.0, ans=0.125
2023-06-24 02:31:27,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635162.0, ans=0.125
2023-06-24 02:32:22,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.05 vs. limit=10.0
2023-06-24 02:32:24,243 INFO [train.py:996] (3/4) Epoch 9, batch 28600, loss[loss=0.2724, simple_loss=0.3378, pruned_loss=0.1035, over 21506.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3169, pruned_loss=0.08361, over 4272068.52 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 8.0
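The [scaling.py:962] Whitening lines compare a per-module metric against a limit (e.g. metric=7.41 vs. limit=10.0); when the metric exceeds the limit, the module nudges its activations back toward whiter statistics. One plausible metric is sketched below, under the assumption that it measures how unevenly variance is spread across eigendirections of the channel covariance; the real scaling.py formula may differ:

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels). Returns exactly 1.0 for perfectly
    # "white" (isotropic) features; grows when variance concentrates in
    # a few directions.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].t() @ x[:, g, :] / num_frames  # channel covariance
        eigs = torch.linalg.eigvalsh(cov).clamp(min=0)
        # second moment of the eigenvalues over their squared mean
        metrics.append((eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return sum(metrics) / len(metrics)

torch.manual_seed(0)
print(whitening_metric(torch.randn(1000, 256)))  # slightly above 1 for Gaussian noise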
2023-06-24 02:33:25,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1635522.0, ans=0.125
2023-06-24 02:33:44,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1635582.0, ans=0.125
2023-06-24 02:34:02,746 INFO [train.py:996] (3/4) Epoch 9, batch 28650, loss[loss=0.2265, simple_loss=0.2891, pruned_loss=0.08198, over 21811.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3119, pruned_loss=0.08329, over 4278232.86 frames. ], batch size: 352, lr: 3.20e-03, grad_scale: 8.0
2023-06-24 02:34:11,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.746e+02 6.069e+02 8.380e+02 1.162e+03 2.307e+03, threshold=1.676e+03, percent-clipped=7.0
2023-06-24 02:34:13,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1635642.0, ans=0.2
2023-06-24 02:34:27,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1635702.0, ans=0.0
2023-06-24 02:35:30,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1635882.0, ans=0.0
2023-06-24 02:35:46,696 INFO [train.py:996] (3/4) Epoch 9, batch 28700, loss[loss=0.2499, simple_loss=0.3147, pruned_loss=0.09254, over 21775.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3113, pruned_loss=0.08496, over 4276958.97 frames. ], batch size: 441, lr: 3.20e-03, grad_scale: 8.0
2023-06-24 02:35:59,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1635942.0, ans=0.0
2023-06-24 02:36:31,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1636062.0, ans=0.125
2023-06-24 02:37:25,556 INFO [train.py:996] (3/4) Epoch 9, batch 28750, loss[loss=0.2398, simple_loss=0.3086, pruned_loss=0.08553, over 21407.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3124, pruned_loss=0.08621, over 4283195.72 frames. ], batch size: 143, lr: 3.20e-03, grad_scale: 8.0
2023-06-24 02:37:27,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1636242.0, ans=0.125
2023-06-24 02:37:28,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.971e+02 6.417e+02 8.454e+02 1.129e+03 2.571e+03, threshold=1.691e+03, percent-clipped=6.0
2023-06-24 02:37:30,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1636242.0, ans=0.125
2023-06-24 02:37:34,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1636242.0, ans=0.125
2023-06-24 02:37:35,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=8.0
2023-06-24 02:37:45,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0
2023-06-24 02:38:14,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0
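grad_scale in the [train.py:996] lines swings between 32.0, 16.0 and 8.0; with use_fp16=True in the config this is the dynamic loss-scaling factor of mixed-precision training (torch.cuda.amp.GradScaler in real code). A hand-rolled sketch of the halve-on-overflow, grow-after-stable-steps rule; the growth interval is an assumption:

class DynamicGradScale:
    # Mimics AMP dynamic loss scaling: halve the scale when non-finite
    # gradients appear, double it after a long run of clean steps.
    def __init__(self, init_scale=1.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads_finite):
        if not grads_finite:
            self.scale /= 2.0        # overflow: halve and skip this step
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= 2.0    # stable run: try a larger scale
                self._good_steps = 0

scaler = DynamicGradScale(init_scale=32.0)
scaler.update(grads_finite=False)   # an fp16 overflow
print(scaler.scale)                 # -> 16.0, matching the 32.0 -> 16.0 drops above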
2023-06-24 02:38:58,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1636482.0, ans=0.125
2023-06-24 02:39:04,496 INFO [train.py:996] (3/4) Epoch 9, batch 28800, loss[loss=0.2787, simple_loss=0.3451, pruned_loss=0.1061, over 21641.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3161, pruned_loss=0.08677, over 4289092.46 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:39:23,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1636602.0, ans=0.0
2023-06-24 02:39:25,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1636602.0, ans=0.125
2023-06-24 02:39:26,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636602.0, ans=0.1
2023-06-24 02:39:31,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636602.0, ans=0.1
2023-06-24 02:39:48,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636662.0, ans=0.1
2023-06-24 02:40:30,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636782.0, ans=0.1
2023-06-24 02:40:43,043 INFO [train.py:996] (3/4) Epoch 9, batch 28850, loss[loss=0.2537, simple_loss=0.3261, pruned_loss=0.09061, over 21867.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3159, pruned_loss=0.08753, over 4290363.24 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:40:46,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.030e+02 9.281e+02 1.224e+03 2.045e+03, threshold=1.856e+03, percent-clipped=4.0
2023-06-24 02:41:44,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1637022.0, ans=0.0
2023-06-24 02:42:23,113 INFO [train.py:996] (3/4) Epoch 9, batch 28900, loss[loss=0.2993, simple_loss=0.3653, pruned_loss=0.1167, over 21515.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3191, pruned_loss=0.08916, over 4294988.84 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:42:25,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1637142.0, ans=0.125
2023-06-24 02:42:46,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. limit=10.0
2023-06-24 02:42:52,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637202.0, ans=0.1
2023-06-24 02:43:24,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637322.0, ans=0.1
2023-06-24 02:43:36,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1637322.0, ans=0.0
2023-06-24 02:43:36,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1637322.0, ans=0.125
2023-06-24 02:44:08,014 INFO [train.py:996] (3/4) Epoch 9, batch 28950, loss[loss=0.2732, simple_loss=0.3587, pruned_loss=0.09384, over 21705.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3198, pruned_loss=0.08827, over 4283683.80 frames. ], batch size: 441, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:44:11,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.524e+02 1.128e+03 1.793e+03 3.083e+03, threshold=2.257e+03, percent-clipped=23.0
2023-06-24 02:45:17,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1637622.0, ans=0.95
2023-06-24 02:45:17,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1637622.0, ans=0.125
2023-06-24 02:45:24,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1637622.0, ans=0.125
2023-06-24 02:45:31,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0
2023-06-24 02:45:47,869 INFO [train.py:996] (3/4) Epoch 9, batch 29000, loss[loss=0.2807, simple_loss=0.3433, pruned_loss=0.1091, over 21640.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3229, pruned_loss=0.0881, over 4275806.05 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:46:08,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1637742.0, ans=0.125
2023-06-24 02:46:30,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1637802.0, ans=0.0
2023-06-24 02:46:42,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1637862.0, ans=0.125
2023-06-24 02:47:13,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0
2023-06-24 02:47:20,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637982.0, ans=0.1
2023-06-24 02:47:32,715 INFO [train.py:996] (3/4) Epoch 9, batch 29050, loss[loss=0.2703, simple_loss=0.3226, pruned_loss=0.109, over 21786.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3235, pruned_loss=0.08915, over 4280665.46 frames. ], batch size: 508, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:47:40,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.530e+02 1.098e+03 1.738e+03 3.592e+03, threshold=2.195e+03, percent-clipped=7.0
2023-06-24 02:48:14,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5
2023-06-24 02:48:48,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0
2023-06-24 02:48:59,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-24 02:49:09,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=12.0
2023-06-24 02:49:11,758 INFO [train.py:996] (3/4) Epoch 9, batch 29100, loss[loss=0.1994, simple_loss=0.267, pruned_loss=0.06592, over 20133.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3154, pruned_loss=0.08668, over 4271647.00 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:49:12,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1638342.0, ans=0.125
2023-06-24 02:49:49,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1638402.0, ans=0.125
2023-06-24 02:50:35,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1638582.0, ans=0.95
2023-06-24 02:50:54,249 INFO [train.py:996] (3/4) Epoch 9, batch 29150, loss[loss=0.2744, simple_loss=0.369, pruned_loss=0.08991, over 21228.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3135, pruned_loss=0.0851, over 4272461.46 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 02:50:57,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.333e+02 5.735e+02 8.237e+02 1.411e+03 3.649e+03, threshold=1.647e+03, percent-clipped=7.0
2023-06-24 02:51:02,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1638642.0, ans=0.2
2023-06-24 02:51:36,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1638762.0, ans=0.125
2023-06-24 02:51:59,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1638822.0, ans=0.125
2023-06-24 02:52:28,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=12.0
2023-06-24 02:52:29,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 02:52:32,382 INFO [train.py:996] (3/4) Epoch 9, batch 29200, loss[loss=0.2057, simple_loss=0.2623, pruned_loss=0.07449, over 21162.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.309, pruned_loss=0.08405, over 4274042.47 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:53:14,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0
2023-06-24 02:54:06,528 INFO [train.py:996] (3/4) Epoch 9, batch 29250, loss[loss=0.216, simple_loss=0.2975, pruned_loss=0.06723, over 21232.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.308, pruned_loss=0.0818, over 4273134.97 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:54:09,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.813e+02 6.323e+02 1.080e+03 1.364e+03 2.361e+03, threshold=2.161e+03, percent-clipped=10.0
2023-06-24 02:54:10,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1639242.0, ans=0.125
2023-06-24 02:54:34,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1639302.0, ans=0.125
2023-06-24 02:54:36,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1639302.0, ans=0.0
2023-06-24 02:54:43,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1639362.0, ans=0.0
2023-06-24 02:55:42,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639482.0, ans=0.1
2023-06-24 02:55:45,680 INFO [train.py:996] (3/4) Epoch 9, batch 29300, loss[loss=0.2089, simple_loss=0.2667, pruned_loss=0.07551, over 21818.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3079, pruned_loss=0.08028, over 4273617.93 frames. ], batch size: 98, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:56:07,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1639602.0, ans=0.0
2023-06-24 02:56:14,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1639602.0, ans=0.2
2023-06-24 02:56:21,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1639662.0, ans=0.0
2023-06-24 02:57:07,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1639782.0, ans=0.2
2023-06-24 02:57:10,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639782.0, ans=0.1
2023-06-24 02:57:21,357 INFO [train.py:996] (3/4) Epoch 9, batch 29350, loss[loss=0.2538, simple_loss=0.308, pruned_loss=0.09987, over 21772.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3038, pruned_loss=0.0797, over 4275639.07 frames. ], batch size: 102, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:57:29,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 5.877e+02 8.410e+02 1.271e+03 3.253e+03, threshold=1.682e+03, percent-clipped=5.0
2023-06-24 02:57:31,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1639842.0, ans=0.125
2023-06-24 02:58:22,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1640022.0, ans=0.125
2023-06-24 02:59:02,510 INFO [train.py:996] (3/4) Epoch 9, batch 29400, loss[loss=0.1693, simple_loss=0.2308, pruned_loss=0.05393, over 21266.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3008, pruned_loss=0.07668, over 4274725.34 frames. ], batch size: 176, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 02:59:51,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1640262.0, ans=0.2
2023-06-24 03:00:06,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640322.0, ans=0.1
2023-06-24 03:00:38,187 INFO [train.py:996] (3/4) Epoch 9, batch 29450, loss[loss=0.2779, simple_loss=0.3394, pruned_loss=0.1082, over 21352.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.07702, over 4272959.81 frames. ], batch size: 176, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:00:43,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 8.018e+02 1.543e+03 2.395e+03 4.126e+03, threshold=3.085e+03, percent-clipped=41.0
2023-06-24 03:01:04,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1640502.0, ans=0.2
2023-06-24 03:01:33,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640562.0, ans=0.0
2023-06-24 03:01:48,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1640622.0, ans=0.0
2023-06-24 03:02:13,441 INFO [train.py:996] (3/4) Epoch 9, batch 29500, loss[loss=0.2226, simple_loss=0.2912, pruned_loss=0.07695, over 21925.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3077, pruned_loss=0.08018, over 4277820.12 frames. ], batch size: 333, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:02:55,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0
2023-06-24 03:03:24,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0
2023-06-24 03:03:52,436 INFO [train.py:996] (3/4) Epoch 9, batch 29550, loss[loss=0.2047, simple_loss=0.2722, pruned_loss=0.06862, over 21168.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3073, pruned_loss=0.08177, over 4288029.08 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:03:57,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.109e+02 5.276e+02 6.489e+02 8.099e+02 1.842e+03, threshold=1.298e+03, percent-clipped=0.0
2023-06-24 03:05:28,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5
2023-06-24 03:05:33,524 INFO [train.py:996] (3/4) Epoch 9, batch 29600, loss[loss=0.275, simple_loss=0.3664, pruned_loss=0.09182, over 21728.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3157, pruned_loss=0.08477, over 4290563.51 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 03:05:40,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1641342.0, ans=0.125
2023-06-24 03:06:24,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1641462.0, ans=0.125
2023-06-24 03:06:38,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0
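Each [train.py:996] line reports the current batch as loss[..., over N frames. ] and a frame-weighted aggregate as tot_loss[..., over M frames. ]. A sketch of that bookkeeping with a dict-based tracker; the class name mirrors icefall's MetricsTracker, but this reduced version is an assumption:

class MetricsTracker(dict):
    # Accumulates loss * frames per metric; printing divides by frame count.
    def __add__(self, other):
        out = MetricsTracker(self)
        for k, v in other.items():
            out[k] = out.get(k, 0.0) + v
        return out

    def __str__(self):
        frames = self["frames"]
        keys = [f"{k}={self[k] / frames:.4}" for k in self if k != "frames"]
        return f"[{', '.join(keys)}, over {frames:.2f} frames. ]"

# One batch from the log above: loss=0.1693 over 21266.00 frames.
batch = MetricsTracker(frames=21266.0, loss=0.1693 * 21266)
tot = MetricsTracker(frames=0.0) + batch
print(tot)  # -> [loss=0.1693, over 21266.00 frames. ]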
2023-06-24 03:06:40,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1641522.0, ans=0.2
2023-06-24 03:06:45,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641522.0, ans=0.1
2023-06-24 03:07:16,578 INFO [train.py:996] (3/4) Epoch 9, batch 29650, loss[loss=0.2436, simple_loss=0.3615, pruned_loss=0.06287, over 19799.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3117, pruned_loss=0.08033, over 4284710.17 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 03:07:21,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.535e+02 6.438e+02 9.787e+02 1.351e+03 2.800e+03, threshold=1.957e+03, percent-clipped=29.0
2023-06-24 03:07:21,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1641642.0, ans=0.0
2023-06-24 03:07:29,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641642.0, ans=0.1
2023-06-24 03:08:31,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0
2023-06-24 03:08:56,640 INFO [train.py:996] (3/4) Epoch 9, batch 29700, loss[loss=0.2795, simple_loss=0.3636, pruned_loss=0.09771, over 21433.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.08082, over 4287367.69 frames. ], batch size: 211, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:09:23,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0
2023-06-24 03:10:19,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642182.0, ans=0.1
2023-06-24 03:10:19,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1642182.0, ans=0.04949747468305833
2023-06-24 03:10:34,572 INFO [train.py:996] (3/4) Epoch 9, batch 29750, loss[loss=0.2534, simple_loss=0.3401, pruned_loss=0.0833, over 21442.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3182, pruned_loss=0.08014, over 4282112.79 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:10:40,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.742e+02 5.529e+02 6.914e+02 9.553e+02 2.350e+03, threshold=1.383e+03, percent-clipped=6.0
2023-06-24 03:11:25,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1642362.0, ans=0.125
2023-06-24 03:12:13,535 INFO [train.py:996] (3/4) Epoch 9, batch 29800, loss[loss=0.2581, simple_loss=0.3239, pruned_loss=0.09617, over 21758.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3203, pruned_loss=0.08164, over 4292284.70 frames. ], batch size: 389, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:12:39,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1642602.0, ans=0.125
2023-06-24 03:12:42,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1642602.0, ans=10.0
2023-06-24 03:12:43,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1642602.0, ans=0.125
2023-06-24 03:13:28,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1642722.0, ans=0.125
2023-06-24 03:13:32,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1642782.0, ans=0.125
2023-06-24 03:13:51,063 INFO [train.py:996] (3/4) Epoch 9, batch 29850, loss[loss=0.2252, simple_loss=0.2956, pruned_loss=0.07742, over 21850.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3167, pruned_loss=0.07923, over 4285985.90 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:13:57,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 7.548e+02 1.159e+03 1.635e+03 3.345e+03, threshold=2.317e+03, percent-clipped=36.0
2023-06-24 03:14:07,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1642902.0, ans=0.5
2023-06-24 03:14:10,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0
2023-06-24 03:14:55,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643022.0, ans=0.1
2023-06-24 03:15:07,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1643082.0, ans=0.0
2023-06-24 03:15:27,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1643142.0, ans=0.125
2023-06-24 03:15:29,135 INFO [train.py:996] (3/4) Epoch 9, batch 29900, loss[loss=0.2387, simple_loss=0.3061, pruned_loss=0.08564, over 21398.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3136, pruned_loss=0.07988, over 4287774.73 frames. ], batch size: 176, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:15:36,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1643142.0, ans=0.07
2023-06-24 03:15:48,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1643202.0, ans=0.125
2023-06-24 03:16:59,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1643382.0, ans=0.125
2023-06-24 03:17:08,335 INFO [train.py:996] (3/4) Epoch 9, batch 29950, loss[loss=0.2936, simple_loss=0.3575, pruned_loss=0.1148, over 21470.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3174, pruned_loss=0.0837, over 4288034.34 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0
2023-06-24 03:17:19,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 5.748e+02 7.806e+02 1.232e+03 2.482e+03, threshold=1.561e+03, percent-clipped=2.0
2023-06-24 03:17:24,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1643442.0, ans=0.125
2023-06-24 03:17:54,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1643562.0, ans=0.125
2023-06-24 03:19:00,235 INFO [train.py:996] (3/4) Epoch 9, batch 30000, loss[loss=0.2629, simple_loss=0.3553, pruned_loss=0.0853, over 21466.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3202, pruned_loss=0.08422, over 4287036.29 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 03:19:00,236 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-24 03:19:13,457 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6076, 3.2443, 1.8887, 1.5050], device='cuda:3')
2023-06-24 03:19:17,058 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2502, simple_loss=0.3471, pruned_loss=0.07663, over 1796401.00 frames.
2023-06-24 03:19:17,059 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-24 03:20:25,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0
2023-06-24 03:21:04,956 INFO [train.py:996] (3/4) Epoch 9, batch 30050, loss[loss=0.1413, simple_loss=0.183, pruned_loss=0.04983, over 16143.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3212, pruned_loss=0.08098, over 4273853.42 frames. ], batch size: 60, lr: 3.20e-03, grad_scale: 32.0
2023-06-24 03:21:11,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.143e+02 7.535e+02 1.024e+03 1.337e+03 2.624e+03, threshold=2.049e+03, percent-clipped=15.0
2023-06-24 03:21:44,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0
2023-06-24 03:22:12,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1644222.0, ans=0.125
2023-06-24 03:22:44,508 INFO [train.py:996] (3/4) Epoch 9, batch 30100, loss[loss=0.2023, simple_loss=0.2652, pruned_loss=0.06975, over 21513.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3197, pruned_loss=0.0807, over 4273179.95 frames. ], batch size: 195, lr: 3.20e-03, grad_scale: 16.0
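At batch 30000 the loop pauses for "Computing validation loss" and reports a frame-weighted average over the full dev set (over 1796401.00 frames) plus peak GPU memory. A generic sketch of that step; the loss_fn interface below is an assumption, not icefall's actual signature:

import torch

def compute_validation_loss(model, dl, loss_fn, device="cpu"):
    # Frame-weighted average loss over the whole dev set, model in eval mode.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dl:
            loss, num_frames = loss_fn(model, batch, device)  # assumed interface
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    if torch.cuda.is_available():
        # cf. "Maximum memory allocated so far is 24550MB" above
        print(f"Maximum memory allocated so far is "
              f"{torch.cuda.max_memory_allocated() // 2**20}MB")
    return tot_loss / tot_frames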
2023-06-24 03:23:04,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644402.0, ans=0.1
2023-06-24 03:23:18,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1644402.0, ans=0.125
2023-06-24 03:23:29,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1644462.0, ans=0.125
2023-06-24 03:23:29,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644462.0, ans=0.1
2023-06-24 03:23:36,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1644462.0, ans=0.2
2023-06-24 03:23:40,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1644522.0, ans=0.125
2023-06-24 03:23:47,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644522.0, ans=0.1
2023-06-24 03:23:57,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1644522.0, ans=0.0
2023-06-24 03:24:19,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0
2023-06-24 03:24:21,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5
2023-06-24 03:24:29,637 INFO [train.py:996] (3/4) Epoch 9, batch 30150, loss[loss=0.2626, simple_loss=0.3321, pruned_loss=0.09655, over 21560.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3165, pruned_loss=0.08259, over 4275883.27 frames. ], batch size: 415, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:24:34,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1644642.0, ans=0.125
2023-06-24 03:24:36,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1644642.0, ans=0.125
2023-06-24 03:24:38,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 6.683e+02 1.059e+03 1.463e+03 4.541e+03, threshold=2.119e+03, percent-clipped=12.0
2023-06-24 03:24:47,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0
2023-06-24 03:25:48,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1644822.0, ans=0.125
2023-06-24 03:26:12,257 INFO [train.py:996] (3/4) Epoch 9, batch 30200, loss[loss=0.276, simple_loss=0.3758, pruned_loss=0.08809, over 21636.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3186, pruned_loss=0.08144, over 4280874.58 frames. ], batch size: 414, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:26:58,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1645062.0, ans=0.125
2023-06-24 03:26:59,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1645062.0, ans=0.0
2023-06-24 03:27:06,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645062.0, ans=0.1
2023-06-24 03:27:09,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1645062.0, ans=0.2
2023-06-24 03:27:13,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1645062.0, ans=0.125
2023-06-24 03:27:46,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=10.0
2023-06-24 03:27:50,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1645182.0, ans=0.0
2023-06-24 03:27:57,818 INFO [train.py:996] (3/4) Epoch 9, batch 30250, loss[loss=0.3006, simple_loss=0.4023, pruned_loss=0.09949, over 21677.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3275, pruned_loss=0.08444, over 4275392.98 frames. ], batch size: 441, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:28:04,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1645242.0, ans=0.125
2023-06-24 03:28:05,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.017e+02 5.727e+02 7.436e+02 1.048e+03 2.592e+03, threshold=1.487e+03, percent-clipped=2.0
2023-06-24 03:28:12,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1645242.0, ans=0.125
2023-06-24 03:28:14,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0
2023-06-24 03:28:17,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0
2023-06-24 03:28:45,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0
2023-06-24 03:29:36,890 INFO [train.py:996] (3/4) Epoch 9, batch 30300, loss[loss=0.1994, simple_loss=0.2661, pruned_loss=0.06635, over 20709.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3237, pruned_loss=0.0835, over 4272543.82 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:30:01,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=12.0
2023-06-24 03:30:33,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0
2023-06-24 03:30:52,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1645722.0, ans=0.125
2023-06-24 03:31:28,042 INFO [train.py:996] (3/4) Epoch 9, batch 30350, loss[loss=0.2582, simple_loss=0.3367, pruned_loss=0.08985, over 21692.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3254, pruned_loss=0.08496, over 4272955.46 frames. ], batch size: 298, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:31:36,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.062e+02 6.729e+02 9.654e+02 1.457e+03 3.930e+03, threshold=1.931e+03, percent-clipped=23.0
2023-06-24 03:31:48,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0
2023-06-24 03:32:06,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1645962.0, ans=0.025
2023-06-24 03:32:47,133 INFO [train.py:996] (3/4) Epoch 9, batch 30400, loss[loss=0.217, simple_loss=0.2655, pruned_loss=0.08426, over 20343.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3195, pruned_loss=0.08311, over 4265601.25 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0
2023-06-24 03:32:48,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1646142.0, ans=0.0
2023-06-24 03:33:31,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1646262.0, ans=0.125
2023-06-24 03:33:57,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0
2023-06-24 03:34:00,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1646382.0, ans=0.2
2023-06-24 03:34:05,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1646382.0, ans=0.1
2023-06-24 03:34:12,608 INFO [train.py:996] (3/4) Epoch 9, batch 30450, loss[loss=0.2832, simple_loss=0.4094, pruned_loss=0.07851, over 19931.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3208, pruned_loss=0.08247, over 4205224.53 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0
2023-06-24 03:34:21,422 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.756e+02 1.127e+03 2.078e+03 9.482e+03, threshold=2.254e+03, percent-clipped=27.0
2023-06-24 03:34:51,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1646562.0, ans=0.2
2023-06-24 03:35:13,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1646682.0, ans=0.125
2023-06-24 03:36:54,763 INFO [train.py:996] (3/4) Epoch 10, batch 0, loss[loss=0.214, simple_loss=0.2852, pruned_loss=0.07135, over 21866.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2852, pruned_loss=0.07135, over 21866.00 frames. ], batch size: 373, lr: 3.02e-03, grad_scale: 32.0
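The learning rate holds near 3.19e-03 late in epoch 9 and drops to 3.02e-03 at the start of epoch 10, consistent with the Eden schedule from icefall's optim.py and the config values base_lr=0.045, lr_batches=7500, lr_epochs=1.5. A reconstruction of that formula (treat the exact code as a sketch, not a verbatim copy; the batch index below is approximate):

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    # lr decays smoothly in both the global batch index and the epoch index
    return (base_lr
            * ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
            * ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25)

print(f"{eden_lr(0.045, 245_000, 10):.2e}")  # ~3.0e-03, cf. lr: 3.02e-03 above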
2023-06-24 03:36:54,764 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-24 03:37:01,980 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6833, 3.5535, 1.9645, 1.5057], device='cuda:3')
2023-06-24 03:37:07,388 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9933, 3.7601, 2.2257, 1.7466], device='cuda:3')
2023-06-24 03:37:10,565 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2396, simple_loss=0.3488, pruned_loss=0.06521, over 1796401.00 frames.
2023-06-24 03:37:10,565 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-24 03:37:23,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1646712.0, ans=0.125
2023-06-24 03:38:17,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1646892.0, ans=0.04949747468305833
2023-06-24 03:38:22,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1646892.0, ans=0.0
2023-06-24 03:38:49,155 INFO [train.py:996] (3/4) Epoch 10, batch 50, loss[loss=0.2269, simple_loss=0.32, pruned_loss=0.0669, over 21589.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3279, pruned_loss=0.0857, over 962448.04 frames. ], batch size: 230, lr: 3.02e-03, grad_scale: 16.0
2023-06-24 03:39:18,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 8.479e+02 1.547e+03 2.623e+03 5.891e+03, threshold=3.095e+03, percent-clipped=28.0
2023-06-24 03:39:49,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1647132.0, ans=0.0
2023-06-24 03:39:53,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1647192.0, ans=0.125
2023-06-24 03:40:17,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1647252.0, ans=0.125
2023-06-24 03:40:17,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1647252.0, ans=0.0
2023-06-24 03:40:29,197 INFO [train.py:996] (3/4) Epoch 10, batch 100, loss[loss=0.2228, simple_loss=0.3301, pruned_loss=0.05775, over 20717.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3409, pruned_loss=0.08603, over 1699734.69 frames. ], batch size: 607, lr: 3.02e-03, grad_scale: 16.0
2023-06-24 03:41:07,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647372.0, ans=0.1
2023-06-24 03:41:30,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1647492.0, ans=0.0
2023-06-24 03:41:38,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1647492.0, ans=0.125
2023-06-24 03:42:02,777 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 03:42:05,166 INFO [train.py:996] (3/4) Epoch 10, batch 150, loss[loss=0.2533, simple_loss=0.3354, pruned_loss=0.08567, over 21631.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3432, pruned_loss=0.08643, over 2271173.05 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 8.0
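The [zipformer.py:1728] lines dump attn_weights_entropy tensors with one value per attention head (four entries, matching num_heads=4 for that encoder stack). A sketch of how such a diagnostic can be computed from softmaxed attention weights; averaging over positions is an assumption about the reduction used:

import torch

def attn_weights_entropy(attn):
    # attn: (num_heads, tgt_len, src_len); each row is a softmax distribution.
    ent = -(attn * (attn + 1e-20).log()).sum(dim=-1)  # (num_heads, tgt_len)
    return ent.mean(dim=-1)                           # one entropy per head

w = torch.softmax(torch.randn(4, 10, 10), dim=-1)
print(attn_weights_entropy(w))  # a 4-entry tensor, like the logged ones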
2023-06-24 03:42:12,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0
2023-06-24 03:42:39,493 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 6.057e+02 8.805e+02 1.461e+03 2.839e+03, threshold=1.761e+03, percent-clipped=0.0
2023-06-24 03:42:39,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1647672.0, ans=0.0
2023-06-24 03:42:50,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1647732.0, ans=0.125
2023-06-24 03:43:07,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647792.0, ans=0.1
2023-06-24 03:43:17,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1647792.0, ans=0.125
2023-06-24 03:43:46,538 INFO [train.py:996] (3/4) Epoch 10, batch 200, loss[loss=0.2368, simple_loss=0.3046, pruned_loss=0.08453, over 21921.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3386, pruned_loss=0.08454, over 2720755.18 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 8.0
2023-06-24 03:44:22,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1647972.0, ans=0.0
2023-06-24 03:44:24,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1647972.0, ans=0.2
2023-06-24 03:44:27,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1648032.0, ans=0.0
2023-06-24 03:44:28,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1648032.0, ans=0.0
2023-06-24 03:44:54,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1648092.0, ans=0.1
2023-06-24 03:45:18,471 INFO [train.py:996] (3/4) Epoch 10, batch 250, loss[loss=0.2162, simple_loss=0.279, pruned_loss=0.07673, over 22039.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3322, pruned_loss=0.08382, over 3070132.65 frames. ], batch size: 103, lr: 3.01e-03, grad_scale: 8.0
], batch size: 103, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:45:20,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1648212.0, ans=0.125 2023-06-24 03:45:32,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1648212.0, ans=0.125 2023-06-24 03:45:48,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.127e+02 6.491e+02 8.586e+02 1.362e+03 2.608e+03, threshold=1.717e+03, percent-clipped=13.0 2023-06-24 03:46:02,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1648332.0, ans=0.0 2023-06-24 03:46:21,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1648392.0, ans=0.125 2023-06-24 03:46:41,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-24 03:46:58,106 INFO [train.py:996] (3/4) Epoch 10, batch 300, loss[loss=0.2212, simple_loss=0.2935, pruned_loss=0.0744, over 20000.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3264, pruned_loss=0.08385, over 3333792.15 frames. ], batch size: 702, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:47:00,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1648512.0, ans=0.07 2023-06-24 03:47:21,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1648572.0, ans=0.1 2023-06-24 03:47:25,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1648572.0, ans=0.125 2023-06-24 03:47:47,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-24 03:48:01,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1648692.0, ans=0.125 2023-06-24 03:48:25,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648752.0, ans=0.1 2023-06-24 03:48:27,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1648752.0, ans=0.125 2023-06-24 03:48:34,864 INFO [train.py:996] (3/4) Epoch 10, batch 350, loss[loss=0.2351, simple_loss=0.3259, pruned_loss=0.07216, over 21729.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3169, pruned_loss=0.0821, over 3539623.86 frames. ], batch size: 118, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:48:58,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1648812.0, ans=0.0 2023-06-24 03:49:06,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.104e+02 6.936e+02 9.570e+02 1.355e+03 2.301e+03, threshold=1.914e+03, percent-clipped=7.0 2023-06-24 03:49:23,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1648932.0, ans=0.125 2023-06-24 03:49:37,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-06-24 03:50:14,757 INFO [train.py:996] (3/4) Epoch 10, batch 400, loss[loss=0.2249, simple_loss=0.3226, pruned_loss=0.06364, over 21803.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3103, pruned_loss=0.07982, over 3685826.03 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:50:51,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649172.0, ans=0.1 2023-06-24 03:51:25,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1649292.0, ans=0.95 2023-06-24 03:51:41,026 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:51:57,933 INFO [train.py:996] (3/4) Epoch 10, batch 450, loss[loss=0.2828, simple_loss=0.393, pruned_loss=0.08635, over 21721.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3085, pruned_loss=0.07863, over 3816566.44 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:52:18,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1649472.0, ans=0.2 2023-06-24 03:52:22,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.467e+02 1.066e+03 1.544e+03 3.388e+03, threshold=2.132e+03, percent-clipped=13.0 2023-06-24 03:53:29,645 INFO [train.py:996] (3/4) Epoch 10, batch 500, loss[loss=0.2039, simple_loss=0.2789, pruned_loss=0.06446, over 21657.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3081, pruned_loss=0.07695, over 3922648.29 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:53:50,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1649712.0, ans=0.2 2023-06-24 03:54:30,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649892.0, ans=0.1 2023-06-24 03:55:07,315 INFO [train.py:996] (3/4) Epoch 10, batch 550, loss[loss=0.1923, simple_loss=0.2469, pruned_loss=0.06885, over 19958.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3117, pruned_loss=0.07743, over 4001202.77 frames. ], batch size: 704, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:55:15,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1650012.0, ans=0.0 2023-06-24 03:55:19,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1650012.0, ans=0.0 2023-06-24 03:55:32,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.909e+02 8.917e+02 1.249e+03 2.003e+03 3.580e+03, threshold=2.497e+03, percent-clipped=21.0 2023-06-24 03:56:40,149 INFO [train.py:996] (3/4) Epoch 10, batch 600, loss[loss=0.2536, simple_loss=0.3621, pruned_loss=0.07251, over 21705.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3146, pruned_loss=0.07755, over 4070131.41 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:58:02,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1650552.0, ans=0.05 2023-06-24 03:58:13,507 INFO [train.py:996] (3/4) Epoch 10, batch 650, loss[loss=0.2381, simple_loss=0.3129, pruned_loss=0.08167, over 21898.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.318, pruned_loss=0.07823, over 4115644.18 frames. 
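grad_scale in these records is the dynamic loss scale used for mixed-precision training: it is halved when a batch produces inf/nan gradients and doubled again after a stretch of stable steps, which is consistent with the swings between 8.0 and 32.0 through this section. A sketch with PyTorch's stock GradScaler; the intervals and the placeholder model/loss are illustrative, not this recipe's exact loop.

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=16.0,      # starting grad_scale (illustrative)
    growth_factor=2.0,    # double after growth_interval stable steps
    backoff_factor=0.5,   # halve on inf/nan gradients
    growth_interval=2000,
)

def training_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch).mean()     # placeholder forward/loss
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # skipped if gradients overflowed
    scaler.update()                    # adjusts the scale logged as grad_scale
    return loss.item(), scaler.get_scale()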
], batch size: 124, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:58:39,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1650672.0, ans=0.0 2023-06-24 03:58:44,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 7.027e+02 1.084e+03 1.748e+03 3.374e+03, threshold=2.167e+03, percent-clipped=5.0 2023-06-24 03:58:49,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1650672.0, ans=0.125 2023-06-24 03:59:33,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-24 03:59:37,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650852.0, ans=0.1 2023-06-24 03:59:41,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1650852.0, ans=0.125 2023-06-24 03:59:41,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1650852.0, ans=0.1 2023-06-24 03:59:45,375 INFO [train.py:996] (3/4) Epoch 10, batch 700, loss[loss=0.2445, simple_loss=0.3024, pruned_loss=0.09336, over 21350.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3195, pruned_loss=0.07899, over 4153560.50 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:00:18,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1650972.0, ans=0.125 2023-06-24 04:01:04,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-24 04:01:07,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1651152.0, ans=0.125 2023-06-24 04:01:09,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651152.0, ans=0.1 2023-06-24 04:01:14,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-24 04:01:27,485 INFO [train.py:996] (3/4) Epoch 10, batch 750, loss[loss=0.302, simple_loss=0.4291, pruned_loss=0.0875, over 19709.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3176, pruned_loss=0.07901, over 4183859.81 frames. 
], batch size: 702, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:01:49,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651272.0, ans=0.1 2023-06-24 04:01:53,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.183e+02 6.075e+02 9.989e+02 1.388e+03 3.247e+03, threshold=1.998e+03, percent-clipped=7.0 2023-06-24 04:01:59,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1651272.0, ans=0.125 2023-06-24 04:02:03,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651332.0, ans=0.1 2023-06-24 04:02:12,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1651332.0, ans=0.0 2023-06-24 04:02:41,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1651452.0, ans=0.125 2023-06-24 04:02:43,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1651452.0, ans=0.0 2023-06-24 04:02:46,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1651452.0, ans=0.125 2023-06-24 04:02:50,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-24 04:03:01,135 INFO [train.py:996] (3/4) Epoch 10, batch 800, loss[loss=0.2131, simple_loss=0.2735, pruned_loss=0.07634, over 21624.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3126, pruned_loss=0.07931, over 4208031.66 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:03:39,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651572.0, ans=0.125 2023-06-24 04:03:40,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1651632.0, ans=0.125 2023-06-24 04:03:41,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-24 04:03:43,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1651632.0, ans=0.0 2023-06-24 04:03:56,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-24 04:04:02,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-24 04:04:36,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1651752.0, ans=0.05 2023-06-24 04:04:39,120 INFO [train.py:996] (3/4) Epoch 10, batch 850, loss[loss=0.2428, simple_loss=0.3038, pruned_loss=0.09087, over 21757.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3094, pruned_loss=0.07965, over 4227895.61 frames. 
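The scaling.py:182 records show module hyperparameters (dropout p, skip rates, balancer probabilities, whitening limits) whose current value 'ans' is a function of batch_count; by batch 1.65M most have annealed to their final floors. A minimal piecewise-linear schedule in the spirit of ScheduledFloat; this class is a simplified stand-in, not the library implementation.

import bisect

class PiecewiseLinearSchedule:
    def __init__(self, *points):
        # points: (batch_count, value) pairs sorted by batch_count, e.g. a
        # skip rate ramping 0.5 -> 0.0 early in training.
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate(1651272.0))  # 0.0 - long past the ramp, matching the 'ans=0.0' lines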
], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:05:10,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.435e+02 1.007e+03 1.415e+03 2.798e+03, threshold=2.014e+03, percent-clipped=8.0 2023-06-24 04:05:15,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1651872.0, ans=0.125 2023-06-24 04:05:34,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1651932.0, ans=0.0 2023-06-24 04:06:15,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-24 04:06:16,296 INFO [train.py:996] (3/4) Epoch 10, batch 900, loss[loss=0.2042, simple_loss=0.2801, pruned_loss=0.06417, over 21057.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3082, pruned_loss=0.07941, over 4233691.56 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:06:37,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1652112.0, ans=0.1 2023-06-24 04:06:56,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-24 04:07:15,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1652232.0, ans=0.0 2023-06-24 04:08:04,373 INFO [train.py:996] (3/4) Epoch 10, batch 950, loss[loss=0.235, simple_loss=0.2981, pruned_loss=0.08592, over 20342.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3072, pruned_loss=0.08068, over 4249659.62 frames. ], batch size: 703, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:08:19,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1652472.0, ans=0.0 2023-06-24 04:08:27,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.619e+02 7.658e+02 1.220e+03 3.060e+03, threshold=1.532e+03, percent-clipped=1.0 2023-06-24 04:08:29,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-24 04:09:38,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1652712.0, ans=0.2 2023-06-24 04:09:39,790 INFO [train.py:996] (3/4) Epoch 10, batch 1000, loss[loss=0.2557, simple_loss=0.3249, pruned_loss=0.09322, over 21777.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.306, pruned_loss=0.08095, over 4257791.97 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:09:42,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1652712.0, ans=0.0 2023-06-24 04:10:37,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 04:11:19,948 INFO [train.py:996] (3/4) Epoch 10, batch 1050, loss[loss=0.2187, simple_loss=0.3002, pruned_loss=0.06863, over 21395.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3061, pruned_loss=0.08087, over 4267653.45 frames. 
], batch size: 194, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:11:46,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 8.659e+02 1.096e+03 1.679e+03 3.356e+03, threshold=2.191e+03, percent-clipped=32.0 2023-06-24 04:11:57,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1653132.0, ans=15.0 2023-06-24 04:12:53,884 INFO [train.py:996] (3/4) Epoch 10, batch 1100, loss[loss=0.2402, simple_loss=0.3193, pruned_loss=0.08053, over 21693.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3073, pruned_loss=0.07994, over 4276398.66 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:13:10,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1653312.0, ans=0.0 2023-06-24 04:13:11,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653312.0, ans=0.1 2023-06-24 04:13:23,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1653372.0, ans=0.0 2023-06-24 04:13:26,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1653372.0, ans=0.07 2023-06-24 04:13:48,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1653432.0, ans=0.015 2023-06-24 04:13:57,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1653492.0, ans=22.5 2023-06-24 04:14:04,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1653492.0, ans=0.125 2023-06-24 04:14:32,906 INFO [train.py:996] (3/4) Epoch 10, batch 1150, loss[loss=0.2176, simple_loss=0.2751, pruned_loss=0.08006, over 20709.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3059, pruned_loss=0.07804, over 4282387.26 frames. ], batch size: 608, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:14:53,898 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:15:00,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.939e+02 5.998e+02 8.488e+02 1.315e+03 2.677e+03, threshold=1.698e+03, percent-clipped=3.0 2023-06-24 04:15:20,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653732.0, ans=0.1 2023-06-24 04:15:36,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1653792.0, ans=0.2 2023-06-24 04:16:08,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1653852.0, ans=0.125 2023-06-24 04:16:15,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1653852.0, ans=0.125 2023-06-24 04:16:17,785 INFO [train.py:996] (3/4) Epoch 10, batch 1200, loss[loss=0.2481, simple_loss=0.3188, pruned_loss=0.08874, over 21442.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3096, pruned_loss=0.0796, over 4289394.01 frames. 
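Each train.py:996 record carries three numbers: simple_loss, from a simple linear joiner used to derive the pruning bounds; pruned_loss, the full-joiner RNN-T loss evaluated only inside the pruned lattice; and loss, their weighted sum. The triples in this log are consistent with a 0.5 weight on the simple loss (e.g. at batch 1200 above, 0.5 * 0.3096 + 0.0796 = 0.2344); the sketch below hard-codes that weight for illustration rather than reading it from the run's config.

def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    # Reported as "loss=..." alongside its two components.
    return simple_loss_scale * simple_loss + pruned_loss

print(combine_transducer_losses(0.3096, 0.0796))  # ~0.2344, as logged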
], batch size: 211, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:16:28,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1653912.0, ans=0.0 2023-06-24 04:17:06,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1654032.0, ans=0.1 2023-06-24 04:17:27,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1654092.0, ans=0.125 2023-06-24 04:17:28,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-24 04:17:52,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=12.0 2023-06-24 04:17:56,438 INFO [train.py:996] (3/4) Epoch 10, batch 1250, loss[loss=0.2379, simple_loss=0.3105, pruned_loss=0.08262, over 21496.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3111, pruned_loss=0.07976, over 4293721.23 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:18:19,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 6.637e+02 9.533e+02 1.248e+03 2.697e+03, threshold=1.907e+03, percent-clipped=13.0 2023-06-24 04:19:21,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1654452.0, ans=0.125 2023-06-24 04:19:34,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1654512.0, ans=0.2 2023-06-24 04:19:35,640 INFO [train.py:996] (3/4) Epoch 10, batch 1300, loss[loss=0.2115, simple_loss=0.2931, pruned_loss=0.0649, over 21473.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3123, pruned_loss=0.0793, over 4286417.31 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:19:49,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1654572.0, ans=0.0 2023-06-24 04:20:21,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-24 04:21:02,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1654752.0, ans=0.0 2023-06-24 04:21:08,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1654752.0, ans=0.05 2023-06-24 04:21:14,462 INFO [train.py:996] (3/4) Epoch 10, batch 1350, loss[loss=0.259, simple_loss=0.3373, pruned_loss=0.0904, over 21591.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3132, pruned_loss=0.07997, over 4292398.72 frames. 
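The tot_loss values are running averages in which every batch is weighted by its number of acoustic frames (the "over N frames" counts), so long utterances contribute more than short ones. A small sketch of that bookkeeping; icefall's exact windowing/reset behaviour is not reproduced here.

class FrameWeightedLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        # Accumulate loss * frames so the average is per-frame, not per-batch.
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def avg(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = FrameWeightedLoss()
tot.update(0.2379, 21496.0)   # illustrative per-batch numbers
tot.update(0.2590, 21591.0)
print(f"tot_loss[loss={tot.avg:.4f}, over {tot.frames:.2f} frames.]")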
], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:21:33,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1654872.0, ans=0.125 2023-06-24 04:21:42,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.930e+02 9.209e+02 1.385e+03 4.036e+03, threshold=1.842e+03, percent-clipped=12.0 2023-06-24 04:22:20,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1654992.0, ans=0.125 2023-06-24 04:22:23,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1654992.0, ans=0.2 2023-06-24 04:22:49,043 INFO [train.py:996] (3/4) Epoch 10, batch 1400, loss[loss=0.2389, simple_loss=0.3035, pruned_loss=0.08716, over 21416.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3119, pruned_loss=0.08041, over 4278663.12 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:22:49,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1655112.0, ans=0.125 2023-06-24 04:23:21,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1655172.0, ans=0.0 2023-06-24 04:23:23,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1655172.0, ans=0.2 2023-06-24 04:24:13,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-24 04:24:19,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1655352.0, ans=0.0 2023-06-24 04:24:28,472 INFO [train.py:996] (3/4) Epoch 10, batch 1450, loss[loss=0.2635, simple_loss=0.3314, pruned_loss=0.09779, over 21866.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3148, pruned_loss=0.08159, over 4281836.16 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:24:39,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1655412.0, ans=0.2 2023-06-24 04:24:56,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.216e+02 6.383e+02 1.021e+03 1.504e+03 2.934e+03, threshold=2.041e+03, percent-clipped=11.0 2023-06-24 04:26:01,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1655652.0, ans=0.125 2023-06-24 04:26:05,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=22.5 2023-06-24 04:26:07,440 INFO [train.py:996] (3/4) Epoch 10, batch 1500, loss[loss=0.2197, simple_loss=0.2939, pruned_loss=0.07276, over 21943.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3142, pruned_loss=0.08207, over 4289554.38 frames. 
], batch size: 333, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:26:12,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1655712.0, ans=12.0 2023-06-24 04:26:17,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1655712.0, ans=0.125 2023-06-24 04:27:43,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1655952.0, ans=0.125 2023-06-24 04:27:50,589 INFO [train.py:996] (3/4) Epoch 10, batch 1550, loss[loss=0.2345, simple_loss=0.3059, pruned_loss=0.08157, over 21624.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3127, pruned_loss=0.08194, over 4295957.07 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:28:24,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.706e+02 8.802e+02 1.256e+03 2.211e+03, threshold=1.760e+03, percent-clipped=1.0 2023-06-24 04:28:26,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1656072.0, ans=0.125 2023-06-24 04:29:15,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1656252.0, ans=0.0 2023-06-24 04:29:23,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=22.5 2023-06-24 04:29:35,766 INFO [train.py:996] (3/4) Epoch 10, batch 1600, loss[loss=0.2595, simple_loss=0.3301, pruned_loss=0.09447, over 21543.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3128, pruned_loss=0.08226, over 4292871.71 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:29:39,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1656312.0, ans=0.125 2023-06-24 04:29:50,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-24 04:30:06,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1656372.0, ans=0.1 2023-06-24 04:30:34,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1656432.0, ans=0.0 2023-06-24 04:30:40,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1656492.0, ans=0.015 2023-06-24 04:31:15,779 INFO [train.py:996] (3/4) Epoch 10, batch 1650, loss[loss=0.2032, simple_loss=0.2644, pruned_loss=0.07104, over 21283.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3097, pruned_loss=0.08149, over 4293659.79 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:31:44,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.141e+02 9.207e+02 1.280e+03 2.509e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-24 04:31:54,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1656672.0, ans=0.125 2023-06-24 04:32:26,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1656792.0, ans=0.125 2023-06-24 04:32:57,149 INFO [train.py:996] (3/4) Epoch 10, batch 1700, loss[loss=0.2673, simple_loss=0.3114, pruned_loss=0.1116, over 21256.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3117, pruned_loss=0.08208, over 4297915.65 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:32:57,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1656912.0, ans=0.1 2023-06-24 04:33:40,546 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:34:17,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1657092.0, ans=0.125 2023-06-24 04:34:20,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1657092.0, ans=0.0 2023-06-24 04:34:29,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=8.0 2023-06-24 04:34:45,113 INFO [train.py:996] (3/4) Epoch 10, batch 1750, loss[loss=0.273, simple_loss=0.341, pruned_loss=0.1025, over 21414.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3143, pruned_loss=0.08119, over 4290400.09 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:34:46,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1657212.0, ans=15.0 2023-06-24 04:34:47,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1657212.0, ans=0.125 2023-06-24 04:35:21,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 6.335e+02 9.144e+02 1.525e+03 4.256e+03, threshold=1.829e+03, percent-clipped=17.0 2023-06-24 04:35:30,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1657332.0, ans=0.5 2023-06-24 04:35:44,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1657392.0, ans=0.125 2023-06-24 04:35:46,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657392.0, ans=0.1 2023-06-24 04:36:05,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-24 04:36:15,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1657452.0, ans=15.0 2023-06-24 04:36:32,784 INFO [train.py:996] (3/4) Epoch 10, batch 1800, loss[loss=0.1823, simple_loss=0.2607, pruned_loss=0.052, over 21375.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3111, pruned_loss=0.07851, over 4280383.70 frames. 
], batch size: 211, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:36:34,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1657512.0, ans=0.125 2023-06-24 04:36:41,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-24 04:37:14,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-06-24 04:37:50,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1657752.0, ans=0.2 2023-06-24 04:38:13,201 INFO [train.py:996] (3/4) Epoch 10, batch 1850, loss[loss=0.2067, simple_loss=0.3042, pruned_loss=0.05459, over 21247.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.313, pruned_loss=0.07766, over 4280802.94 frames. ], batch size: 549, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:38:32,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1657872.0, ans=0.04949747468305833 2023-06-24 04:38:40,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1657872.0, ans=0.0 2023-06-24 04:38:43,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.154e+02 6.426e+02 1.042e+03 1.664e+03 4.444e+03, threshold=2.085e+03, percent-clipped=25.0 2023-06-24 04:39:41,464 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:39:52,057 INFO [train.py:996] (3/4) Epoch 10, batch 1900, loss[loss=0.2622, simple_loss=0.3625, pruned_loss=0.08093, over 21528.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3134, pruned_loss=0.07781, over 4287837.90 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:40:23,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1658172.0, ans=0.125 2023-06-24 04:40:25,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0 2023-06-24 04:40:46,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1658232.0, ans=0.125 2023-06-24 04:41:14,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1658352.0, ans=0.1 2023-06-24 04:41:31,873 INFO [train.py:996] (3/4) Epoch 10, batch 1950, loss[loss=0.2266, simple_loss=0.2869, pruned_loss=0.08319, over 21860.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3087, pruned_loss=0.07757, over 4288944.01 frames. ], batch size: 107, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:41:46,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. 
limit=15.0 2023-06-24 04:41:57,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1658472.0, ans=0.1 2023-06-24 04:41:57,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1658472.0, ans=0.125 2023-06-24 04:42:02,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.830e+02 7.074e+02 9.115e+02 1.415e+03 2.823e+03, threshold=1.823e+03, percent-clipped=5.0 2023-06-24 04:42:20,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1658532.0, ans=0.0 2023-06-24 04:42:32,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1658592.0, ans=0.125 2023-06-24 04:42:46,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=15.0 2023-06-24 04:43:12,630 INFO [train.py:996] (3/4) Epoch 10, batch 2000, loss[loss=0.2064, simple_loss=0.3018, pruned_loss=0.05547, over 21618.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3035, pruned_loss=0.07543, over 4286065.08 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:43:21,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1658712.0, ans=0.0 2023-06-24 04:43:51,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1658772.0, ans=0.0 2023-06-24 04:43:53,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1658832.0, ans=0.125 2023-06-24 04:43:53,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-24 04:44:51,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1659012.0, ans=0.015 2023-06-24 04:44:57,314 INFO [train.py:996] (3/4) Epoch 10, batch 2050, loss[loss=0.2678, simple_loss=0.3518, pruned_loss=0.09196, over 21704.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3073, pruned_loss=0.07564, over 4287426.94 frames. 
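The scaling.py:962 "Whitening" records compare a whiteness statistic of a module's output against a scheduled limit; only when the metric exceeds the limit does the Whiten module apply a corrective gradient pushing the feature covariance toward a multiple of the identity. One simple proxy for such a metric is sketched below: the mean squared eigenvalue of the channel covariance divided by the squared mean eigenvalue, which equals 1.0 for perfectly white features and grows as channels become correlated. This illustrates the idea and is not the literal icefall formula.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels). Returns >= 1.0; equal to 1.0 when the
    # channel covariance is a multiple of the identity.
    x = x - x.mean(dim=0)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

white = torch.randn(4000, 256)
mixed = white @ torch.randn(256, 256)     # correlates the channels
print(whitening_metric(white))            # close to 1.0
print(whitening_metric(mixed))            # clearly above 1.0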
], batch size: 351, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:45:17,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1659072.0, ans=0.025 2023-06-24 04:45:28,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.380e+02 1.174e+03 1.683e+03 3.998e+03, threshold=2.349e+03, percent-clipped=22.0 2023-06-24 04:45:33,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1659132.0, ans=0.5 2023-06-24 04:45:55,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1659192.0, ans=0.95 2023-06-24 04:46:03,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1659192.0, ans=0.5 2023-06-24 04:46:12,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1659192.0, ans=0.125 2023-06-24 04:46:37,781 INFO [train.py:996] (3/4) Epoch 10, batch 2100, loss[loss=0.2437, simple_loss=0.3129, pruned_loss=0.08729, over 21734.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3134, pruned_loss=0.07789, over 4282184.36 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:47:28,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1659432.0, ans=0.125 2023-06-24 04:48:13,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659552.0, ans=0.1 2023-06-24 04:48:18,001 INFO [train.py:996] (3/4) Epoch 10, batch 2150, loss[loss=0.219, simple_loss=0.3033, pruned_loss=0.06738, over 21180.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3145, pruned_loss=0.07933, over 4272714.77 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:48:18,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1659612.0, ans=0.125 2023-06-24 04:48:31,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1659612.0, ans=0.0 2023-06-24 04:48:48,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 6.485e+02 1.170e+03 1.690e+03 3.411e+03, threshold=2.340e+03, percent-clipped=8.0 2023-06-24 04:48:50,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1659672.0, ans=0.125 2023-06-24 04:49:15,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659792.0, ans=0.1 2023-06-24 04:49:22,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-24 04:49:53,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. 
limit=22.5 2023-06-24 04:49:56,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1659912.0, ans=0.2 2023-06-24 04:49:58,081 INFO [train.py:996] (3/4) Epoch 10, batch 2200, loss[loss=0.2028, simple_loss=0.2777, pruned_loss=0.06396, over 21357.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3145, pruned_loss=0.07899, over 4269226.42 frames. ], batch size: 194, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:50:26,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659972.0, ans=0.1 2023-06-24 04:51:07,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1660092.0, ans=0.0 2023-06-24 04:51:37,239 INFO [train.py:996] (3/4) Epoch 10, batch 2250, loss[loss=0.2271, simple_loss=0.3424, pruned_loss=0.05589, over 20751.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3111, pruned_loss=0.07677, over 4268340.07 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:52:08,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 6.896e+02 1.012e+03 1.519e+03 4.116e+03, threshold=2.025e+03, percent-clipped=5.0 2023-06-24 04:52:26,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1660332.0, ans=0.0 2023-06-24 04:53:15,531 INFO [train.py:996] (3/4) Epoch 10, batch 2300, loss[loss=0.2069, simple_loss=0.2762, pruned_loss=0.06873, over 21797.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3047, pruned_loss=0.07611, over 4276400.90 frames. ], batch size: 98, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:53:43,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1660572.0, ans=0.0 2023-06-24 04:53:59,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-24 04:54:55,355 INFO [train.py:996] (3/4) Epoch 10, batch 2350, loss[loss=0.2824, simple_loss=0.3499, pruned_loss=0.1074, over 21882.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3044, pruned_loss=0.07681, over 4269969.47 frames. ], batch size: 372, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:55:32,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.196e+02 7.285e+02 1.033e+03 1.548e+03 3.497e+03, threshold=2.065e+03, percent-clipped=14.0 2023-06-24 04:55:55,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1660932.0, ans=0.125 2023-06-24 04:55:56,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-24 04:56:34,563 INFO [train.py:996] (3/4) Epoch 10, batch 2400, loss[loss=0.1961, simple_loss=0.2536, pruned_loss=0.06934, over 21446.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3064, pruned_loss=0.07927, over 4271167.30 frames. ], batch size: 212, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:56:53,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 04:57:51,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.19 vs. 
limit=10.0 2023-06-24 04:58:08,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1661352.0, ans=15.0 2023-06-24 04:58:18,977 INFO [train.py:996] (3/4) Epoch 10, batch 2450, loss[loss=0.2499, simple_loss=0.3044, pruned_loss=0.09771, over 21153.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3095, pruned_loss=0.08058, over 4269035.81 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:58:25,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1661412.0, ans=0.125 2023-06-24 04:58:46,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661472.0, ans=0.1 2023-06-24 04:58:50,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 7.393e+02 1.203e+03 1.868e+03 3.512e+03, threshold=2.405e+03, percent-clipped=21.0 2023-06-24 04:58:56,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1661532.0, ans=0.125 2023-06-24 04:59:17,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-24 04:59:18,856 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:59:58,050 INFO [train.py:996] (3/4) Epoch 10, batch 2500, loss[loss=0.2246, simple_loss=0.2948, pruned_loss=0.07723, over 21829.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3079, pruned_loss=0.08067, over 4275826.40 frames. ], batch size: 107, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:00:19,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-24 05:00:29,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=15.0 2023-06-24 05:00:55,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1661892.0, ans=0.0 2023-06-24 05:01:39,203 INFO [train.py:996] (3/4) Epoch 10, batch 2550, loss[loss=0.2395, simple_loss=0.3295, pruned_loss=0.07474, over 21092.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3075, pruned_loss=0.08012, over 4273785.80 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:02:11,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. 
limit=10.0 2023-06-24 05:02:11,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.144e+02 9.794e+02 1.361e+03 2.807e+03, threshold=1.959e+03, percent-clipped=4.0 2023-06-24 05:02:12,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1662072.0, ans=0.125 2023-06-24 05:02:15,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1662132.0, ans=0.125 2023-06-24 05:02:18,990 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:02:41,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1662192.0, ans=0.0 2023-06-24 05:02:43,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-24 05:03:00,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1662252.0, ans=0.125 2023-06-24 05:03:02,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1662252.0, ans=0.125 2023-06-24 05:03:17,553 INFO [train.py:996] (3/4) Epoch 10, batch 2600, loss[loss=0.2382, simple_loss=0.3224, pruned_loss=0.07704, over 21523.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3071, pruned_loss=0.08079, over 4270583.87 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:03:21,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662312.0, ans=0.1 2023-06-24 05:03:34,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662312.0, ans=0.1 2023-06-24 05:03:46,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-24 05:03:55,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1662432.0, ans=0.125 2023-06-24 05:04:58,966 INFO [train.py:996] (3/4) Epoch 10, batch 2650, loss[loss=0.2599, simple_loss=0.3364, pruned_loss=0.09169, over 21851.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3069, pruned_loss=0.0811, over 4271253.16 frames. 
], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:05:04,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1662612.0, ans=0.125 2023-06-24 05:05:19,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1662672.0, ans=0.0 2023-06-24 05:05:26,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1662672.0, ans=0.025 2023-06-24 05:05:32,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 7.187e+02 9.516e+02 1.311e+03 3.015e+03, threshold=1.903e+03, percent-clipped=11.0 2023-06-24 05:05:46,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1662732.0, ans=0.0 2023-06-24 05:05:49,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662732.0, ans=0.125 2023-06-24 05:05:50,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1662732.0, ans=0.125 2023-06-24 05:06:18,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1662852.0, ans=0.125 2023-06-24 05:06:22,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 05:06:40,807 INFO [train.py:996] (3/4) Epoch 10, batch 2700, loss[loss=0.2127, simple_loss=0.2818, pruned_loss=0.0718, over 21795.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3063, pruned_loss=0.08108, over 4273973.38 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:07:26,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-24 05:07:52,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1663092.0, ans=0.0 2023-06-24 05:08:21,429 INFO [train.py:996] (3/4) Epoch 10, batch 2750, loss[loss=0.2373, simple_loss=0.3038, pruned_loss=0.0854, over 21803.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3082, pruned_loss=0.08115, over 4277234.19 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:08:42,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1663272.0, ans=0.0 2023-06-24 05:08:54,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.097e+02 1.070e+03 1.539e+03 2.944e+03, threshold=2.139e+03, percent-clipped=11.0 2023-06-24 05:09:12,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-24 05:10:05,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1663512.0, ans=0.2 2023-06-24 05:10:07,006 INFO [train.py:996] (3/4) Epoch 10, batch 2800, loss[loss=0.2554, simple_loss=0.3267, pruned_loss=0.09205, over 21669.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3137, pruned_loss=0.0824, over 4278371.43 frames. 
], batch size: 230, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:11:01,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1663632.0, ans=0.125 2023-06-24 05:11:47,573 INFO [train.py:996] (3/4) Epoch 10, batch 2850, loss[loss=0.3264, simple_loss=0.3825, pruned_loss=0.1352, over 21485.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3145, pruned_loss=0.0838, over 4276905.17 frames. ], batch size: 508, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:11:54,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-24 05:12:27,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 7.888e+02 1.288e+03 1.995e+03 6.558e+03, threshold=2.577e+03, percent-clipped=20.0 2023-06-24 05:12:41,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 05:12:45,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1663992.0, ans=0.2 2023-06-24 05:13:01,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1663992.0, ans=0.1 2023-06-24 05:13:04,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1664052.0, ans=0.1 2023-06-24 05:13:24,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1664052.0, ans=0.07 2023-06-24 05:13:27,317 INFO [train.py:996] (3/4) Epoch 10, batch 2900, loss[loss=0.2112, simple_loss=0.2815, pruned_loss=0.07043, over 21809.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.312, pruned_loss=0.08324, over 4284452.63 frames. ], batch size: 247, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:13:27,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1664112.0, ans=0.125 2023-06-24 05:13:35,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. 
limit=6.0 2023-06-24 05:13:37,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1664112.0, ans=0.0 2023-06-24 05:13:56,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1664172.0, ans=0.125 2023-06-24 05:14:29,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1664292.0, ans=0.125 2023-06-24 05:14:30,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1664292.0, ans=0.125 2023-06-24 05:14:32,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1664292.0, ans=0.125 2023-06-24 05:15:00,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1664352.0, ans=0.015 2023-06-24 05:15:03,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1664352.0, ans=0.2 2023-06-24 05:15:03,732 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-24 05:15:05,613 INFO [train.py:996] (3/4) Epoch 10, batch 2950, loss[loss=0.2075, simple_loss=0.2906, pruned_loss=0.06221, over 21289.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.312, pruned_loss=0.0831, over 4288476.99 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:15:10,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1664412.0, ans=0.0 2023-06-24 05:15:35,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1664472.0, ans=0.0 2023-06-24 05:15:45,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 6.745e+02 8.632e+02 1.337e+03 3.191e+03, threshold=1.726e+03, percent-clipped=2.0 2023-06-24 05:15:46,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-24 05:16:01,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1664592.0, ans=0.125 2023-06-24 05:16:17,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-24 05:16:25,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1664652.0, ans=0.0 2023-06-24 05:16:44,598 INFO [train.py:996] (3/4) Epoch 10, batch 3000, loss[loss=0.2578, simple_loss=0.3297, pruned_loss=0.09291, over 21603.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3159, pruned_loss=0.08403, over 4288093.47 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:16:44,599 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 05:17:00,554 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2505, simple_loss=0.3452, pruned_loss=0.07794, over 1796401.00 frames. 
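At train.py:1019/1028 the loop pauses to run the dev set; the frame total is identical at every validation (1796401.00) because the same dev cuts are evaluated each time. A sketch of such a frame-weighted validation pass; the batch layout and forward call are placeholders, not the recipe's actual interfaces.

import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, device="cuda"):
    model.eval()
    loss_sum, frames = 0.0, 0.0
    for batch in valid_loader:
        feats = batch["features"].to(device)          # assumed batch layout
        num_frames = float(feats.shape[0] * feats.shape[1])
        loss_sum += model(feats).item() * num_frames  # placeholder forward
        frames += num_frames
    model.train()
    return loss_sum / frames, frames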
2023-06-24 05:17:33,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1664772.0, ans=0.1
2023-06-24 05:18:15,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1664892.0, ans=0.0
2023-06-24 05:18:45,523 INFO [train.py:996] (3/4) Epoch 10, batch 3050, loss[loss=0.1829, simple_loss=0.259, pruned_loss=0.05345, over 21154.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3169, pruned_loss=0.08221, over 4291133.19 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:19:04,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1665012.0, ans=0.125
2023-06-24 05:19:21,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.265e+02 9.515e+02 1.393e+03 2.651e+03, threshold=1.903e+03, percent-clipped=13.0
2023-06-24 05:20:18,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0
2023-06-24 05:20:25,941 INFO [train.py:996] (3/4) Epoch 10, batch 3100, loss[loss=0.1992, simple_loss=0.291, pruned_loss=0.05374, over 21394.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3161, pruned_loss=0.08093, over 4289712.56 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:21:30,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665492.0, ans=0.1
2023-06-24 05:22:10,427 INFO [train.py:996] (3/4) Epoch 10, batch 3150, loss[loss=0.1945, simple_loss=0.2768, pruned_loss=0.05609, over 21652.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3162, pruned_loss=0.08109, over 4283594.09 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 8.0
2023-06-24 05:22:55,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.727e+02 1.146e+03 1.592e+03 4.239e+03, threshold=2.292e+03, percent-clipped=10.0
2023-06-24 05:23:53,256 INFO [train.py:996] (3/4) Epoch 10, batch 3200, loss[loss=0.1788, simple_loss=0.2641, pruned_loss=0.04674, over 21264.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3177, pruned_loss=0.08133, over 4281721.85 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:24:27,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1665972.0, ans=0.0
2023-06-24 05:24:41,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1666032.0, ans=0.125
2023-06-24 05:25:00,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1666092.0, ans=0.125
2023-06-24 05:25:33,999 INFO [train.py:996] (3/4) Epoch 10, batch 3250, loss[loss=0.2653, simple_loss=0.3282, pruned_loss=0.1012, over 21186.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.318, pruned_loss=0.08267, over 4287422.89 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0
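In each `Epoch 10, batch N, loss[...], tot_loss[...]` entry, `loss[...]` holds the current batch's losses and `tot_loss[...]` a running average over recent frames. The three numbers are consistent with `loss = 0.5 * simple_loss + pruned_loss`, the usual weighting of the simple (trivial-joiner) and pruned (full-joiner) parts of a pruned-RNN-T objective: for batch 2850 above, 0.5 * 0.3825 + 0.1352 = 0.3264, exactly the logged `loss`. The scale 0.5 is inferred from the logged numbers, not confirmed from the code.

```python
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Weighting consistent with the logged loss fields; the 0.5 scale is
    an inference from the numbers in this log, not a confirmed constant."""
    return simple_loss_scale * simple_loss + pruned_loss

# batch 2850 in the log: loss=0.3264, simple_loss=0.3825, pruned_loss=0.1352
assert abs(combined_loss(0.3825, 0.1352) - 0.3264) < 1e-3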
2023-06-24 05:26:16,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 5.714e+02 8.207e+02 1.474e+03 3.383e+03, threshold=1.641e+03, percent-clipped=8.0
2023-06-24 05:26:26,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1666332.0, ans=0.2
2023-06-24 05:26:47,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1666392.0, ans=0.2
2023-06-24 05:26:48,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0
2023-06-24 05:27:15,028 INFO [train.py:996] (3/4) Epoch 10, batch 3300, loss[loss=0.2472, simple_loss=0.344, pruned_loss=0.07518, over 21309.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3152, pruned_loss=0.08214, over 4279572.74 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:27:17,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=12.0
2023-06-24 05:27:54,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1666572.0, ans=0.125
2023-06-24 05:27:58,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=22.5
2023-06-24 05:28:55,169 INFO [train.py:996] (3/4) Epoch 10, batch 3350, loss[loss=0.2436, simple_loss=0.3149, pruned_loss=0.08617, over 21720.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3175, pruned_loss=0.083, over 4282597.28 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:29:15,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1666812.0, ans=0.04949747468305833
2023-06-24 05:29:42,336 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.528e+02 7.146e+02 1.056e+03 1.768e+03 3.632e+03, threshold=2.111e+03, percent-clipped=30.0
2023-06-24 05:30:38,687 INFO [train.py:996] (3/4) Epoch 10, batch 3400, loss[loss=0.2245, simple_loss=0.3079, pruned_loss=0.07059, over 21532.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3198, pruned_loss=0.08414, over 4285533.82 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:31:01,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1667172.0, ans=0.0
2023-06-24 05:31:22,314 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 05:31:50,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1667292.0, ans=0.015
2023-06-24 05:32:18,950 INFO [train.py:996] (3/4) Epoch 10, batch 3450, loss[loss=0.2005, simple_loss=0.2731, pruned_loss=0.064, over 16319.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3149, pruned_loss=0.08316, over 4274675.65 frames. ], batch size: 62, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:32:44,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1667472.0, ans=0.0
2023-06-24 05:32:53,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0
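The ubiquitous `ScheduledFloat: name=..., batch_count=..., ans=...` entries record the value (`ans`) that a scheduled hyperparameter takes at the current global batch count: dropout rates, skip rates, balancer probabilities and similar knobs vary with training progress, and modules occasionally log the value they are using. A minimal sketch of such a piecewise-linear schedule follows; the class name and break-points are illustrative assumptions, not the actual ScheduledFloat implementation.

```python
class ScheduledFloatSketch:
    """Piecewise-linear schedule over the global batch count.

    schedule: (batch_count, value) pairs; outside the given range the
    value is clamped to the end points."""

    def __init__(self, *schedule: tuple):
        self.schedule = sorted(schedule)

    def value(self, batch_count: float) -> float:
        pts = self.schedule
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches
# and stays at 0.1 afterwards (break-points invented for illustration):
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(1663992.0))  # -> 0.1, cf. the dropout_p entries (ans=0.1)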
2023-06-24 05:33:00,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 9.142e+02 1.242e+03 1.836e+03 3.790e+03, threshold=2.483e+03, percent-clipped=19.0
2023-06-24 05:33:01,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0
2023-06-24 05:33:56,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1667652.0, ans=0.0
2023-06-24 05:34:02,166 INFO [train.py:996] (3/4) Epoch 10, batch 3500, loss[loss=0.2782, simple_loss=0.3474, pruned_loss=0.1045, over 21657.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3219, pruned_loss=0.08619, over 4280279.96 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:34:29,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667772.0, ans=0.1
2023-06-24 05:35:06,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1667892.0, ans=0.125
2023-06-24 05:35:41,127 INFO [train.py:996] (3/4) Epoch 10, batch 3550, loss[loss=0.2567, simple_loss=0.3121, pruned_loss=0.1006, over 21307.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3234, pruned_loss=0.08797, over 4284426.09 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:35:43,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1668012.0, ans=0.125
2023-06-24 05:35:55,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1668012.0, ans=0.0
2023-06-24 05:36:03,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1668072.0, ans=0.0
2023-06-24 05:36:17,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1668072.0, ans=0.0
2023-06-24 05:36:22,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 8.129e+02 1.130e+03 1.802e+03 3.924e+03, threshold=2.259e+03, percent-clipped=11.0
2023-06-24 05:36:26,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1668132.0, ans=0.2
2023-06-24 05:36:32,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1668132.0, ans=0.2
2023-06-24 05:37:06,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1668252.0, ans=0.125
2023-06-24 05:37:20,190 INFO [train.py:996] (3/4) Epoch 10, batch 3600, loss[loss=0.2619, simple_loss=0.3284, pruned_loss=0.09768, over 21231.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3172, pruned_loss=0.08682, over 4280546.12 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 32.0
2023-06-24 05:37:20,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668312.0, ans=0.1
2023-06-24 05:38:08,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.57 vs. limit=15.0
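The `[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...` entries summarize the distribution of recently observed gradient norms (min, 25%, median, 75%, max). In every such entry in this log the threshold equals Clipping_scale times the median, e.g. 2.0 * 1.242e+03 ≈ 2.483e+03 just above, so gradients are evidently clipped to a multiple of a running median norm. A sketch of that scheme follows; the bookkeeping details (window size, cumulative vs. per-interval clip counting) are assumptions.

```python
from collections import deque
import torch

class MedianGradClipper:
    """Clip gradients to clipping_scale * median of recent gradient norms,
    and report statistics in the style of the log entries above."""

    def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)
        self.num_steps = 0
        self.num_clipped = 0

    def clip_(self, parameters) -> None:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(
            torch.stack([p.grad.detach().norm() for p in params])).item()
        self.norms.append(norm)
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # 2.0 x median
        self.num_steps += 1
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        # percent-clipped is cumulative here; the log may reset it per interval
        print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.num_clipped / self.num_steps:.1f}")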
2023-06-24 05:38:16,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.44 vs. limit=22.5
2023-06-24 05:38:36,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1668492.0, ans=0.125
2023-06-24 05:38:54,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1668552.0, ans=15.0
2023-06-24 05:39:01,651 INFO [train.py:996] (3/4) Epoch 10, batch 3650, loss[loss=0.2403, simple_loss=0.2916, pruned_loss=0.09453, over 21240.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3171, pruned_loss=0.08642, over 4283173.39 frames. ], batch size: 471, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:39:22,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1668672.0, ans=0.125
2023-06-24 05:39:24,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.69 vs. limit=22.5
2023-06-24 05:39:43,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 6.332e+02 8.468e+02 1.461e+03 3.139e+03, threshold=1.694e+03, percent-clipped=4.0
2023-06-24 05:40:06,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1668792.0, ans=0.125
2023-06-24 05:40:39,693 INFO [train.py:996] (3/4) Epoch 10, batch 3700, loss[loss=0.2682, simple_loss=0.3556, pruned_loss=0.09039, over 21008.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3171, pruned_loss=0.08567, over 4286886.78 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:41:23,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669032.0, ans=0.1
2023-06-24 05:41:41,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0
2023-06-24 05:42:00,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1669152.0, ans=0.125
2023-06-24 05:42:20,637 INFO [train.py:996] (3/4) Epoch 10, batch 3750, loss[loss=0.2238, simple_loss=0.3093, pruned_loss=0.06916, over 21705.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.317, pruned_loss=0.08531, over 4286797.71 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0
2023-06-24 05:42:50,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1669272.0, ans=0.2
2023-06-24 05:43:00,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 7.119e+02 9.667e+02 1.381e+03 3.413e+03, threshold=1.933e+03, percent-clipped=11.0
2023-06-24 05:43:03,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1669332.0, ans=0.125
2023-06-24 05:43:34,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1669452.0, ans=0.125
2023-06-24 05:44:00,635 INFO [train.py:996] (3/4) Epoch 10, batch 3800, loss[loss=0.2307, simple_loss=0.3026, pruned_loss=0.07942, over 20055.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3156, pruned_loss=0.08413, over 4279681.58 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0
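The `lr` field decays very slowly here, ticking from 3.00e-03 to 2.99e-03 at batch 3800 above. The logged magnitude is consistent with an inverse-fourth-root schedule in the squared step count; the sketch below reproduces it with assumed constants (a base LR of 0.045 and a decay scale of 7500 batches), though the recipe's actual scheduler may also include an epoch-dependent factor.

```python
def eden_like_lr(step: float, base_lr: float = 0.045,
                 lr_batches: float = 7500.0) -> float:
    """Inverse-fourth-root decay in the squared step count; the constants
    are assumptions chosen to reproduce the logged values."""
    return base_lr * ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25

print(f"{eden_like_lr(1.6695e6):.2e}")  # ~3.0e-03, close to the logged lr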
2023-06-24 05:44:08,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1669512.0, ans=0.2
2023-06-24 05:45:34,314 INFO [train.py:996] (3/4) Epoch 10, batch 3850, loss[loss=0.1978, simple_loss=0.2629, pruned_loss=0.06634, over 21357.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3122, pruned_loss=0.08435, over 4284915.14 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 05:45:36,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.28 vs. limit=22.5
2023-06-24 05:45:51,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1669872.0, ans=0.125
2023-06-24 05:46:06,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.682e+02 9.971e+02 1.611e+03 3.519e+03, threshold=1.994e+03, percent-clipped=16.0
2023-06-24 05:46:11,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1669932.0, ans=0.125
2023-06-24 05:46:53,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1670052.0, ans=0.04949747468305833
2023-06-24 05:46:53,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1670052.0, ans=0.2
2023-06-24 05:46:54,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1670052.0, ans=0.0
2023-06-24 05:47:06,822 INFO [train.py:996] (3/4) Epoch 10, batch 3900, loss[loss=0.2397, simple_loss=0.3111, pruned_loss=0.08421, over 21802.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3075, pruned_loss=0.08396, over 4290923.33 frames. ], batch size: 391, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 05:47:07,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670112.0, ans=0.1
2023-06-24 05:47:33,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1670172.0, ans=0.125
2023-06-24 05:47:35,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1670172.0, ans=0.2
2023-06-24 05:47:58,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1670232.0, ans=0.0
2023-06-24 05:48:14,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1670292.0, ans=22.5
2023-06-24 05:48:51,303 INFO [train.py:996] (3/4) Epoch 10, batch 3950, loss[loss=0.207, simple_loss=0.302, pruned_loss=0.05598, over 21615.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3097, pruned_loss=0.08274, over 4292008.51 frames. ], batch size: 389, lr: 2.99e-03, grad_scale: 16.0
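The `grad_scale` field (8.0, 16.0 and 32.0 at different points in this log) is the dynamic loss-scaling factor of fp16 mixed-precision training: it is halved when inf/nan gradients are detected and grown back after a run of stable steps, which explains the oscillation visible here. PyTorch's GradScaler implements exactly this policy; below is a self-contained sketch (the constants and the linear toy model are illustrative, and the recipe's own wrapper may differ). It requires a CUDA device.

```python
import torch

model = torch.nn.Linear(80, 512).cuda()
opt = torch.optim.SGD(model.parameters(), lr=3e-3)
# Halve the scale on overflow, double it after `growth_interval` clean steps,
# matching the grad_scale pattern seen in the log.
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

for _ in range(10):
    x = torch.randn(4, 80, device="cuda")
    with torch.cuda.amp.autocast():       # forward in fp16 where safe
        loss = model(x).pow(2).mean()
    opt.zero_grad()
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(opt)                      # unscales, skips step on inf/nan
    scaler.update()                       # adjusts the scale factor
    print("grad_scale:", scaler.get_scale())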
2023-06-24 05:49:14,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1670472.0, ans=0.125
2023-06-24 05:49:16,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1670472.0, ans=0.125
2023-06-24 05:49:28,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.660e+02 7.100e+02 1.207e+03 1.862e+03 3.460e+03, threshold=2.413e+03, percent-clipped=21.0
2023-06-24 05:50:19,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0
2023-06-24 05:50:24,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670652.0, ans=0.1
2023-06-24 05:50:26,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1670652.0, ans=0.2
2023-06-24 05:50:29,173 INFO [train.py:996] (3/4) Epoch 10, batch 4000, loss[loss=0.2156, simple_loss=0.275, pruned_loss=0.07812, over 21554.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.303, pruned_loss=0.07929, over 4281322.59 frames. ], batch size: 247, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 05:50:33,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0
2023-06-24 05:50:52,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1670772.0, ans=0.0
2023-06-24 05:51:35,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670892.0, ans=0.1
2023-06-24 05:51:35,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.04 vs. limit=6.0
2023-06-24 05:51:56,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1670952.0, ans=22.5
2023-06-24 05:52:01,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1670952.0, ans=0.125
2023-06-24 05:52:09,789 INFO [train.py:996] (3/4) Epoch 10, batch 4050, loss[loss=0.211, simple_loss=0.2758, pruned_loss=0.07305, over 16838.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3013, pruned_loss=0.0776, over 4264504.18 frames. ], batch size: 61, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 05:52:14,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0
2023-06-24 05:52:41,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1671072.0, ans=0.125
2023-06-24 05:52:45,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5
2023-06-24 05:52:51,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.639e+02 6.579e+02 8.856e+02 1.407e+03 2.917e+03, threshold=1.771e+03, percent-clipped=4.0
2023-06-24 05:53:12,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1671192.0, ans=0.5
2023-06-24 05:53:49,042 INFO [train.py:996] (3/4) Epoch 10, batch 4100, loss[loss=0.2495, simple_loss=0.3383, pruned_loss=0.0803, over 21664.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3022, pruned_loss=0.07675, over 4276020.10 frames. ], batch size: 389, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 05:54:43,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1671432.0, ans=0.125
2023-06-24 05:54:45,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1671432.0, ans=0.0
2023-06-24 05:54:45,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0
2023-06-24 05:54:56,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1671492.0, ans=0.0
2023-06-24 05:54:56,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0
2023-06-24 05:55:11,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1671552.0, ans=0.125
2023-06-24 05:55:13,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1671552.0, ans=0.0
2023-06-24 05:55:27,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1671612.0, ans=0.0
2023-06-24 05:55:28,589 INFO [train.py:996] (3/4) Epoch 10, batch 4150, loss[loss=0.2466, simple_loss=0.3238, pruned_loss=0.08471, over 21720.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3026, pruned_loss=0.07473, over 4270833.92 frames. ], batch size: 333, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 05:55:33,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1671612.0, ans=0.0
2023-06-24 05:56:15,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1671732.0, ans=0.0
2023-06-24 05:56:17,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.615e+02 8.834e+02 1.095e+03 2.475e+03, threshold=1.767e+03, percent-clipped=7.0
2023-06-24 05:56:18,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1671732.0, ans=0.125
2023-06-24 05:56:20,102 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 05:56:22,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0
2023-06-24 05:56:33,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1671792.0, ans=0.0
2023-06-24 05:56:33,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1671792.0, ans=0.2
2023-06-24 05:56:35,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0
2023-06-24 05:56:48,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1671792.0, ans=0.1
2023-06-24 05:56:53,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1671852.0, ans=0.0
2023-06-24 05:57:05,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1671852.0, ans=0.2
2023-06-24 05:57:09,374 INFO [train.py:996] (3/4) Epoch 10, batch 4200, loss[loss=0.2082, simple_loss=0.28, pruned_loss=0.0682, over 21555.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3038, pruned_loss=0.07495, over 4262025.32 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 05:57:09,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1671912.0, ans=0.125
2023-06-24 05:58:37,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672152.0, ans=0.1
2023-06-24 05:58:57,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672152.0, ans=0.1
2023-06-24 05:59:00,776 INFO [train.py:996] (3/4) Epoch 10, batch 4250, loss[loss=0.2251, simple_loss=0.3064, pruned_loss=0.0719, over 21387.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3097, pruned_loss=0.07709, over 4257900.18 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 05:59:13,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1672212.0, ans=0.0
2023-06-24 05:59:39,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1672332.0, ans=0.0
2023-06-24 05:59:40,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 7.026e+02 9.985e+02 1.582e+03 3.548e+03, threshold=1.997e+03, percent-clipped=19.0
2023-06-24 05:59:54,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672332.0, ans=0.1
2023-06-24 06:00:40,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1672452.0, ans=0.0
2023-06-24 06:00:42,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0
2023-06-24 06:00:43,115 INFO [train.py:996] (3/4) Epoch 10, batch 4300, loss[loss=0.3046, simple_loss=0.3964, pruned_loss=0.1065, over 21468.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3157, pruned_loss=0.07864, over 4261904.31 frames. ], batch size: 507, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:01:25,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672632.0, ans=0.1
2023-06-24 06:01:37,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1672632.0, ans=0.125
2023-06-24 06:01:37,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1672632.0, ans=0.0
2023-06-24 06:01:48,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1672692.0, ans=0.2
2023-06-24 06:02:26,287 INFO [train.py:996] (3/4) Epoch 10, batch 4350, loss[loss=0.2194, simple_loss=0.2775, pruned_loss=0.08063, over 21222.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3167, pruned_loss=0.07828, over 4261011.30 frames. ], batch size: 144, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:02:37,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1672812.0, ans=0.2
2023-06-24 06:03:05,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 6.687e+02 1.042e+03 1.785e+03 5.548e+03, threshold=2.083e+03, percent-clipped=20.0
2023-06-24 06:03:37,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1672992.0, ans=0.07
2023-06-24 06:03:40,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1672992.0, ans=0.0
2023-06-24 06:03:56,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1673052.0, ans=0.125
2023-06-24 06:04:01,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1673052.0, ans=0.125
2023-06-24 06:04:01,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1673052.0, ans=0.0
2023-06-24 06:04:04,308 INFO [train.py:996] (3/4) Epoch 10, batch 4400, loss[loss=0.2198, simple_loss=0.3091, pruned_loss=0.06522, over 21316.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3131, pruned_loss=0.07798, over 4261805.18 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:04:47,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673232.0, ans=0.1
2023-06-24 06:04:56,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1673232.0, ans=0.125
2023-06-24 06:05:17,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1673292.0, ans=0.125
2023-06-24 06:05:45,291 INFO [train.py:996] (3/4) Epoch 10, batch 4450, loss[loss=0.2495, simple_loss=0.3352, pruned_loss=0.08193, over 21369.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3204, pruned_loss=0.07933, over 4256604.64 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:06:09,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1673472.0, ans=0.0
2023-06-24 06:06:27,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1673532.0, ans=0.125
2023-06-24 06:06:30,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.408e+02 1.039e+03 1.368e+03 2.536e+03, threshold=2.077e+03, percent-clipped=7.0
2023-06-24 06:07:12,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1673652.0, ans=0.1
2023-06-24 06:07:14,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1673652.0, ans=0.1
2023-06-24 06:07:23,218 INFO [train.py:996] (3/4) Epoch 10, batch 4500, loss[loss=0.2683, simple_loss=0.3349, pruned_loss=0.1008, over 21878.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3208, pruned_loss=0.08135, over 4260230.79 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:07:38,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1673712.0, ans=0.125
2023-06-24 06:08:19,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1673832.0, ans=0.0
2023-06-24 06:08:41,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1673892.0, ans=0.125
2023-06-24 06:08:55,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5
2023-06-24 06:09:08,452 INFO [train.py:996] (3/4) Epoch 10, batch 4550, loss[loss=0.2513, simple_loss=0.3346, pruned_loss=0.08404, over 21751.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3252, pruned_loss=0.08291, over 4267144.84 frames. ], batch size: 332, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:09:34,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1674072.0, ans=0.125
2023-06-24 06:09:55,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.449e+02 7.044e+02 9.475e+02 1.574e+03 2.834e+03, threshold=1.895e+03, percent-clipped=10.0
2023-06-24 06:09:55,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1674132.0, ans=0.05
2023-06-24 06:10:28,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0
2023-06-24 06:10:49,453 INFO [train.py:996] (3/4) Epoch 10, batch 4600, loss[loss=0.3758, simple_loss=0.4966, pruned_loss=0.1275, over 19716.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3297, pruned_loss=0.08528, over 4271441.66 frames. ], batch size: 702, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:11:33,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1674432.0, ans=0.0
2023-06-24 06:12:01,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1674492.0, ans=0.0
2023-06-24 06:12:06,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1674552.0, ans=0.0
2023-06-24 06:12:15,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1674552.0, ans=0.125
2023-06-24 06:12:27,161 INFO [train.py:996] (3/4) Epoch 10, batch 4650, loss[loss=0.1872, simple_loss=0.2661, pruned_loss=0.05414, over 21796.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.322, pruned_loss=0.08346, over 4285703.98 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:13:02,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1674672.0, ans=0.1
2023-06-24 06:13:05,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1674672.0, ans=0.015
2023-06-24 06:13:18,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 6.703e+02 9.514e+02 1.360e+03 2.442e+03, threshold=1.903e+03, percent-clipped=9.0
2023-06-24 06:13:40,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1674792.0, ans=0.125
2023-06-24 06:13:53,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1674852.0, ans=0.1
2023-06-24 06:14:05,738 INFO [train.py:996] (3/4) Epoch 10, batch 4700, loss[loss=0.2582, simple_loss=0.2996, pruned_loss=0.1084, over 21441.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3118, pruned_loss=0.0808, over 4272819.55 frames. ], batch size: 509, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:14:32,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1674972.0, ans=0.2
2023-06-24 06:14:46,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1675032.0, ans=0.125
2023-06-24 06:15:11,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1675092.0, ans=0.0
2023-06-24 06:15:15,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1675092.0, ans=0.125
2023-06-24 06:15:19,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1675092.0, ans=0.0
2023-06-24 06:15:44,549 INFO [train.py:996] (3/4) Epoch 10, batch 4750, loss[loss=0.2378, simple_loss=0.3083, pruned_loss=0.08361, over 21765.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3063, pruned_loss=0.0812, over 4274995.15 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:15:49,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1675212.0, ans=0.035
2023-06-24 06:15:58,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1675212.0, ans=0.125
2023-06-24 06:16:24,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1675332.0, ans=0.0
2023-06-24 06:16:25,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1675332.0, ans=0.0
2023-06-24 06:16:35,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.386e+02 6.636e+02 9.717e+02 1.456e+03 3.310e+03, threshold=1.943e+03, percent-clipped=9.0
2023-06-24 06:16:51,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1675392.0, ans=0.125
2023-06-24 06:17:07,865 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:17:26,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1675512.0, ans=0.125
2023-06-24 06:17:27,575 INFO [train.py:996] (3/4) Epoch 10, batch 4800, loss[loss=0.1988, simple_loss=0.281, pruned_loss=0.05833, over 21269.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3069, pruned_loss=0.08077, over 4273189.74 frames. ], batch size: 159, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:17:37,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1675512.0, ans=0.0
2023-06-24 06:17:51,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1675572.0, ans=0.125
2023-06-24 06:17:55,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0
2023-06-24 06:18:21,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1675632.0, ans=0.0
2023-06-24 06:18:25,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0
2023-06-24 06:18:48,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1675752.0, ans=0.05
2023-06-24 06:18:50,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0
2023-06-24 06:18:52,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5
2023-06-24 06:19:05,391 INFO [train.py:996] (3/4) Epoch 10, batch 4850, loss[loss=0.2497, simple_loss=0.3284, pruned_loss=0.08546, over 20839.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3055, pruned_loss=0.07998, over 4271776.59 frames. ], batch size: 608, lr: 2.99e-03, grad_scale: 16.0
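The balancer entries (`...balancer1.prob`, `...balancer2.min_positive`, `...min_abs`, `...max_abs`) come from modules that keep per-channel activation statistics in a target range: min_positive/max_positive bound the fraction of positive values, min_abs/max_abs the mean absolute value, and prob is the probability that the correction is applied on a given step. Below is a toy sketch of the forward-identity, gradient-nudging idea; it is a simplification for illustration, not scaling.py's implementation.

```python
import random
import torch

class BalancerSketch(torch.autograd.Function):
    """Identity in forward; in backward, adds a small term pushing each
    channel's proportion of positive values toward
    [min_positive, max_positive].  A toy version of the idea only; the
    real module also constrains mean absolute values (min_abs/max_abs)."""

    @staticmethod
    def forward(ctx, x, min_positive, max_positive, grad_scale):
        ctx.save_for_backward(x)
        ctx.cfg = (min_positive, max_positive, grad_scale)
        return x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        min_pos, max_pos, gscale = ctx.cfg
        # x assumed (num_frames, num_channels): per-channel positive fraction
        pos = (x > 0).float().mean(dim=0, keepdim=True)
        # +1 where too few positives (push values up), -1 where too many
        factor = (pos < min_pos).float() - (pos > max_pos).float()
        extra = -factor * gscale * grad_out.abs().mean()
        return grad_out + extra, None, None, None

def balancer(x, prob=0.125, min_positive=0.05, max_positive=0.95):
    # 'prob' mirrors the ...balancer.prob values logged above: the
    # correction is only applied on a random subset of steps.
    if random.random() < prob:
        return BalancerSketch.apply(x, min_positive, max_positive, 0.01)
    return x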
2023-06-24 06:19:53,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.198e+02 7.060e+02 1.085e+03 1.594e+03 2.809e+03, threshold=2.169e+03, percent-clipped=13.0
2023-06-24 06:20:16,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675992.0, ans=0.1
2023-06-24 06:20:24,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1676052.0, ans=0.0
2023-06-24 06:20:27,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1676052.0, ans=0.125
2023-06-24 06:20:29,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1676052.0, ans=0.035
2023-06-24 06:20:45,005 INFO [train.py:996] (3/4) Epoch 10, batch 4900, loss[loss=0.2091, simple_loss=0.288, pruned_loss=0.06509, over 21602.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3069, pruned_loss=0.08023, over 4276363.48 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:21:22,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0
2023-06-24 06:21:49,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0
2023-06-24 06:22:29,688 INFO [train.py:996] (3/4) Epoch 10, batch 4950, loss[loss=0.1921, simple_loss=0.2602, pruned_loss=0.06201, over 21861.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3109, pruned_loss=0.07878, over 4272056.03 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:22:37,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1676412.0, ans=0.0
2023-06-24 06:23:12,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.943e+02 9.817e+02 1.512e+03 3.334e+03, threshold=1.963e+03, percent-clipped=7.0
2023-06-24 06:23:41,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0
2023-06-24 06:24:02,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1676712.0, ans=0.0
2023-06-24 06:24:03,548 INFO [train.py:996] (3/4) Epoch 10, batch 5000, loss[loss=0.2086, simple_loss=0.2857, pruned_loss=0.06578, over 21506.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.309, pruned_loss=0.07534, over 4273283.86 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:24:38,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1676772.0, ans=0.0
2023-06-24 06:24:42,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1676772.0, ans=0.09899494936611666
2023-06-24 06:24:45,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1676832.0, ans=0.125
2023-06-24 06:25:41,898 INFO [train.py:996] (3/4) Epoch 10, batch 5050, loss[loss=0.2377, simple_loss=0.3038, pruned_loss=0.08578, over 21883.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3089, pruned_loss=0.07678, over 4279282.33 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:25:45,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1677012.0, ans=0.05
2023-06-24 06:26:01,887 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:26:30,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 6.527e+02 8.985e+02 1.399e+03 2.450e+03, threshold=1.797e+03, percent-clipped=5.0
2023-06-24 06:27:20,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1677312.0, ans=15.0
2023-06-24 06:27:21,190 INFO [train.py:996] (3/4) Epoch 10, batch 5100, loss[loss=0.2162, simple_loss=0.2951, pruned_loss=0.06866, over 21781.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3083, pruned_loss=0.07816, over 4281104.12 frames. ], batch size: 414, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:28:33,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1677492.0, ans=0.125
2023-06-24 06:29:01,697 INFO [train.py:996] (3/4) Epoch 10, batch 5150, loss[loss=0.2213, simple_loss=0.2903, pruned_loss=0.07618, over 21957.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3078, pruned_loss=0.0792, over 4285175.92 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:29:11,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1677612.0, ans=0.0
2023-06-24 06:29:47,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1677732.0, ans=0.1
2023-06-24 06:29:47,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1677732.0, ans=0.0
2023-06-24 06:29:50,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 6.243e+02 9.247e+02 1.548e+03 4.552e+03, threshold=1.849e+03, percent-clipped=17.0
2023-06-24 06:30:41,395 INFO [train.py:996] (3/4) Epoch 10, batch 5200, loss[loss=0.2827, simple_loss=0.3785, pruned_loss=0.09343, over 21515.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3101, pruned_loss=0.08004, over 4285985.27 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:31:08,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1677972.0, ans=0.125
2023-06-24 06:31:13,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0
2023-06-24 06:31:15,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1677972.0, ans=0.0
2023-06-24 06:31:49,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678092.0, ans=0.1
2023-06-24 06:32:20,301 INFO [train.py:996] (3/4) Epoch 10, batch 5250, loss[loss=0.2477, simple_loss=0.3326, pruned_loss=0.0814, over 21826.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3136, pruned_loss=0.07804, over 4280951.58 frames. ], batch size: 371, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:32:21,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0
2023-06-24 06:32:43,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1678212.0, ans=0.0
2023-06-24 06:32:49,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1678272.0, ans=0.125
2023-06-24 06:33:09,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 5.630e+02 8.256e+02 1.146e+03 2.990e+03, threshold=1.651e+03, percent-clipped=4.0
2023-06-24 06:33:39,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1678392.0, ans=0.125
2023-06-24 06:34:00,821 INFO [train.py:996] (3/4) Epoch 10, batch 5300, loss[loss=0.2177, simple_loss=0.2965, pruned_loss=0.06945, over 21875.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.312, pruned_loss=0.07877, over 4276215.89 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:34:04,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1678512.0, ans=10.0
2023-06-24 06:34:29,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1678572.0, ans=0.125
2023-06-24 06:34:53,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1678632.0, ans=0.125
2023-06-24 06:35:04,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1678692.0, ans=0.2
2023-06-24 06:35:07,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678692.0, ans=0.1
2023-06-24 06:35:21,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0
2023-06-24 06:35:23,007 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:35:32,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678752.0, ans=0.1
2023-06-24 06:35:34,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1678752.0, ans=0.2
2023-06-24 06:35:38,816 INFO [train.py:996] (3/4) Epoch 10, batch 5350, loss[loss=0.2431, simple_loss=0.3092, pruned_loss=0.08846, over 21955.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3108, pruned_loss=0.07979, over 4276871.75 frames. ], batch size: 333, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:36:00,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1678872.0, ans=0.0
2023-06-24 06:36:23,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 6.516e+02 8.462e+02 1.218e+03 2.526e+03, threshold=1.692e+03, percent-clipped=10.0
2023-06-24 06:37:13,517 INFO [train.py:996] (3/4) Epoch 10, batch 5400, loss[loss=0.2195, simple_loss=0.2901, pruned_loss=0.07445, over 21657.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.31, pruned_loss=0.08128, over 4281684.63 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:38:58,872 INFO [train.py:996] (3/4) Epoch 10, batch 5450, loss[loss=0.2676, simple_loss=0.4102, pruned_loss=0.06251, over 19788.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3118, pruned_loss=0.08009, over 4283098.40 frames. ], batch size: 702, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:38:59,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1679412.0, ans=0.125
2023-06-24 06:38:59,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1679412.0, ans=0.125
2023-06-24 06:39:31,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1679472.0, ans=0.125
2023-06-24 06:39:53,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.727e+02 1.128e+03 1.838e+03 3.883e+03, threshold=2.256e+03, percent-clipped=29.0
2023-06-24 06:40:01,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1679592.0, ans=0.125
2023-06-24 06:40:08,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679592.0, ans=0.1
2023-06-24 06:40:26,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1679652.0, ans=0.125
2023-06-24 06:40:28,349 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:40:41,150 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:40:42,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1679712.0, ans=0.0
2023-06-24 06:40:43,772 INFO [train.py:996] (3/4) Epoch 10, batch 5500, loss[loss=0.2416, simple_loss=0.3366, pruned_loss=0.07328, over 21660.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3166, pruned_loss=0.07666, over 4283662.36 frames. ], batch size: 389, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:40:46,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1679712.0, ans=0.125
2023-06-24 06:41:16,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679772.0, ans=0.1
2023-06-24 06:41:19,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 06:42:05,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679892.0, ans=0.1
2023-06-24 06:42:31,483 INFO [train.py:996] (3/4) Epoch 10, batch 5550, loss[loss=0.1718, simple_loss=0.2676, pruned_loss=0.03794, over 21791.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3147, pruned_loss=0.0738, over 4272545.29 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:42:58,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0
2023-06-24 06:43:16,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.799e+02 8.739e+02 1.452e+03 3.739e+03, threshold=1.748e+03, percent-clipped=10.0
2023-06-24 06:43:18,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1680132.0, ans=0.025
2023-06-24 06:44:05,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1680252.0, ans=0.125
2023-06-24 06:44:10,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1680312.0, ans=0.2
2023-06-24 06:44:16,149 INFO [train.py:996] (3/4) Epoch 10, batch 5600, loss[loss=0.1966, simple_loss=0.2815, pruned_loss=0.05586, over 21197.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3119, pruned_loss=0.07092, over 4277252.21 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 32.0
2023-06-24 06:44:31,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-06-24 06:44:44,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1680372.0, ans=0.125
2023-06-24 06:44:45,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1680372.0, ans=0.0
2023-06-24 06:45:07,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1680432.0, ans=0.0
2023-06-24 06:45:12,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1680492.0, ans=0.0
2023-06-24 06:45:15,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1680492.0, ans=0.0
2023-06-24 06:45:24,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5
2023-06-24 06:45:55,093 INFO [train.py:996] (3/4) Epoch 10, batch 5650, loss[loss=0.2404, simple_loss=0.3078, pruned_loss=0.08655, over 21204.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3143, pruned_loss=0.07223, over 4275330.36 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0
2023-06-24 06:46:04,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1680612.0, ans=0.0
2023-06-24 06:46:08,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1680612.0, ans=0.0
2023-06-24 06:46:46,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.674e+02 6.342e+02 8.278e+02 1.256e+03 3.323e+03, threshold=1.656e+03, percent-clipped=10.0
2023-06-24 06:47:26,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1680852.0, ans=0.125
2023-06-24 06:47:34,637 INFO [train.py:996] (3/4) Epoch 10, batch 5700, loss[loss=0.2575, simple_loss=0.338, pruned_loss=0.08857, over 21802.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3139, pruned_loss=0.0743, over 4284380.06 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 16.0
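The various `*_skip_rate` knobs above (attention_skip_rate, conv_skip_rate, ff2_skip_rate, bypass.skip_rate) are scheduled probabilities of stochastically dropping a sub-module's contribution during training, in the spirit of stochastic depth: with probability skip_rate the layer acts as an identity through its bypass connection. Below is a toy stand-in for illustration; the actual Zipformer bypass also learns per-channel bypass scales (cf. the bypass.scale_min entries), which this sketch omits.

```python
import torch

class StochasticBypass(torch.nn.Module):
    """Wraps a sub-module so that, in training, its residual contribution
    is dropped with probability skip_rate; a simplified stand-in for the
    bypass mechanism behind the ...skip_rate log entries."""

    def __init__(self, module: torch.nn.Module, skip_rate: float = 0.07):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate  # in the recipe this would be scheduled

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.skip_rate:
            return x                   # skip: identity / pure bypass
        return x + self.module(x)      # keep the residual contribution

layer = StochasticBypass(torch.nn.Linear(256, 256), skip_rate=0.07)
out = layer(torch.randn(10, 256))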
2023-06-24 06:48:22,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1681032.0, ans=0.125
2023-06-24 06:49:01,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1681152.0, ans=0.0
2023-06-24 06:49:09,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1681152.0, ans=0.1
2023-06-24 06:49:15,852 INFO [train.py:996] (3/4) Epoch 10, batch 5750, loss[loss=0.1882, simple_loss=0.2861, pruned_loss=0.04513, over 21758.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3123, pruned_loss=0.07215, over 4274841.35 frames. ], batch size: 282, lr: 2.98e-03, grad_scale: 16.0
2023-06-24 06:49:20,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0
2023-06-24 06:50:12,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.981e+02 6.985e+02 1.085e+03 1.966e+03 4.482e+03, threshold=2.170e+03, percent-clipped=31.0
2023-06-24 06:50:35,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1681392.0, ans=0.125
2023-06-24 06:50:55,672 INFO [train.py:996] (3/4) Epoch 10, batch 5800, loss[loss=0.2288, simple_loss=0.3226, pruned_loss=0.06749, over 21724.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3109, pruned_loss=0.0709, over 4267677.05 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0
2023-06-24 06:51:02,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1681512.0, ans=0.125
2023-06-24 06:51:10,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0
2023-06-24 06:52:40,266 INFO [train.py:996] (3/4) Epoch 10, batch 5850, loss[loss=0.1734, simple_loss=0.2824, pruned_loss=0.03215, over 21767.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3093, pruned_loss=0.06746, over 4265277.92 frames. ], batch size: 282, lr: 2.98e-03, grad_scale: 16.0
2023-06-24 06:52:56,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1681812.0, ans=0.07
2023-06-24 06:53:36,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 5.271e+02 8.161e+02 1.450e+03 2.978e+03, threshold=1.632e+03, percent-clipped=6.0
2023-06-24 06:53:41,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1681992.0, ans=0.125
2023-06-24 06:54:03,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0
2023-06-24 06:54:08,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1682052.0, ans=0.2
2023-06-24 06:54:10,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1682052.0, ans=0.125
2023-06-24 06:54:23,925 INFO [train.py:996] (3/4) Epoch 10, batch 5900, loss[loss=0.1473, simple_loss=0.2458, pruned_loss=0.02441, over 21732.00 frames. ], tot_loss[loss=0.212, simple_loss=0.3009, pruned_loss=0.06153, over 4269271.32 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0
], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:55:03,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1682232.0, ans=0.0 2023-06-24 06:55:05,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1682232.0, ans=0.125 2023-06-24 06:56:02,443 INFO [train.py:996] (3/4) Epoch 10, batch 5950, loss[loss=0.2045, simple_loss=0.2748, pruned_loss=0.0671, over 21481.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2996, pruned_loss=0.06538, over 4281509.38 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:56:25,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=12.0 2023-06-24 06:56:38,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1682472.0, ans=0.125 2023-06-24 06:56:52,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.371e+02 5.849e+02 7.878e+02 1.100e+03 2.007e+03, threshold=1.576e+03, percent-clipped=6.0 2023-06-24 06:57:04,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-24 06:57:40,426 INFO [train.py:996] (3/4) Epoch 10, batch 6000, loss[loss=0.2112, simple_loss=0.2728, pruned_loss=0.07474, over 21668.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2956, pruned_loss=0.06832, over 4276902.97 frames. ], batch size: 299, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 06:57:40,427 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 06:57:59,725 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2611, simple_loss=0.3564, pruned_loss=0.0829, over 1796401.00 frames. 2023-06-24 06:57:59,725 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 06:58:23,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1682772.0, ans=0.125 2023-06-24 06:58:31,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-24 06:59:09,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1682892.0, ans=0.2 2023-06-24 06:59:38,434 INFO [train.py:996] (3/4) Epoch 10, batch 6050, loss[loss=0.2204, simple_loss=0.2787, pruned_loss=0.0811, over 21528.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2907, pruned_loss=0.06998, over 4275527.85 frames. ], batch size: 442, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:00:26,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.624e+02 5.765e+02 7.695e+02 1.085e+03 2.275e+03, threshold=1.539e+03, percent-clipped=10.0 2023-06-24 07:01:17,074 INFO [train.py:996] (3/4) Epoch 10, batch 6100, loss[loss=0.2311, simple_loss=0.3014, pruned_loss=0.08041, over 21978.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2903, pruned_loss=0.06886, over 4273900.72 frames. 
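The scaling.py:962 lines fire when a whitening diagnostic exceeds its limit (metric=X vs. limit=Y). Such a metric is scale-invariant: it reaches a floor of 1.0 when the per-group feature covariance is proportional to the identity, and grows as channels become correlated or unevenly scaled. The following is a plausible reconstruction of one such metric under that assumption, not the verbatim scaling.py code.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    """Sketch of a whitening diagnostic: equals 1.0 when the per-group
    feature covariance is a multiple of the identity, and grows as the
    features become correlated or unevenly scaled (assumed formulation)."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x.transpose(0, 1)                       # (groups, frames, chans/group)
    covar = torch.matmul(x.transpose(1, 2), x)  # per-group (unnormalized) covariance
    mean_diag = covar.diagonal(dim1=1, dim2=2).mean()
    # Scale-invariant ratio; equals 1 iff covar is c * I in every group.
    return ((covar ** 2).sum() / (mean_diag ** 2 * num_channels)).item()


x = torch.randn(1000, 256)
print(whitening_metric(x, num_groups=1))                    # roughly white: near 1
low_rank = x @ (torch.randn(256, 8) @ torch.randn(8, 256))  # rank-8 features
print(whitening_metric(low_rank, num_groups=1))             # far above 1
```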
], batch size: 113, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:01:35,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1683372.0, ans=0.0 2023-06-24 07:01:51,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1683432.0, ans=0.2 2023-06-24 07:02:01,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1683432.0, ans=0.125 2023-06-24 07:02:18,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1683492.0, ans=0.125 2023-06-24 07:02:59,554 INFO [train.py:996] (3/4) Epoch 10, batch 6150, loss[loss=0.2099, simple_loss=0.282, pruned_loss=0.06893, over 21150.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2938, pruned_loss=0.07101, over 4276547.26 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:03:03,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1683612.0, ans=0.0 2023-06-24 07:03:16,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1683672.0, ans=0.125 2023-06-24 07:03:36,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683732.0, ans=0.1 2023-06-24 07:03:51,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.961e+02 6.743e+02 9.296e+02 1.382e+03 3.230e+03, threshold=1.859e+03, percent-clipped=18.0 2023-06-24 07:03:55,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1683792.0, ans=0.125 2023-06-24 07:04:06,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1683792.0, ans=0.125 2023-06-24 07:04:38,153 INFO [train.py:996] (3/4) Epoch 10, batch 6200, loss[loss=0.1831, simple_loss=0.2402, pruned_loss=0.06303, over 17221.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2989, pruned_loss=0.07282, over 4279648.55 frames. ], batch size: 66, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:04:55,718 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:05:05,543 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:05:05,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1683972.0, ans=0.09899494936611666 2023-06-24 07:05:08,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1684032.0, ans=0.0 2023-06-24 07:05:49,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-24 07:06:17,930 INFO [train.py:996] (3/4) Epoch 10, batch 6250, loss[loss=0.2186, simple_loss=0.3258, pruned_loss=0.05569, over 21843.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3044, pruned_loss=0.07294, over 4286174.14 frames. 
], batch size: 371, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:06:19,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1684212.0, ans=0.125 2023-06-24 07:06:21,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684212.0, ans=0.1 2023-06-24 07:07:01,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-24 07:07:09,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.386e+02 1.187e+03 1.704e+03 4.027e+03, threshold=2.375e+03, percent-clipped=21.0 2023-06-24 07:07:13,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1684392.0, ans=0.09899494936611666 2023-06-24 07:07:56,044 INFO [train.py:996] (3/4) Epoch 10, batch 6300, loss[loss=0.2356, simple_loss=0.3416, pruned_loss=0.06485, over 20837.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3074, pruned_loss=0.07189, over 4288590.00 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:07:58,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1684512.0, ans=0.125 2023-06-24 07:08:03,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684512.0, ans=0.1 2023-06-24 07:09:34,386 INFO [train.py:996] (3/4) Epoch 10, batch 6350, loss[loss=0.2538, simple_loss=0.3615, pruned_loss=0.07303, over 20882.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3113, pruned_loss=0.07596, over 4290735.81 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:09:38,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1684812.0, ans=0.0 2023-06-24 07:10:27,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 5.999e+02 8.518e+02 1.321e+03 2.305e+03, threshold=1.704e+03, percent-clipped=0.0 2023-06-24 07:10:53,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1684992.0, ans=0.05 2023-06-24 07:11:00,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1685052.0, ans=0.09899494936611666 2023-06-24 07:11:14,589 INFO [train.py:996] (3/4) Epoch 10, batch 6400, loss[loss=0.2345, simple_loss=0.3086, pruned_loss=0.0802, over 21496.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.318, pruned_loss=0.08081, over 4288121.36 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:11:39,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1685172.0, ans=0.0 2023-06-24 07:12:59,589 INFO [train.py:996] (3/4) Epoch 10, batch 6450, loss[loss=0.2183, simple_loss=0.2881, pruned_loss=0.07427, over 21116.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3184, pruned_loss=0.07892, over 4279877.62 frames. 
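grad_scale in the batch records is the mixed-precision loss-scaling factor; it moves in powers of two (16.0, 32.0 in the entries above) as training alternates between overflow backoffs and stable growth. Below is a minimal sketch of that loop using PyTorch's stock torch.cuda.amp utilities; the model, learning rate, and data are placeholders, and the recipe's own scaler management may differ.

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = torch.nn.Linear(80, 512).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.003)
# The scale starts at init_scale and moves in powers of two: it halves on
# overflow and grows again after a stretch of stable steps (PyTorch defaults).
scaler = torch.cuda.amp.GradScaler(init_scale=1.0, enabled=use_cuda)

for step in range(100):
    x = torch.randn(8, 80, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_cuda):  # fp16 forward on GPU
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # skips the update if grads hit inf/nan
    scaler.update()
    # scaler.get_scale() is the value a log line would report as grad_scale
```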
], batch size: 143, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:13:56,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.370e+02 9.137e+02 1.216e+03 1.629e+03 2.950e+03, threshold=2.432e+03, percent-clipped=21.0 2023-06-24 07:13:58,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685532.0, ans=0.1 2023-06-24 07:14:10,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1685592.0, ans=0.125 2023-06-24 07:14:30,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 07:14:41,230 INFO [train.py:996] (3/4) Epoch 10, batch 6500, loss[loss=0.2083, simple_loss=0.2733, pruned_loss=0.07169, over 21325.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3109, pruned_loss=0.07742, over 4280853.68 frames. ], batch size: 159, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:14:50,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-24 07:14:59,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1685772.0, ans=0.125 2023-06-24 07:15:02,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1685772.0, ans=0.125 2023-06-24 07:15:05,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-24 07:15:06,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1685772.0, ans=0.0 2023-06-24 07:15:43,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1685892.0, ans=0.125 2023-06-24 07:15:45,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1685892.0, ans=0.125 2023-06-24 07:15:56,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-24 07:16:19,602 INFO [train.py:996] (3/4) Epoch 10, batch 6550, loss[loss=0.2434, simple_loss=0.3116, pruned_loss=0.08762, over 21758.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3089, pruned_loss=0.07681, over 4280528.24 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:16:20,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=8.0 2023-06-24 07:16:47,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1686072.0, ans=0.5 2023-06-24 07:16:48,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1686072.0, ans=0.125 2023-06-24 07:16:57,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1686072.0, ans=0.0 2023-06-24 07:17:18,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 5.692e+02 8.877e+02 1.428e+03 2.273e+03, threshold=1.775e+03, percent-clipped=0.0 2023-06-24 07:17:18,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1686132.0, ans=0.0 2023-06-24 07:17:33,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1686192.0, ans=0.07 2023-06-24 07:17:35,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1686192.0, ans=0.125 2023-06-24 07:17:43,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1686252.0, ans=0.0 2023-06-24 07:18:03,358 INFO [train.py:996] (3/4) Epoch 10, batch 6600, loss[loss=0.2139, simple_loss=0.2744, pruned_loss=0.07668, over 21776.00 frames. ], tot_loss[loss=0.23, simple_loss=0.305, pruned_loss=0.07748, over 4276457.42 frames. ], batch size: 102, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:18:14,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1686312.0, ans=0.125 2023-06-24 07:18:14,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1686312.0, ans=0.04949747468305833 2023-06-24 07:19:05,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1686492.0, ans=0.1 2023-06-24 07:19:37,642 INFO [train.py:996] (3/4) Epoch 10, batch 6650, loss[loss=0.2062, simple_loss=0.2825, pruned_loss=0.06493, over 21796.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2976, pruned_loss=0.07417, over 4276171.27 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:20:25,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1686732.0, ans=0.125 2023-06-24 07:20:31,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 5.733e+02 1.031e+03 1.471e+03 3.342e+03, threshold=2.062e+03, percent-clipped=12.0 2023-06-24 07:21:15,678 INFO [train.py:996] (3/4) Epoch 10, batch 6700, loss[loss=0.204, simple_loss=0.2619, pruned_loss=0.07311, over 21842.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2934, pruned_loss=0.07371, over 4278994.87 frames. 
], batch size: 107, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:21:22,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1686912.0, ans=0.0 2023-06-24 07:21:33,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1686972.0, ans=0.05 2023-06-24 07:22:04,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1687032.0, ans=0.125 2023-06-24 07:22:15,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-24 07:22:53,748 INFO [train.py:996] (3/4) Epoch 10, batch 6750, loss[loss=0.2053, simple_loss=0.268, pruned_loss=0.07132, over 21405.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2917, pruned_loss=0.07389, over 4268923.77 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:22:58,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-24 07:23:32,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-24 07:23:47,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.480e+02 8.146e+02 1.101e+03 1.861e+03, threshold=1.629e+03, percent-clipped=0.0 2023-06-24 07:23:52,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1687392.0, ans=0.0 2023-06-24 07:24:08,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1687392.0, ans=0.035 2023-06-24 07:24:32,352 INFO [train.py:996] (3/4) Epoch 10, batch 6800, loss[loss=0.2202, simple_loss=0.2777, pruned_loss=0.08139, over 21620.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2945, pruned_loss=0.07709, over 4273376.81 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:24:49,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1687572.0, ans=0.0 2023-06-24 07:25:21,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1687632.0, ans=0.125 2023-06-24 07:26:10,766 INFO [train.py:996] (3/4) Epoch 10, batch 6850, loss[loss=0.2598, simple_loss=0.3116, pruned_loss=0.104, over 21279.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2922, pruned_loss=0.07799, over 4271080.23 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:26:16,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687812.0, ans=0.1 2023-06-24 07:26:19,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1687812.0, ans=0.0 2023-06-24 07:26:21,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1687812.0, ans=0.0 2023-06-24 07:26:28,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. 
limit=15.0 2023-06-24 07:26:40,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1687872.0, ans=0.07 2023-06-24 07:26:55,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1687932.0, ans=0.125 2023-06-24 07:27:08,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.000e+02 9.900e+02 1.479e+03 3.025e+03, threshold=1.980e+03, percent-clipped=16.0 2023-06-24 07:27:51,228 INFO [train.py:996] (3/4) Epoch 10, batch 6900, loss[loss=0.2354, simple_loss=0.3103, pruned_loss=0.08024, over 21803.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.294, pruned_loss=0.07787, over 4278746.22 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:28:16,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1688172.0, ans=0.125 2023-06-24 07:28:17,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688172.0, ans=0.0 2023-06-24 07:28:22,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688172.0, ans=0.0 2023-06-24 07:28:36,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1688232.0, ans=0.5 2023-06-24 07:29:32,980 INFO [train.py:996] (3/4) Epoch 10, batch 6950, loss[loss=0.2344, simple_loss=0.3095, pruned_loss=0.07968, over 21605.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2968, pruned_loss=0.07536, over 4281523.61 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:29:34,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1688412.0, ans=0.125 2023-06-24 07:29:58,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-24 07:30:33,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1688532.0, ans=0.1 2023-06-24 07:30:34,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.339e+02 9.938e+02 1.554e+03 2.681e+03, threshold=1.988e+03, percent-clipped=10.0 2023-06-24 07:31:06,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688652.0, ans=0.1 2023-06-24 07:31:08,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1688652.0, ans=0.5 2023-06-24 07:31:12,642 INFO [train.py:996] (3/4) Epoch 10, batch 7000, loss[loss=0.1889, simple_loss=0.2658, pruned_loss=0.05595, over 21809.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3013, pruned_loss=0.07889, over 4285106.67 frames. 
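Each batch line pairs a per-batch loss (loss[... over that batch's frames]) with a running aggregate (tot_loss[... over a cumulative frame count]), which reads as a frame-weighted running average of each loss component. A small sketch under that assumption follows; the class name and reset policy are hypothetical.

```python
class LossTracker:
    """Sketch: frame-weighted running averages of loss components,
    matching the loss[...] / tot_loss[... over N frames] pattern (assumed)."""

    def __init__(self):
        self.sums = {}      # component name -> sum of (loss * frames)
        self.frames = 0.0

    def update(self, frames: float, **components: float):
        self.frames += frames
        for name, value in components.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * frames

    def averages(self):
        return {name: s / self.frames for name, s in self.sums.items()}


tot = LossTracker()
tot.update(21809.0, loss=0.1889, simple_loss=0.2658, pruned_loss=0.05595)
# After many batches, tot.averages() plays the role of tot_loss[...] and
# tot.frames the "over N frames" count; a fresh tracker would model a reset.
```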
], batch size: 118, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:31:13,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1688712.0, ans=0.125 2023-06-24 07:32:27,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1688892.0, ans=0.1 2023-06-24 07:32:52,791 INFO [train.py:996] (3/4) Epoch 10, batch 7050, loss[loss=0.2282, simple_loss=0.3206, pruned_loss=0.0679, over 21646.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2977, pruned_loss=0.07707, over 4281694.61 frames. ], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:33:05,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1689012.0, ans=0.2 2023-06-24 07:34:00,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 8.213e+02 1.284e+03 1.969e+03 3.755e+03, threshold=2.569e+03, percent-clipped=21.0 2023-06-24 07:34:18,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1689252.0, ans=0.0 2023-06-24 07:34:41,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-24 07:34:43,732 INFO [train.py:996] (3/4) Epoch 10, batch 7100, loss[loss=0.1916, simple_loss=0.259, pruned_loss=0.06213, over 16554.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.304, pruned_loss=0.07875, over 4272955.22 frames. ], batch size: 61, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:34:47,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1689312.0, ans=0.0 2023-06-24 07:34:52,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-24 07:34:56,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=12.0 2023-06-24 07:35:15,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1689372.0, ans=0.1 2023-06-24 07:35:18,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689372.0, ans=0.125 2023-06-24 07:35:44,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-24 07:36:01,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1689552.0, ans=0.125 2023-06-24 07:36:24,351 INFO [train.py:996] (3/4) Epoch 10, batch 7150, loss[loss=0.2606, simple_loss=0.3354, pruned_loss=0.09287, over 21634.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.07749, over 4274091.04 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:36:44,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1689672.0, ans=0.0 2023-06-24 07:36:54,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. 
limit=12.0 2023-06-24 07:36:55,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1689672.0, ans=0.0 2023-06-24 07:37:20,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 6.261e+02 9.228e+02 1.360e+03 3.235e+03, threshold=1.846e+03, percent-clipped=6.0 2023-06-24 07:37:22,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1689792.0, ans=0.125 2023-06-24 07:37:27,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1689792.0, ans=0.125 2023-06-24 07:38:04,263 INFO [train.py:996] (3/4) Epoch 10, batch 7200, loss[loss=0.222, simple_loss=0.2813, pruned_loss=0.08129, over 21364.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3068, pruned_loss=0.08069, over 4265907.49 frames. ], batch size: 473, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:38:14,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1689912.0, ans=0.125 2023-06-24 07:38:14,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1689912.0, ans=0.125 2023-06-24 07:39:00,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1690092.0, ans=0.2 2023-06-24 07:39:26,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1690152.0, ans=0.2 2023-06-24 07:39:33,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1690152.0, ans=0.125 2023-06-24 07:39:33,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-24 07:39:38,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1690152.0, ans=0.0 2023-06-24 07:39:44,053 INFO [train.py:996] (3/4) Epoch 10, batch 7250, loss[loss=0.2228, simple_loss=0.292, pruned_loss=0.07682, over 21743.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3019, pruned_loss=0.07983, over 4270500.70 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:39:44,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690212.0, ans=0.1 2023-06-24 07:40:22,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690272.0, ans=0.1 2023-06-24 07:40:45,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.130e+02 8.584e+02 1.247e+03 2.821e+03, threshold=1.717e+03, percent-clipped=3.0 2023-06-24 07:41:22,770 INFO [train.py:996] (3/4) Epoch 10, batch 7300, loss[loss=0.2002, simple_loss=0.2638, pruned_loss=0.06832, over 21656.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2955, pruned_loss=0.07852, over 4269240.39 frames. ], batch size: 333, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:42:37,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.51 vs. 
limit=22.5 2023-06-24 07:42:44,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1690752.0, ans=0.125 2023-06-24 07:42:45,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-24 07:43:08,164 INFO [train.py:996] (3/4) Epoch 10, batch 7350, loss[loss=0.2429, simple_loss=0.3078, pruned_loss=0.08897, over 21303.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2946, pruned_loss=0.07907, over 4272492.41 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:43:11,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690812.0, ans=0.1 2023-06-24 07:43:13,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1690812.0, ans=0.125 2023-06-24 07:43:29,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1690872.0, ans=0.0 2023-06-24 07:43:36,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1690872.0, ans=0.125 2023-06-24 07:43:41,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1690872.0, ans=0.125 2023-06-24 07:43:41,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1690872.0, ans=0.0 2023-06-24 07:43:55,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1690932.0, ans=10.0 2023-06-24 07:44:00,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1690932.0, ans=0.125 2023-06-24 07:44:06,751 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.919e+02 7.422e+02 1.084e+03 1.496e+03 4.269e+03, threshold=2.168e+03, percent-clipped=20.0 2023-06-24 07:44:49,765 INFO [train.py:996] (3/4) Epoch 10, batch 7400, loss[loss=0.2226, simple_loss=0.3187, pruned_loss=0.06324, over 21624.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3006, pruned_loss=0.08114, over 4274702.08 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:45:14,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-24 07:45:26,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1691172.0, ans=0.125 2023-06-24 07:45:34,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1691232.0, ans=0.125 2023-06-24 07:46:07,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 07:46:23,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691352.0, ans=0.0 2023-06-24 07:46:29,126 INFO [train.py:996] (3/4) Epoch 10, batch 7450, loss[loss=0.2027, simple_loss=0.2706, pruned_loss=0.0674, over 21826.00 frames. 
], tot_loss[loss=0.2297, simple_loss=0.2997, pruned_loss=0.07988, over 4275753.70 frames. ], batch size: 352, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:46:58,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691472.0, ans=0.1 2023-06-24 07:47:32,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.014e+02 8.284e+02 1.438e+03 2.557e+03, threshold=1.657e+03, percent-clipped=4.0 2023-06-24 07:47:38,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-24 07:48:15,227 INFO [train.py:996] (3/4) Epoch 10, batch 7500, loss[loss=0.2385, simple_loss=0.3353, pruned_loss=0.07091, over 21444.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3053, pruned_loss=0.08054, over 4276788.93 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:48:26,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1691712.0, ans=0.125 2023-06-24 07:49:01,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1691832.0, ans=0.0 2023-06-24 07:49:17,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1691892.0, ans=0.2 2023-06-24 07:49:42,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1691952.0, ans=0.125 2023-06-24 07:49:56,497 INFO [train.py:996] (3/4) Epoch 10, batch 7550, loss[loss=0.1984, simple_loss=0.2562, pruned_loss=0.0703, over 20239.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3116, pruned_loss=0.07964, over 4273392.02 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:50:23,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1692072.0, ans=0.125 2023-06-24 07:50:26,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1692072.0, ans=0.5 2023-06-24 07:50:44,943 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:50:52,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.375e+02 7.111e+02 1.164e+03 1.789e+03 2.953e+03, threshold=2.328e+03, percent-clipped=32.0 2023-06-24 07:51:02,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1692192.0, ans=0.0 2023-06-24 07:51:34,167 INFO [train.py:996] (3/4) Epoch 10, batch 7600, loss[loss=0.2241, simple_loss=0.2964, pruned_loss=0.07592, over 21801.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.309, pruned_loss=0.07865, over 4272987.39 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:51:41,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.37 vs. limit=10.0 2023-06-24 07:51:49,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. 
limit=15.0 2023-06-24 07:51:51,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-24 07:51:55,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-24 07:51:59,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1692372.0, ans=0.2 2023-06-24 07:52:07,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1692372.0, ans=0.025 2023-06-24 07:52:32,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1692492.0, ans=0.125 2023-06-24 07:53:05,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1692552.0, ans=0.125 2023-06-24 07:53:09,040 INFO [train.py:996] (3/4) Epoch 10, batch 7650, loss[loss=0.2519, simple_loss=0.3176, pruned_loss=0.09315, over 21891.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3082, pruned_loss=0.07976, over 4278739.27 frames. ], batch size: 124, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:54:11,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.476e+02 5.649e+02 6.864e+02 9.584e+02 1.979e+03, threshold=1.373e+03, percent-clipped=0.0 2023-06-24 07:54:58,719 INFO [train.py:996] (3/4) Epoch 10, batch 7700, loss[loss=0.2257, simple_loss=0.2996, pruned_loss=0.07586, over 21613.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3099, pruned_loss=0.08189, over 4279458.97 frames. ], batch size: 263, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:55:32,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=8.0 2023-06-24 07:55:34,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1693032.0, ans=0.125 2023-06-24 07:55:38,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1693032.0, ans=0.2 2023-06-24 07:55:58,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693092.0, ans=0.1 2023-06-24 07:56:00,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-24 07:56:38,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1693152.0, ans=0.2 2023-06-24 07:56:40,746 INFO [train.py:996] (3/4) Epoch 10, batch 7750, loss[loss=0.2319, simple_loss=0.3222, pruned_loss=0.0708, over 21260.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3152, pruned_loss=0.08188, over 4277393.12 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 07:56:45,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.56 vs. limit=22.5 2023-06-24 07:57:31,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.92 vs. 
limit=22.5 2023-06-24 07:57:42,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 8.175e+02 1.275e+03 1.822e+03 5.282e+03, threshold=2.550e+03, percent-clipped=41.0 2023-06-24 07:58:21,722 INFO [train.py:996] (3/4) Epoch 10, batch 7800, loss[loss=0.2446, simple_loss=0.3257, pruned_loss=0.08177, over 21836.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3185, pruned_loss=0.08289, over 4281588.05 frames. ], batch size: 372, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 07:58:50,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.51 vs. limit=15.0 2023-06-24 07:58:59,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-24 07:59:31,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1693692.0, ans=0.0 2023-06-24 07:59:32,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1693692.0, ans=0.125 2023-06-24 08:00:00,781 INFO [train.py:996] (3/4) Epoch 10, batch 7850, loss[loss=0.2174, simple_loss=0.2774, pruned_loss=0.07868, over 21487.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3107, pruned_loss=0.08135, over 4276226.19 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:00:02,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1693812.0, ans=0.035 2023-06-24 08:01:02,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.400e+02 8.758e+02 1.300e+03 4.376e+03, threshold=1.752e+03, percent-clipped=3.0 2023-06-24 08:01:02,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1693992.0, ans=0.0 2023-06-24 08:01:19,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.08 vs. limit=12.0 2023-06-24 08:01:26,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1694052.0, ans=0.0 2023-06-24 08:01:33,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694052.0, ans=0.1 2023-06-24 08:01:33,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1694052.0, ans=0.125 2023-06-24 08:01:41,729 INFO [train.py:996] (3/4) Epoch 10, batch 7900, loss[loss=0.2344, simple_loss=0.3105, pruned_loss=0.07914, over 21573.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3044, pruned_loss=0.07993, over 4264563.02 frames. ], batch size: 230, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:02:16,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1694172.0, ans=0.125 2023-06-24 08:02:40,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1694232.0, ans=0.04949747468305833 2023-06-24 08:03:29,365 INFO [train.py:996] (3/4) Epoch 10, batch 7950, loss[loss=0.2916, simple_loss=0.3604, pruned_loss=0.1115, over 21519.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3088, pruned_loss=0.07882, over 4262194.64 frames. 
], batch size: 507, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:04:36,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.376e+02 7.152e+02 9.880e+02 1.480e+03 4.841e+03, threshold=1.976e+03, percent-clipped=16.0 2023-06-24 08:05:16,259 INFO [train.py:996] (3/4) Epoch 10, batch 8000, loss[loss=0.2974, simple_loss=0.3838, pruned_loss=0.1054, over 21400.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3145, pruned_loss=0.08177, over 4264281.24 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:05:22,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1694712.0, ans=0.2 2023-06-24 08:05:47,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1694772.0, ans=0.125 2023-06-24 08:05:55,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1694772.0, ans=0.0 2023-06-24 08:06:02,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694832.0, ans=0.1 2023-06-24 08:06:18,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1694832.0, ans=0.2 2023-06-24 08:06:23,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1694892.0, ans=0.125 2023-06-24 08:06:45,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1694952.0, ans=0.0 2023-06-24 08:07:06,344 INFO [train.py:996] (3/4) Epoch 10, batch 8050, loss[loss=0.2122, simple_loss=0.2654, pruned_loss=0.07949, over 21248.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3169, pruned_loss=0.08183, over 4263568.08 frames. ], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:07:32,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1695072.0, ans=0.125 2023-06-24 08:07:45,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-24 08:08:07,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.312e+02 7.120e+02 8.807e+02 1.513e+03 2.630e+03, threshold=1.761e+03, percent-clipped=8.0 2023-06-24 08:08:46,515 INFO [train.py:996] (3/4) Epoch 10, batch 8100, loss[loss=0.1869, simple_loss=0.2499, pruned_loss=0.062, over 20009.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3154, pruned_loss=0.08243, over 4270705.23 frames. ], batch size: 703, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:08:49,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-24 08:09:18,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-06-24 08:10:36,952 INFO [train.py:996] (3/4) Epoch 10, batch 8150, loss[loss=0.2087, simple_loss=0.264, pruned_loss=0.07669, over 20283.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3241, pruned_loss=0.08527, over 4265757.33 frames. 
], batch size: 703, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:11:43,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 7.483e+02 1.136e+03 1.755e+03 5.961e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 08:12:08,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1695852.0, ans=0.125 2023-06-24 08:12:08,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1695852.0, ans=0.0 2023-06-24 08:12:17,680 INFO [train.py:996] (3/4) Epoch 10, batch 8200, loss[loss=0.1865, simple_loss=0.2509, pruned_loss=0.06102, over 21609.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3138, pruned_loss=0.08224, over 4252744.56 frames. ], batch size: 231, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:12:46,928 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:12:48,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1695972.0, ans=0.0 2023-06-24 08:13:16,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-24 08:13:54,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-24 08:13:56,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 08:13:57,372 INFO [train.py:996] (3/4) Epoch 10, batch 8250, loss[loss=0.275, simple_loss=0.4037, pruned_loss=0.07311, over 20760.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3135, pruned_loss=0.08165, over 4258725.75 frames. ], batch size: 607, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:14:11,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-24 08:14:23,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1696272.0, ans=0.2 2023-06-24 08:15:01,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1696392.0, ans=0.125 2023-06-24 08:15:04,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.467e+02 8.535e+02 1.267e+03 3.280e+03, threshold=1.707e+03, percent-clipped=4.0 2023-06-24 08:15:38,161 INFO [train.py:996] (3/4) Epoch 10, batch 8300, loss[loss=0.2258, simple_loss=0.312, pruned_loss=0.06976, over 21712.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3135, pruned_loss=0.0788, over 4262627.24 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:15:47,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-24 08:16:09,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1696572.0, ans=0.0 2023-06-24 08:16:27,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-24 08:16:38,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1696632.0, ans=0.0 2023-06-24 08:16:57,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1696692.0, ans=0.09899494936611666 2023-06-24 08:16:58,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1696692.0, ans=0.0 2023-06-24 08:17:18,877 INFO [train.py:996] (3/4) Epoch 10, batch 8350, loss[loss=0.1875, simple_loss=0.2698, pruned_loss=0.05263, over 21451.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3132, pruned_loss=0.07737, over 4264806.20 frames. ], batch size: 212, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:17:55,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1696872.0, ans=0.125 2023-06-24 08:18:09,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1696932.0, ans=0.2 2023-06-24 08:18:11,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-24 08:18:29,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.071e+02 7.237e+02 1.103e+03 3.221e+03, threshold=1.447e+03, percent-clipped=5.0 2023-06-24 08:18:31,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1696992.0, ans=0.0 2023-06-24 08:18:36,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1696992.0, ans=0.125 2023-06-24 08:18:57,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1697052.0, ans=0.125 2023-06-24 08:18:59,489 INFO [train.py:996] (3/4) Epoch 10, batch 8400, loss[loss=0.2521, simple_loss=0.3402, pruned_loss=0.08202, over 21502.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3121, pruned_loss=0.07597, over 4265142.60 frames. ], batch size: 508, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:19:08,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1697112.0, ans=0.0 2023-06-24 08:19:37,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=22.5 2023-06-24 08:19:38,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1697172.0, ans=0.0 2023-06-24 08:19:46,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1697232.0, ans=0.0 2023-06-24 08:19:49,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1697232.0, ans=0.0 2023-06-24 08:20:01,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.82 vs. 
limit=22.5 2023-06-24 08:20:20,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1697292.0, ans=0.05 2023-06-24 08:20:33,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-24 08:20:36,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697352.0, ans=0.1 2023-06-24 08:20:39,131 INFO [train.py:996] (3/4) Epoch 10, batch 8450, loss[loss=0.2352, simple_loss=0.301, pruned_loss=0.08466, over 21828.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3119, pruned_loss=0.07577, over 4276079.06 frames. ], batch size: 124, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:20:59,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697412.0, ans=0.1 2023-06-24 08:21:51,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 9.053e+02 1.298e+03 1.951e+03 3.847e+03, threshold=2.596e+03, percent-clipped=39.0 2023-06-24 08:22:23,927 INFO [train.py:996] (3/4) Epoch 10, batch 8500, loss[loss=0.2057, simple_loss=0.2678, pruned_loss=0.07185, over 21477.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3089, pruned_loss=0.07727, over 4275901.93 frames. ], batch size: 212, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:22:25,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697712.0, ans=0.1 2023-06-24 08:22:43,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1697772.0, ans=0.04949747468305833 2023-06-24 08:22:53,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1697772.0, ans=0.0 2023-06-24 08:23:31,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-24 08:24:03,931 INFO [train.py:996] (3/4) Epoch 10, batch 8550, loss[loss=0.2366, simple_loss=0.3224, pruned_loss=0.0754, over 21799.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3118, pruned_loss=0.07963, over 4271304.46 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:24:37,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1698072.0, ans=0.2 2023-06-24 08:24:43,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1698132.0, ans=0.125 2023-06-24 08:25:09,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.72 vs. 
limit=15.0 2023-06-24 08:25:09,874 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.596e+02 7.231e+02 1.146e+03 1.740e+03 4.216e+03, threshold=2.291e+03, percent-clipped=13.0 2023-06-24 08:25:19,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1698192.0, ans=0.2 2023-06-24 08:25:22,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1698252.0, ans=0.125 2023-06-24 08:25:25,720 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:25:25,760 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:25:48,536 INFO [train.py:996] (3/4) Epoch 10, batch 8600, loss[loss=0.2749, simple_loss=0.3504, pruned_loss=0.09971, over 21820.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3175, pruned_loss=0.08154, over 4275526.61 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:26:58,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1698492.0, ans=0.0 2023-06-24 08:27:16,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698552.0, ans=0.1 2023-06-24 08:27:25,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-24 08:27:28,159 INFO [train.py:996] (3/4) Epoch 10, batch 8650, loss[loss=0.2287, simple_loss=0.3295, pruned_loss=0.06394, over 21627.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3234, pruned_loss=0.08239, over 4275978.64 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:27:46,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1698612.0, ans=0.0 2023-06-24 08:28:14,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-24 08:28:22,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1698792.0, ans=0.0 2023-06-24 08:28:28,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.800e+02 1.041e+03 1.467e+03 2.492e+03, threshold=2.082e+03, percent-clipped=1.0 2023-06-24 08:28:59,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-24 08:28:59,762 INFO [train.py:996] (3/4) Epoch 10, batch 8700, loss[loss=0.2179, simple_loss=0.2796, pruned_loss=0.07808, over 21578.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3137, pruned_loss=0.07873, over 4273267.40 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:29:26,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1698972.0, ans=0.09899494936611666 2023-06-24 08:30:46,986 INFO [train.py:996] (3/4) Epoch 10, batch 8750, loss[loss=0.237, simple_loss=0.2991, pruned_loss=0.0875, over 21835.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3085, pruned_loss=0.0796, over 4281005.31 frames. 
], batch size: 391, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:31:27,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1699332.0, ans=0.2 2023-06-24 08:31:50,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 7.163e+02 1.016e+03 1.538e+03 3.044e+03, threshold=2.032e+03, percent-clipped=7.0 2023-06-24 08:32:05,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1699452.0, ans=0.125 2023-06-24 08:32:32,734 INFO [train.py:996] (3/4) Epoch 10, batch 8800, loss[loss=0.2688, simple_loss=0.3576, pruned_loss=0.09002, over 21610.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3186, pruned_loss=0.08322, over 4279007.31 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:32:33,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1699512.0, ans=0.125 2023-06-24 08:32:34,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1699512.0, ans=0.2 2023-06-24 08:34:18,596 INFO [train.py:996] (3/4) Epoch 10, batch 8850, loss[loss=0.2168, simple_loss=0.2917, pruned_loss=0.07089, over 21404.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3242, pruned_loss=0.08376, over 4279247.88 frames. ], batch size: 194, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:35:17,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.300e+02 6.165e+02 8.058e+02 1.044e+03 1.938e+03, threshold=1.612e+03, percent-clipped=0.0 2023-06-24 08:35:24,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699992.0, ans=0.1 2023-06-24 08:35:26,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1699992.0, ans=0.125 2023-06-24 08:35:33,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1700052.0, ans=0.125 2023-06-24 08:35:59,438 INFO [train.py:996] (3/4) Epoch 10, batch 8900, loss[loss=0.2425, simple_loss=0.3046, pruned_loss=0.09023, over 22042.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.318, pruned_loss=0.08243, over 4279978.25 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:36:44,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1700232.0, ans=0.125 2023-06-24 08:37:43,121 INFO [train.py:996] (3/4) Epoch 10, batch 8950, loss[loss=0.2327, simple_loss=0.2967, pruned_loss=0.08432, over 21479.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3192, pruned_loss=0.08193, over 4271483.98 frames. 
], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:38:03,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1700472.0, ans=0.0 2023-06-24 08:38:48,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1700592.0, ans=0.125 2023-06-24 08:38:56,963 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 1.062e+03 1.597e+03 2.316e+03 4.236e+03, threshold=3.193e+03, percent-clipped=50.0 2023-06-24 08:39:02,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1700592.0, ans=0.125 2023-06-24 08:39:22,932 INFO [train.py:996] (3/4) Epoch 10, batch 9000, loss[loss=0.2267, simple_loss=0.3019, pruned_loss=0.07578, over 21664.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3154, pruned_loss=0.08193, over 4275487.12 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:39:22,932 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 08:39:39,588 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2679, simple_loss=0.3599, pruned_loss=0.08793, over 1796401.00 frames. 2023-06-24 08:39:39,589 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 08:41:21,492 INFO [train.py:996] (3/4) Epoch 10, batch 9050, loss[loss=0.2881, simple_loss=0.3509, pruned_loss=0.1127, over 21403.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3093, pruned_loss=0.07813, over 4275699.64 frames. ], batch size: 509, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:41:23,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1701012.0, ans=0.035 2023-06-24 08:41:56,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1701072.0, ans=0.125 2023-06-24 08:42:31,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1701192.0, ans=0.125 2023-06-24 08:42:37,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 8.915e+02 1.381e+03 2.151e+03 3.467e+03, threshold=2.763e+03, percent-clipped=3.0 2023-06-24 08:42:39,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1701192.0, ans=0.0 2023-06-24 08:42:52,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=8.0 2023-06-24 08:42:55,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1701252.0, ans=0.04949747468305833 2023-06-24 08:43:08,678 INFO [train.py:996] (3/4) Epoch 10, batch 9100, loss[loss=0.2746, simple_loss=0.3589, pruned_loss=0.09517, over 21660.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3127, pruned_loss=0.08047, over 4271607.01 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:43:12,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1701312.0, ans=0.125 2023-06-24 08:44:49,835 INFO [train.py:996] (3/4) Epoch 10, batch 9150, loss[loss=0.245, simple_loss=0.3411, pruned_loss=0.07444, over 21655.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3179, pruned_loss=0.07906, over 4269502.21 frames. 
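
A consistency worth noting across these [train.py:996] records: to display precision, the headline loss is always pruned_loss plus half of simple_loss (batch 8450: 0.5 x 0.3010 + 0.08466 = 0.2352; the validation record above, loss=0.2679, obeys the same relation), i.e. the simple (unpruned) term of the pruned-RNN-T objective enters the printed figure with a fixed 0.5 weight. A sketch of that arithmetic, with illustrative names:

    def headline_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # The printed loss is pruned_loss plus a down-weighted simple loss.
        return simple_loss_scale * simple_loss + pruned_loss

    # Batch 8450 above: 0.5 * 0.3010 + 0.08466 = 0.23516 ~= printed 0.2352.
    assert abs(headline_loss(0.3010, 0.08466) - 0.2352) < 5e-4
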
], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:45:03,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-24 08:45:45,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1701732.0, ans=0.0 2023-06-24 08:45:59,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 7.155e+02 1.028e+03 1.622e+03 3.048e+03, threshold=2.056e+03, percent-clipped=2.0 2023-06-24 08:46:01,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-24 08:46:21,738 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:46:40,280 INFO [train.py:996] (3/4) Epoch 10, batch 9200, loss[loss=0.264, simple_loss=0.3292, pruned_loss=0.09943, over 21818.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3203, pruned_loss=0.07839, over 4265985.56 frames. ], batch size: 118, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:47:01,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701972.0, ans=0.1 2023-06-24 08:48:20,289 INFO [train.py:996] (3/4) Epoch 10, batch 9250, loss[loss=0.2083, simple_loss=0.2748, pruned_loss=0.07091, over 21982.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3226, pruned_loss=0.08154, over 4268979.20 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:48:34,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1702212.0, ans=0.0 2023-06-24 08:48:38,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1702212.0, ans=0.125 2023-06-24 08:48:43,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1702272.0, ans=0.07 2023-06-24 08:48:53,132 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:49:08,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1702332.0, ans=0.04949747468305833 2023-06-24 08:49:19,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-06-24 08:49:20,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1702392.0, ans=0.125 2023-06-24 08:49:21,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.372e+02 7.455e+02 9.438e+02 1.547e+03 2.905e+03, threshold=1.888e+03, percent-clipped=9.0 2023-06-24 08:49:35,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1702452.0, ans=0.125 2023-06-24 08:50:06,165 INFO [train.py:996] (3/4) Epoch 10, batch 9300, loss[loss=0.1978, simple_loss=0.2743, pruned_loss=0.06065, over 21559.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3156, pruned_loss=0.08029, over 4270311.82 frames. 
], batch size: 247, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:50:12,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1702512.0, ans=0.07 2023-06-24 08:51:03,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1702692.0, ans=0.125 2023-06-24 08:51:43,756 INFO [train.py:996] (3/4) Epoch 10, batch 9350, loss[loss=0.2695, simple_loss=0.3523, pruned_loss=0.09337, over 21434.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3225, pruned_loss=0.08161, over 4269665.90 frames. ], batch size: 131, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:51:50,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1702812.0, ans=0.125 2023-06-24 08:51:54,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1702812.0, ans=0.125 2023-06-24 08:51:57,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1702812.0, ans=0.0 2023-06-24 08:52:19,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1702932.0, ans=0.125 2023-06-24 08:52:22,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1702932.0, ans=0.125 2023-06-24 08:53:01,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.560e+02 6.681e+02 9.424e+02 1.664e+03 4.543e+03, threshold=1.885e+03, percent-clipped=14.0 2023-06-24 08:53:05,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1702992.0, ans=0.125 2023-06-24 08:53:17,816 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:53:25,340 INFO [train.py:996] (3/4) Epoch 10, batch 9400, loss[loss=0.1979, simple_loss=0.2669, pruned_loss=0.0644, over 21556.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3266, pruned_loss=0.08246, over 4269906.55 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:53:57,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1703232.0, ans=0.125 2023-06-24 08:54:26,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1703292.0, ans=0.125 2023-06-24 08:54:35,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1703292.0, ans=0.125 2023-06-24 08:54:37,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1703292.0, ans=0.04949747468305833 2023-06-24 08:55:04,510 INFO [train.py:996] (3/4) Epoch 10, batch 9450, loss[loss=0.2256, simple_loss=0.285, pruned_loss=0.08308, over 21451.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3179, pruned_loss=0.08142, over 4266813.31 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:55:19,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=15.0 2023-06-24 08:56:19,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.658e+02 7.388e+02 1.013e+03 1.627e+03 3.415e+03, threshold=2.026e+03, percent-clipped=14.0 2023-06-24 08:56:43,439 INFO [train.py:996] (3/4) Epoch 10, batch 9500, loss[loss=0.1851, simple_loss=0.274, pruned_loss=0.04807, over 21623.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3097, pruned_loss=0.07945, over 4273954.55 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:57:48,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1703892.0, ans=0.0 2023-06-24 08:57:50,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1703892.0, ans=0.0 2023-06-24 08:58:20,278 INFO [train.py:996] (3/4) Epoch 10, batch 9550, loss[loss=0.2421, simple_loss=0.3187, pruned_loss=0.08272, over 21948.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3132, pruned_loss=0.0811, over 4275491.10 frames. ], batch size: 372, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:59:29,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704192.0, ans=0.1 2023-06-24 08:59:33,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.417e+02 6.895e+02 1.023e+03 1.419e+03 2.349e+03, threshold=2.046e+03, percent-clipped=3.0 2023-06-24 08:59:57,568 INFO [train.py:996] (3/4) Epoch 10, batch 9600, loss[loss=0.2085, simple_loss=0.2882, pruned_loss=0.06442, over 21854.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.317, pruned_loss=0.0831, over 4282180.15 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:00:29,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704372.0, ans=0.1 2023-06-24 09:00:34,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1704432.0, ans=0.2 2023-06-24 09:01:07,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1704492.0, ans=0.0 2023-06-24 09:01:37,633 INFO [train.py:996] (3/4) Epoch 10, batch 9650, loss[loss=0.2441, simple_loss=0.3185, pruned_loss=0.08481, over 21630.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3173, pruned_loss=0.08376, over 4288017.62 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:02:07,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-24 09:02:15,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=15.0 2023-06-24 09:02:46,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. 
limit=22.5 2023-06-24 09:02:55,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.745e+02 1.012e+03 1.363e+03 3.649e+03, threshold=2.025e+03, percent-clipped=11.0 2023-06-24 09:03:15,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1704852.0, ans=22.5 2023-06-24 09:03:17,695 INFO [train.py:996] (3/4) Epoch 10, batch 9700, loss[loss=0.2063, simple_loss=0.2861, pruned_loss=0.06328, over 21602.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.322, pruned_loss=0.08454, over 4285662.22 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:03:38,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 09:03:57,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704972.0, ans=0.1 2023-06-24 09:04:00,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1704972.0, ans=0.2 2023-06-24 09:04:06,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1705032.0, ans=0.0 2023-06-24 09:04:22,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-24 09:04:55,675 INFO [train.py:996] (3/4) Epoch 10, batch 9750, loss[loss=0.2571, simple_loss=0.339, pruned_loss=0.08762, over 21829.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3145, pruned_loss=0.08256, over 4281818.52 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:05:07,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1705212.0, ans=0.0 2023-06-24 09:05:07,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1705212.0, ans=0.2 2023-06-24 09:05:19,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705272.0, ans=0.125 2023-06-24 09:05:55,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1705332.0, ans=0.0 2023-06-24 09:05:55,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-24 09:06:10,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.138e+02 7.196e+02 1.029e+03 1.714e+03 4.123e+03, threshold=2.059e+03, percent-clipped=13.0 2023-06-24 09:06:26,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705452.0, ans=0.1 2023-06-24 09:06:32,719 INFO [train.py:996] (3/4) Epoch 10, batch 9800, loss[loss=0.2186, simple_loss=0.2875, pruned_loss=0.07483, over 21796.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3147, pruned_loss=0.08331, over 4280610.62 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:06:42,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-24 09:06:56,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.43 vs. limit=10.0 2023-06-24 09:07:16,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1705632.0, ans=0.1 2023-06-24 09:07:22,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-24 09:07:55,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1705752.0, ans=0.125 2023-06-24 09:08:10,612 INFO [train.py:996] (3/4) Epoch 10, batch 9850, loss[loss=0.2238, simple_loss=0.2823, pruned_loss=0.08266, over 21396.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3114, pruned_loss=0.08262, over 4271604.77 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:08:16,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705812.0, ans=0.125 2023-06-24 09:08:33,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1705872.0, ans=0.125 2023-06-24 09:09:11,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-24 09:09:23,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1705992.0, ans=0.125 2023-06-24 09:09:26,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.595e+02 1.021e+03 1.343e+03 2.731e+03, threshold=2.043e+03, percent-clipped=9.0 2023-06-24 09:09:41,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-24 09:09:47,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1706112.0, ans=0.125 2023-06-24 09:09:49,323 INFO [train.py:996] (3/4) Epoch 10, batch 9900, loss[loss=0.3056, simple_loss=0.3625, pruned_loss=0.1244, over 21319.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3064, pruned_loss=0.08156, over 4269998.51 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:09:51,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706112.0, ans=0.1 2023-06-24 09:10:49,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1706232.0, ans=0.0 2023-06-24 09:10:58,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-24 09:11:02,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1706292.0, ans=0.2 2023-06-24 09:11:12,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1706352.0, ans=0.05 2023-06-24 09:11:27,325 INFO [train.py:996] (3/4) Epoch 10, batch 9950, loss[loss=0.2016, simple_loss=0.2654, pruned_loss=0.06895, over 21498.00 frames. 
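
The ubiquitous [scaling.py:182] entries track ScheduledFloat hyper-parameters: dropout rates, balancer probabilities, and skip rates that are functions of batch_count rather than constants, which is why each entry prints the current batch_count next to the resolved value ans. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between breakpoints (the class below is an illustration, not the scaling.py implementation):

    class PiecewiseLinearFloat:
        """A value defined by (batch_count, value) breakpoints, linearly
        interpolated in between and clamped outside the breakpoint range."""

        def __init__(self, *points):
            self.points = sorted(points)  # [(batch_count, value), ...]

        def value_at(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # e.g. a skip-rate that decays from 0.5 to 0.0 over the first 20k batches:
    skip_rate = PiecewiseLinearFloat((0.0, 0.5), (20000.0, 0.0))
    assert abs(skip_rate.value_at(5000.0) - 0.375) < 1e-12
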
], tot_loss[loss=0.2354, simple_loss=0.307, pruned_loss=0.08188, over 4254207.78 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:11:45,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706412.0, ans=0.1 2023-06-24 09:11:52,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1706472.0, ans=0.125 2023-06-24 09:12:18,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-24 09:12:33,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1706592.0, ans=0.125 2023-06-24 09:12:33,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1706592.0, ans=0.09899494936611666 2023-06-24 09:12:44,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 7.241e+02 1.069e+03 1.520e+03 2.876e+03, threshold=2.138e+03, percent-clipped=9.0 2023-06-24 09:12:46,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1706592.0, ans=0.125 2023-06-24 09:13:12,889 INFO [train.py:996] (3/4) Epoch 10, batch 10000, loss[loss=0.2513, simple_loss=0.3149, pruned_loss=0.09381, over 21299.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3015, pruned_loss=0.07994, over 4255456.69 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:13:25,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 09:13:49,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1706772.0, ans=0.125 2023-06-24 09:14:17,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1706892.0, ans=0.125 2023-06-24 09:14:35,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1706952.0, ans=0.035 2023-06-24 09:14:54,903 INFO [train.py:996] (3/4) Epoch 10, batch 10050, loss[loss=0.2355, simple_loss=0.3018, pruned_loss=0.08464, over 21601.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3051, pruned_loss=0.0815, over 4257183.87 frames. ], batch size: 415, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:15:27,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1707072.0, ans=0.2 2023-06-24 09:15:48,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1707132.0, ans=0.125 2023-06-24 09:15:52,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.66 vs. 
limit=15.0 2023-06-24 09:16:11,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.781e+02 9.769e+02 1.554e+03 3.220e+03, threshold=1.954e+03, percent-clipped=12.0 2023-06-24 09:16:26,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1707252.0, ans=0.125 2023-06-24 09:16:30,114 INFO [train.py:996] (3/4) Epoch 10, batch 10100, loss[loss=0.203, simple_loss=0.275, pruned_loss=0.06553, over 21432.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3041, pruned_loss=0.07975, over 4258049.98 frames. ], batch size: 211, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:18:13,780 INFO [train.py:996] (3/4) Epoch 10, batch 10150, loss[loss=0.2482, simple_loss=0.3088, pruned_loss=0.09383, over 21813.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3105, pruned_loss=0.08247, over 4268431.73 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:18:51,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1707732.0, ans=0.125 2023-06-24 09:18:59,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1707732.0, ans=0.07 2023-06-24 09:19:25,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.106e+02 9.650e+02 1.431e+03 2.478e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-24 09:19:53,995 INFO [train.py:996] (3/4) Epoch 10, batch 10200, loss[loss=0.1966, simple_loss=0.2909, pruned_loss=0.05109, over 21790.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3088, pruned_loss=0.08068, over 4269927.67 frames. ], batch size: 333, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:20:08,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1707912.0, ans=0.0 2023-06-24 09:20:18,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1707972.0, ans=0.125 2023-06-24 09:20:31,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708032.0, ans=0.125 2023-06-24 09:20:46,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.00 vs. limit=10.0 2023-06-24 09:21:33,126 INFO [train.py:996] (3/4) Epoch 10, batch 10250, loss[loss=0.1645, simple_loss=0.2427, pruned_loss=0.04309, over 21305.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.306, pruned_loss=0.07583, over 4263800.89 frames. 
], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:21:33,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1708212.0, ans=0.0 2023-06-24 09:21:33,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1708212.0, ans=0.125 2023-06-24 09:21:50,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1708212.0, ans=0.0 2023-06-24 09:22:20,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1708332.0, ans=10.0 2023-06-24 09:22:34,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-24 09:22:39,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1708392.0, ans=0.2 2023-06-24 09:22:46,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.201e+02 5.691e+02 8.251e+02 1.364e+03 2.412e+03, threshold=1.650e+03, percent-clipped=9.0 2023-06-24 09:23:21,829 INFO [train.py:996] (3/4) Epoch 10, batch 10300, loss[loss=0.3135, simple_loss=0.3971, pruned_loss=0.1149, over 21378.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3073, pruned_loss=0.07701, over 4262822.97 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:23:27,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1708512.0, ans=0.125 2023-06-24 09:24:04,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1708632.0, ans=0.125 2023-06-24 09:24:51,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708752.0, ans=0.1 2023-06-24 09:25:04,350 INFO [train.py:996] (3/4) Epoch 10, batch 10350, loss[loss=0.2181, simple_loss=0.2967, pruned_loss=0.06974, over 21683.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3091, pruned_loss=0.0768, over 4266407.61 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:26:01,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1708932.0, ans=10.0 2023-06-24 09:26:02,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1708992.0, ans=0.125 2023-06-24 09:26:21,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 6.885e+02 1.073e+03 1.600e+03 3.112e+03, threshold=2.146e+03, percent-clipped=24.0 2023-06-24 09:26:31,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1709052.0, ans=0.125 2023-06-24 09:26:41,137 INFO [train.py:996] (3/4) Epoch 10, batch 10400, loss[loss=0.237, simple_loss=0.3194, pruned_loss=0.07732, over 21531.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3031, pruned_loss=0.07566, over 4270705.54 frames. 
], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:26:47,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1709112.0, ans=0.07 2023-06-24 09:27:05,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1709172.0, ans=0.2 2023-06-24 09:27:48,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1709292.0, ans=0.0 2023-06-24 09:28:04,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1709352.0, ans=0.0 2023-06-24 09:28:17,467 INFO [train.py:996] (3/4) Epoch 10, batch 10450, loss[loss=0.2638, simple_loss=0.3506, pruned_loss=0.08853, over 21632.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3075, pruned_loss=0.07829, over 4266398.86 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:29:22,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-24 09:29:37,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.690e+02 1.222e+03 1.916e+03 3.478e+03, threshold=2.445e+03, percent-clipped=16.0 2023-06-24 09:29:42,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1709652.0, ans=0.125 2023-06-24 09:29:51,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-24 09:29:56,543 INFO [train.py:996] (3/4) Epoch 10, batch 10500, loss[loss=0.2099, simple_loss=0.2795, pruned_loss=0.07018, over 21745.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.308, pruned_loss=0.07719, over 4271318.30 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:30:18,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1709772.0, ans=0.0 2023-06-24 09:30:37,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1709772.0, ans=0.07 2023-06-24 09:30:47,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1709832.0, ans=0.125 2023-06-24 09:31:18,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1709952.0, ans=0.0 2023-06-24 09:31:26,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1709952.0, ans=0.1 2023-06-24 09:31:35,601 INFO [train.py:996] (3/4) Epoch 10, batch 10550, loss[loss=0.2144, simple_loss=0.2797, pruned_loss=0.07455, over 21882.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3014, pruned_loss=0.0762, over 4259391.35 frames. 
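
The [scaling.py:962] Whitening entries compare a per-module anisotropy metric of layer activations (computed per group; see the num_groups field) against that module's whitening limit; the mechanism nudges activation covariances toward a multiple of the identity. A rough sketch of one plausible such metric, assuming the trace-ratio form for a single group (the exact scaling.py formula may differ in detail):

    import torch

    def whitening_metric(x):
        """Anisotropy of the feature covariance of x: (frames, channels).
        Equals 1.0 when the covariance is a multiple of the identity and
        grows as the distribution becomes less 'white'."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]    # (C, C) sample covariance
        num_channels = cov.shape[0]
        return (num_channels * (cov * cov).sum() / cov.trace() ** 2).item()

    white = torch.randn(4000, 256)
    skewed = white * torch.linspace(0.1, 3.0, 256)  # per-channel rescaling
    assert whitening_metric(white) < whitening_metric(skewed)
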
], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:31:55,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1710072.0, ans=0.1 2023-06-24 09:32:01,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1710072.0, ans=0.0 2023-06-24 09:32:37,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1710192.0, ans=0.125 2023-06-24 09:32:54,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0 2023-06-24 09:32:55,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.399e+02 7.494e+02 1.006e+03 1.488e+03 3.263e+03, threshold=2.013e+03, percent-clipped=2.0 2023-06-24 09:33:15,633 INFO [train.py:996] (3/4) Epoch 10, batch 10600, loss[loss=0.1723, simple_loss=0.2429, pruned_loss=0.05079, over 21775.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2971, pruned_loss=0.07499, over 4261533.68 frames. ], batch size: 124, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:33:36,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1710372.0, ans=0.2 2023-06-24 09:34:22,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1710492.0, ans=0.125 2023-06-24 09:34:31,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1710492.0, ans=0.0 2023-06-24 09:34:40,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-24 09:34:59,837 INFO [train.py:996] (3/4) Epoch 10, batch 10650, loss[loss=0.2356, simple_loss=0.2991, pruned_loss=0.08608, over 20061.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2997, pruned_loss=0.07358, over 4259176.10 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:35:03,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1710612.0, ans=0.125 2023-06-24 09:35:27,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1710672.0, ans=0.2 2023-06-24 09:35:27,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1710672.0, ans=0.125 2023-06-24 09:35:43,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1710672.0, ans=0.0 2023-06-24 09:35:57,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1710732.0, ans=0.125 2023-06-24 09:36:16,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 7.408e+02 1.241e+03 1.885e+03 3.956e+03, threshold=2.481e+03, percent-clipped=17.0 2023-06-24 09:36:45,443 INFO [train.py:996] (3/4) Epoch 10, batch 10700, loss[loss=0.2737, simple_loss=0.3489, pruned_loss=0.09926, over 21744.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.298, pruned_loss=0.07364, over 4248483.24 frames. 
], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:37:45,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1711092.0, ans=0.125 2023-06-24 09:38:07,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1711152.0, ans=0.0 2023-06-24 09:38:12,386 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:38:33,049 INFO [train.py:996] (3/4) Epoch 10, batch 10750, loss[loss=0.2358, simple_loss=0.3104, pruned_loss=0.08054, over 21319.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3079, pruned_loss=0.07776, over 4255679.24 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:38:44,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1711212.0, ans=0.0 2023-06-24 09:39:50,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 7.385e+02 1.038e+03 1.565e+03 3.899e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 09:40:01,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-24 09:40:02,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1711452.0, ans=0.125 2023-06-24 09:40:02,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711452.0, ans=0.1 2023-06-24 09:40:17,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1711452.0, ans=0.125 2023-06-24 09:40:20,301 INFO [train.py:996] (3/4) Epoch 10, batch 10800, loss[loss=0.2385, simple_loss=0.308, pruned_loss=0.08455, over 20028.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3112, pruned_loss=0.0775, over 4259161.90 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:40:41,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711572.0, ans=0.1 2023-06-24 09:41:26,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-24 09:41:55,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711752.0, ans=0.1 2023-06-24 09:41:59,443 INFO [train.py:996] (3/4) Epoch 10, batch 10850, loss[loss=0.1932, simple_loss=0.2703, pruned_loss=0.05803, over 21640.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3117, pruned_loss=0.07818, over 4257680.28 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:42:06,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1711812.0, ans=0.0 2023-06-24 09:42:17,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. 
limit=22.5 2023-06-24 09:42:22,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1711872.0, ans=0.2 2023-06-24 09:42:59,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1711992.0, ans=0.0 2023-06-24 09:43:18,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 6.562e+02 9.748e+02 1.395e+03 3.143e+03, threshold=1.950e+03, percent-clipped=4.0 2023-06-24 09:43:18,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1712052.0, ans=0.0 2023-06-24 09:43:31,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1712052.0, ans=0.125 2023-06-24 09:43:38,790 INFO [train.py:996] (3/4) Epoch 10, batch 10900, loss[loss=0.2329, simple_loss=0.3361, pruned_loss=0.0649, over 19788.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3063, pruned_loss=0.076, over 4262448.47 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:43:56,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.00 vs. limit=5.0 2023-06-24 09:44:42,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-24 09:44:56,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1712352.0, ans=0.05 2023-06-24 09:45:03,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1712352.0, ans=0.125 2023-06-24 09:45:18,157 INFO [train.py:996] (3/4) Epoch 10, batch 10950, loss[loss=0.195, simple_loss=0.2633, pruned_loss=0.0634, over 21827.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3015, pruned_loss=0.07382, over 4266241.71 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:45:53,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1712532.0, ans=0.125 2023-06-24 09:46:28,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1712592.0, ans=0.2 2023-06-24 09:46:35,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.641e+02 1.100e+03 1.576e+03 3.666e+03, threshold=2.199e+03, percent-clipped=18.0 2023-06-24 09:46:36,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1712652.0, ans=0.07 2023-06-24 09:46:56,691 INFO [train.py:996] (3/4) Epoch 10, batch 11000, loss[loss=0.2084, simple_loss=0.2754, pruned_loss=0.07071, over 21607.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3017, pruned_loss=0.07435, over 4262326.14 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:47:06,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712712.0, ans=0.1 2023-06-24 09:47:13,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=22.5 2023-06-24 09:47:17,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1712772.0, ans=0.0 2023-06-24 09:47:57,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1712892.0, ans=0.0 2023-06-24 09:48:16,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1712952.0, ans=0.0 2023-06-24 09:48:35,808 INFO [train.py:996] (3/4) Epoch 10, batch 11050, loss[loss=0.2076, simple_loss=0.2521, pruned_loss=0.08156, over 20094.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2987, pruned_loss=0.0759, over 4264637.47 frames. ], batch size: 704, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:49:52,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.589e+02 8.856e+02 1.146e+03 2.849e+03, threshold=1.771e+03, percent-clipped=5.0 2023-06-24 09:49:54,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-24 09:50:13,201 INFO [train.py:996] (3/4) Epoch 10, batch 11100, loss[loss=0.2295, simple_loss=0.2893, pruned_loss=0.08485, over 21754.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2988, pruned_loss=0.07651, over 4253692.76 frames. ], batch size: 112, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:50:21,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-24 09:50:23,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1713312.0, ans=0.125 2023-06-24 09:51:41,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1713552.0, ans=0.025 2023-06-24 09:51:54,340 INFO [train.py:996] (3/4) Epoch 10, batch 11150, loss[loss=0.2192, simple_loss=0.2997, pruned_loss=0.06934, over 21290.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2978, pruned_loss=0.07672, over 4250099.46 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:51:54,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1713612.0, ans=0.125 2023-06-24 09:52:11,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.68 vs. limit=10.0 2023-06-24 09:52:23,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1713672.0, ans=0.0 2023-06-24 09:52:56,094 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:52:56,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1713792.0, ans=0.125 2023-06-24 09:53:00,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1713792.0, ans=0.2 2023-06-24 09:53:07,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.64 vs. 
limit=22.5 2023-06-24 09:53:11,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 6.370e+02 9.682e+02 1.600e+03 2.878e+03, threshold=1.936e+03, percent-clipped=17.0 2023-06-24 09:53:33,206 INFO [train.py:996] (3/4) Epoch 10, batch 11200, loss[loss=0.1996, simple_loss=0.2691, pruned_loss=0.06505, over 21608.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2952, pruned_loss=0.07636, over 4249699.07 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:54:26,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1714032.0, ans=0.125 2023-06-24 09:54:50,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1714152.0, ans=0.0 2023-06-24 09:54:56,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1714152.0, ans=0.0 2023-06-24 09:55:12,169 INFO [train.py:996] (3/4) Epoch 10, batch 11250, loss[loss=0.2334, simple_loss=0.3075, pruned_loss=0.07964, over 21196.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2945, pruned_loss=0.07631, over 4258249.59 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:55:16,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1714212.0, ans=0.1 2023-06-24 09:55:40,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-24 09:56:26,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.030e+02 7.779e+02 1.066e+03 3.071e+03, threshold=1.556e+03, percent-clipped=6.0 2023-06-24 09:56:47,849 INFO [train.py:996] (3/4) Epoch 10, batch 11300, loss[loss=0.2021, simple_loss=0.2825, pruned_loss=0.06088, over 21275.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2963, pruned_loss=0.07673, over 4262518.48 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:57:04,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1714572.0, ans=0.125 2023-06-24 09:57:19,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1714632.0, ans=0.125 2023-06-24 09:58:15,865 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:58:28,624 INFO [train.py:996] (3/4) Epoch 10, batch 11350, loss[loss=0.2156, simple_loss=0.2945, pruned_loss=0.06838, over 21710.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2974, pruned_loss=0.07624, over 4257047.37 frames. 
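
In each [train.py:996] record, loss[... over N frames] refers to the current batch, while tot_loss[... over M frames] appears to be an accumulated frame-weighted average, hence the ~4.26M frame counts beside tot_loss versus ~21k per batch. Weighting by frames rather than by batch keeps utterances of different lengths comparable; a sketch with illustrative names, fed with per-batch figures from nearby records:

    class FrameWeightedTracker:
        """Running frame-weighted loss average, as in the tot_loss fields."""

        def __init__(self):
            self.weighted_sum = 0.0   # sum of loss * frames
            self.frames = 0.0

        def update(self, loss, num_frames):
            self.weighted_sum += loss * num_frames
            self.frames += num_frames

        @property
        def value(self):
            return self.weighted_sum / max(self.frames, 1.0)

    tot = FrameWeightedTracker()
    tot.update(0.2156, 21710.0)  # batch 11350's per-batch loss and frames
    tot.update(0.2320, 21837.0)  # batch 11400's
    print(f"tot_loss={tot.value:.4f} over {tot.frames:.0f} frames")
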
], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:58:42,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1714812.0, ans=0.125 2023-06-24 09:59:16,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714932.0, ans=0.1 2023-06-24 09:59:19,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1714932.0, ans=0.0 2023-06-24 09:59:21,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714932.0, ans=0.1 2023-06-24 09:59:21,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1714932.0, ans=0.0 2023-06-24 09:59:23,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1714932.0, ans=0.2 2023-06-24 09:59:53,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.336e+02 5.946e+02 8.078e+02 1.222e+03 2.329e+03, threshold=1.616e+03, percent-clipped=17.0 2023-06-24 10:00:05,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1715052.0, ans=0.125 2023-06-24 10:00:10,824 INFO [train.py:996] (3/4) Epoch 10, batch 11400, loss[loss=0.232, simple_loss=0.3173, pruned_loss=0.07338, over 21837.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3034, pruned_loss=0.07934, over 4266152.15 frames. ], batch size: 317, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:01:05,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-24 10:01:09,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1715232.0, ans=0.025 2023-06-24 10:01:09,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1715232.0, ans=0.125 2023-06-24 10:01:23,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715292.0, ans=0.1 2023-06-24 10:01:25,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715292.0, ans=0.1 2023-06-24 10:01:47,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1715352.0, ans=0.125 2023-06-24 10:01:51,719 INFO [train.py:996] (3/4) Epoch 10, batch 11450, loss[loss=0.2836, simple_loss=0.3592, pruned_loss=0.104, over 21705.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3052, pruned_loss=0.07881, over 4258906.70 frames. 
], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:02:11,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1715472.0, ans=0.0 2023-06-24 10:02:59,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1715592.0, ans=0.125 2023-06-24 10:03:16,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.172e+02 6.928e+02 8.683e+02 1.191e+03 2.555e+03, threshold=1.737e+03, percent-clipped=7.0 2023-06-24 10:03:33,679 INFO [train.py:996] (3/4) Epoch 10, batch 11500, loss[loss=0.1943, simple_loss=0.2968, pruned_loss=0.04587, over 21799.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3072, pruned_loss=0.0793, over 4264413.11 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:04:11,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1715772.0, ans=0.125 2023-06-24 10:04:45,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1715892.0, ans=0.0 2023-06-24 10:05:05,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1715952.0, ans=0.0 2023-06-24 10:05:16,078 INFO [train.py:996] (3/4) Epoch 10, batch 11550, loss[loss=0.2443, simple_loss=0.3426, pruned_loss=0.07302, over 21838.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3123, pruned_loss=0.07888, over 4271836.07 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:05:34,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1716012.0, ans=0.0 2023-06-24 10:06:36,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.588e+02 7.008e+02 1.217e+03 2.057e+03 3.971e+03, threshold=2.435e+03, percent-clipped=35.0 2023-06-24 10:06:48,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-24 10:07:02,453 INFO [train.py:996] (3/4) Epoch 10, batch 11600, loss[loss=0.2818, simple_loss=0.3657, pruned_loss=0.09893, over 21482.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3286, pruned_loss=0.08108, over 4272572.14 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:07:06,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-24 10:07:09,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1716312.0, ans=0.0 2023-06-24 10:07:26,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1716372.0, ans=0.125 2023-06-24 10:08:35,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-24 10:08:37,727 INFO [train.py:996] (3/4) Epoch 10, batch 11650, loss[loss=0.2291, simple_loss=0.2934, pruned_loss=0.08241, over 15327.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3347, pruned_loss=0.08161, over 4264597.35 frames. 
], batch size: 60, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:08:41,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-24 10:08:44,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1716612.0, ans=0.0 2023-06-24 10:09:10,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.56 vs. limit=10.0 2023-06-24 10:09:13,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1716732.0, ans=0.125 2023-06-24 10:09:26,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1716732.0, ans=0.0 2023-06-24 10:09:51,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.606e+02 1.019e+03 1.524e+03 3.241e+03, threshold=2.038e+03, percent-clipped=8.0 2023-06-24 10:10:16,858 INFO [train.py:996] (3/4) Epoch 10, batch 11700, loss[loss=0.231, simple_loss=0.2993, pruned_loss=0.08131, over 21764.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3266, pruned_loss=0.08113, over 4252828.31 frames. ], batch size: 112, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:10:44,297 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:10:52,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1717032.0, ans=0.1 2023-06-24 10:10:55,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1717032.0, ans=0.0 2023-06-24 10:11:08,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1717092.0, ans=0.0 2023-06-24 10:11:09,815 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:11:26,992 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:11:48,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-24 10:11:49,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1717152.0, ans=0.125 2023-06-24 10:11:50,860 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.508e-03 2023-06-24 10:11:55,171 INFO [train.py:996] (3/4) Epoch 10, batch 11750, loss[loss=0.237, simple_loss=0.3164, pruned_loss=0.07883, over 21843.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3173, pruned_loss=0.08061, over 4255442.61 frames. 
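Throughout this run, the [optim.py:471] records report five grad-norm statistics followed by a clipping threshold, and the threshold is consistently 2.0 times the middle value (e.g. threshold=2.038e+03 = 2.0 x 1.019e+03 in the record just above), i.e. Clipping_scale times a running median. A minimal sketch of that reading, assuming the five numbers are min/25%/median/75%/max over a recent window of batch gradient norms; the class name and window size below are illustrative, not icefall's optim.py:

```python
from collections import deque

import torch


class AdaptiveGradClipper:
    """Hypothetical sketch: clip at clipping_scale * median of recent grad norms."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch gradient norms

    def __call__(self, model: torch.nn.Module) -> float:
        params = [p for p in model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        hist = torch.tensor(list(self.norms))
        # min / 25% / median / 75% / max -- the five "grad-norm quartiles" in the log
        q = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # e.g. 2.0 * 1.019e+03 = 2.038e+03
        if norm > threshold:
            for p in params:
                p.grad.mul_(threshold / norm)  # scale gradients down to the threshold
        return norm
```

Under this reading, "percent-clipped" would be the share of batches in the window whose norm exceeded the threshold, which matches its occasional 0.0 values when the max quartile sits below 2x the median.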
], batch size: 124, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:12:34,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717332.0, ans=0.1 2023-06-24 10:12:57,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1717392.0, ans=0.0 2023-06-24 10:12:58,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-24 10:13:15,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 6.168e+02 8.880e+02 1.254e+03 3.045e+03, threshold=1.776e+03, percent-clipped=3.0 2023-06-24 10:13:28,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1717452.0, ans=10.0 2023-06-24 10:13:34,939 INFO [train.py:996] (3/4) Epoch 10, batch 11800, loss[loss=0.2485, simple_loss=0.318, pruned_loss=0.08947, over 21324.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3186, pruned_loss=0.08293, over 4265011.54 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:13:59,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1717572.0, ans=0.125 2023-06-24 10:14:22,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1717632.0, ans=0.035 2023-06-24 10:15:00,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1717752.0, ans=0.0 2023-06-24 10:15:19,728 INFO [train.py:996] (3/4) Epoch 10, batch 11850, loss[loss=0.2372, simple_loss=0.3234, pruned_loss=0.07548, over 21831.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3196, pruned_loss=0.08141, over 4270521.98 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:15:34,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1717872.0, ans=0.0 2023-06-24 10:16:41,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.048e+02 1.109e+03 1.865e+03 3.176e+03, threshold=2.218e+03, percent-clipped=24.0 2023-06-24 10:17:01,674 INFO [train.py:996] (3/4) Epoch 10, batch 11900, loss[loss=0.197, simple_loss=0.2869, pruned_loss=0.05351, over 21680.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3216, pruned_loss=0.07847, over 4276258.63 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:17:40,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.47 vs. limit=15.0 2023-06-24 10:18:21,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1718352.0, ans=0.0 2023-06-24 10:18:42,196 INFO [train.py:996] (3/4) Epoch 10, batch 11950, loss[loss=0.1974, simple_loss=0.2937, pruned_loss=0.05054, over 21785.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3226, pruned_loss=0.07525, over 4267090.90 frames. 
], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:19:20,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1718532.0, ans=0.125 2023-06-24 10:19:48,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1718592.0, ans=0.125 2023-06-24 10:19:48,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1718592.0, ans=0.125 2023-06-24 10:19:48,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1718592.0, ans=0.0 2023-06-24 10:20:07,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.251e+02 6.812e+02 1.209e+03 1.807e+03 3.979e+03, threshold=2.418e+03, percent-clipped=18.0 2023-06-24 10:20:21,812 INFO [train.py:996] (3/4) Epoch 10, batch 12000, loss[loss=0.2618, simple_loss=0.3134, pruned_loss=0.1051, over 21574.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3163, pruned_loss=0.07442, over 4260838.40 frames. ], batch size: 414, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:20:21,812 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 10:20:37,784 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2579, simple_loss=0.3537, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-24 10:20:37,785 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 10:20:44,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1718712.0, ans=0.2 2023-06-24 10:22:17,050 INFO [train.py:996] (3/4) Epoch 10, batch 12050, loss[loss=0.2507, simple_loss=0.3152, pruned_loss=0.09307, over 21501.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3118, pruned_loss=0.07623, over 4256370.05 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:22:46,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1719072.0, ans=0.125 2023-06-24 10:22:50,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1719072.0, ans=0.2 2023-06-24 10:23:44,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 7.636e+02 1.060e+03 1.403e+03 2.653e+03, threshold=2.120e+03, percent-clipped=2.0 2023-06-24 10:24:03,742 INFO [train.py:996] (3/4) Epoch 10, batch 12100, loss[loss=0.2061, simple_loss=0.2851, pruned_loss=0.06353, over 19991.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3163, pruned_loss=0.07992, over 4263688.85 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:24:05,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1719312.0, ans=0.125 2023-06-24 10:24:22,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1719312.0, ans=0.0 2023-06-24 10:25:37,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-24 10:25:53,260 INFO [train.py:996] (3/4) Epoch 10, batch 12150, loss[loss=0.1992, simple_loss=0.3201, pruned_loss=0.03914, over 19788.00 frames. 
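The ScheduledFloat records dominate this log: dropout probabilities, skip rates, bypass scales, and balancer limits are not constants but functions of the global batch count, and each record prints the current value (ans=...) for one of them. A hedged sketch of that behaviour, assuming a piecewise-linear schedule over (batch_count, value) knots; the class name and the knot values below are illustrative, not taken from scaling.py:

```python
class PiecewiseScheduledFloat:
    """Sketch of the behaviour suggested by the ScheduledFloat log lines:
    a float hyperparameter that is a piecewise-linear function of the
    global batch count."""

    def __init__(self, *points):
        self.points = sorted(points)  # (batch_count, value) knots
        self.batch_count = 0.0        # updated by the training loop

    def value(self) -> float:
        x, pts = self.batch_count, self.points
        if x <= pts[0][0]:
            return float(pts[0][1])
        if x >= pts[-1][0]:
            return float(pts[-1][1])
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# e.g. a skip rate annealed from 0.2 to 0.0 over the first 20k batches
# (hypothetical knots, chosen only to reproduce a late-training value):
skip = PiecewiseScheduledFloat((0.0, 0.2), (20000.0, 0.0))
skip.batch_count = 1718592.0  # a batch_count from the records above
print(skip.value())           # -> 0.0, as in "ff2_skip_rate ... ans=0.0"
```

This would explain why, 1.7M batches into the run, every skip-rate record above reads ans=0.0 while the dropout records hold steady at ans=0.1: each parameter has long since reached its final knot.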
], tot_loss[loss=0.2375, simple_loss=0.3175, pruned_loss=0.07873, over 4262939.74 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:26:05,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1719612.0, ans=0.1 2023-06-24 10:26:31,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1719672.0, ans=0.125 2023-06-24 10:26:35,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1719732.0, ans=0.125 2023-06-24 10:26:37,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1719732.0, ans=0.1 2023-06-24 10:26:38,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1719732.0, ans=0.09899494936611666 2023-06-24 10:26:39,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1719732.0, ans=0.0 2023-06-24 10:26:50,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1719792.0, ans=0.0 2023-06-24 10:27:22,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 7.316e+02 1.017e+03 1.333e+03 3.987e+03, threshold=2.033e+03, percent-clipped=9.0 2023-06-24 10:27:32,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1719912.0, ans=0.0 2023-06-24 10:27:34,125 INFO [train.py:996] (3/4) Epoch 10, batch 12200, loss[loss=0.2057, simple_loss=0.2637, pruned_loss=0.07387, over 21201.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3148, pruned_loss=0.07776, over 4258839.98 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:28:27,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 10:28:48,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1720092.0, ans=0.0 2023-06-24 10:28:48,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1720092.0, ans=0.0 2023-06-24 10:29:13,849 INFO [train.py:996] (3/4) Epoch 10, batch 12250, loss[loss=0.1721, simple_loss=0.2511, pruned_loss=0.04655, over 21629.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3052, pruned_loss=0.07465, over 4261346.91 frames. ], batch size: 230, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:29:30,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-24 10:29:31,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720212.0, ans=0.1 2023-06-24 10:29:46,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.87 vs. 
limit=15.0 2023-06-24 10:29:56,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1720332.0, ans=0.1 2023-06-24 10:30:31,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720452.0, ans=0.1 2023-06-24 10:30:33,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1720452.0, ans=0.125 2023-06-24 10:30:36,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-24 10:30:37,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.057e+02 9.242e+02 1.403e+03 3.346e+03, threshold=1.848e+03, percent-clipped=10.0 2023-06-24 10:30:40,986 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:30:48,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1720452.0, ans=0.125 2023-06-24 10:30:52,740 INFO [train.py:996] (3/4) Epoch 10, batch 12300, loss[loss=0.1727, simple_loss=0.2473, pruned_loss=0.04903, over 21170.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2983, pruned_loss=0.07029, over 4264745.08 frames. ], batch size: 143, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:31:08,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1720512.0, ans=0.125 2023-06-24 10:31:31,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-24 10:32:18,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1720752.0, ans=0.125 2023-06-24 10:32:19,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-24 10:32:28,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1720752.0, ans=0.2 2023-06-24 10:32:30,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-24 10:32:38,234 INFO [train.py:996] (3/4) Epoch 10, batch 12350, loss[loss=0.2485, simple_loss=0.3371, pruned_loss=0.07998, over 21856.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3024, pruned_loss=0.07116, over 4272032.68 frames. 
], batch size: 316, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:32:54,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1720872.0, ans=0.125 2023-06-24 10:33:17,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1720932.0, ans=10.0 2023-06-24 10:33:56,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 6.849e+02 9.475e+02 1.484e+03 4.503e+03, threshold=1.895e+03, percent-clipped=12.0 2023-06-24 10:34:17,385 INFO [train.py:996] (3/4) Epoch 10, batch 12400, loss[loss=0.2247, simple_loss=0.347, pruned_loss=0.05122, over 20786.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3049, pruned_loss=0.07348, over 4273122.08 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:34:49,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1721172.0, ans=0.125 2023-06-24 10:35:08,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1721292.0, ans=0.125 2023-06-24 10:35:32,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1721292.0, ans=0.0 2023-06-24 10:35:56,473 INFO [train.py:996] (3/4) Epoch 10, batch 12450, loss[loss=0.2542, simple_loss=0.3261, pruned_loss=0.09117, over 21347.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3083, pruned_loss=0.07665, over 4272125.22 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:36:30,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721472.0, ans=0.1 2023-06-24 10:36:46,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-24 10:36:47,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1721532.0, ans=0.125 2023-06-24 10:36:49,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1721532.0, ans=0.125 2023-06-24 10:37:12,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1721592.0, ans=0.0 2023-06-24 10:37:26,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 8.062e+02 1.038e+03 1.545e+03 2.621e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 10:37:41,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1721712.0, ans=0.125 2023-06-24 10:37:42,438 INFO [train.py:996] (3/4) Epoch 10, batch 12500, loss[loss=0.2426, simple_loss=0.3426, pruned_loss=0.07129, over 21604.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3182, pruned_loss=0.0801, over 4276998.45 frames. 
], batch size: 230, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:37:59,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1721772.0, ans=0.0 2023-06-24 10:38:00,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1721772.0, ans=0.025 2023-06-24 10:38:11,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.63 vs. limit=10.0 2023-06-24 10:39:12,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721952.0, ans=0.1 2023-06-24 10:39:12,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.41 vs. limit=15.0 2023-06-24 10:39:23,789 INFO [train.py:996] (3/4) Epoch 10, batch 12550, loss[loss=0.1864, simple_loss=0.2553, pruned_loss=0.05872, over 19904.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3244, pruned_loss=0.08279, over 4276656.32 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:40:43,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-24 10:40:48,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.945e+02 6.642e+02 8.862e+02 1.444e+03 2.963e+03, threshold=1.772e+03, percent-clipped=6.0 2023-06-24 10:40:58,108 INFO [train.py:996] (3/4) Epoch 10, batch 12600, loss[loss=0.25, simple_loss=0.3455, pruned_loss=0.07719, over 21601.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.323, pruned_loss=0.081, over 4278630.39 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:41:01,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.19 vs. limit=10.0 2023-06-24 10:42:22,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1722552.0, ans=0.2 2023-06-24 10:42:36,259 INFO [train.py:996] (3/4) Epoch 10, batch 12650, loss[loss=0.233, simple_loss=0.306, pruned_loss=0.08006, over 21911.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.316, pruned_loss=0.0769, over 4286129.76 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:43:29,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1722732.0, ans=0.2 2023-06-24 10:43:47,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1722792.0, ans=0.5 2023-06-24 10:43:49,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-24 10:43:50,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722792.0, ans=0.1 2023-06-24 10:44:06,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 7.127e+02 1.026e+03 1.420e+03 2.601e+03, threshold=2.052e+03, percent-clipped=16.0 2023-06-24 10:44:16,309 INFO [train.py:996] (3/4) Epoch 10, batch 12700, loss[loss=0.2732, simple_loss=0.3356, pruned_loss=0.1054, over 21460.00 frames. 
], tot_loss[loss=0.2384, simple_loss=0.3162, pruned_loss=0.08029, over 4288576.94 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:44:22,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1722912.0, ans=0.2 2023-06-24 10:44:53,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1722972.0, ans=0.125 2023-06-24 10:44:58,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1723032.0, ans=0.1 2023-06-24 10:45:12,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-24 10:45:16,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1723032.0, ans=0.2 2023-06-24 10:45:54,644 INFO [train.py:996] (3/4) Epoch 10, batch 12750, loss[loss=0.2204, simple_loss=0.2765, pruned_loss=0.08211, over 20144.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3169, pruned_loss=0.08039, over 4284965.95 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:47:04,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1723392.0, ans=0.95 2023-06-24 10:47:06,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1723392.0, ans=0.2 2023-06-24 10:47:18,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 6.526e+02 8.799e+02 1.342e+03 3.585e+03, threshold=1.760e+03, percent-clipped=6.0 2023-06-24 10:47:33,305 INFO [train.py:996] (3/4) Epoch 10, batch 12800, loss[loss=0.3067, simple_loss=0.3594, pruned_loss=0.127, over 21587.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3166, pruned_loss=0.08082, over 4286338.43 frames. ], batch size: 508, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:48:11,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1723572.0, ans=0.125 2023-06-24 10:48:20,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1723632.0, ans=0.0 2023-06-24 10:49:18,842 INFO [train.py:996] (3/4) Epoch 10, batch 12850, loss[loss=0.2095, simple_loss=0.31, pruned_loss=0.05445, over 21756.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3189, pruned_loss=0.08201, over 4282497.01 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:49:19,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-24 10:49:21,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-24 10:49:37,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1723812.0, ans=0.125 2023-06-24 10:50:01,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. 
limit=15.0 2023-06-24 10:50:27,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1723992.0, ans=0.125 2023-06-24 10:50:38,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1724052.0, ans=0.125 2023-06-24 10:50:48,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.383e+02 5.816e+02 7.846e+02 1.216e+03 2.443e+03, threshold=1.569e+03, percent-clipped=11.0 2023-06-24 10:51:02,571 INFO [train.py:996] (3/4) Epoch 10, batch 12900, loss[loss=0.2032, simple_loss=0.2955, pruned_loss=0.05548, over 21725.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.317, pruned_loss=0.07895, over 4280279.02 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:51:16,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1724112.0, ans=0.1 2023-06-24 10:51:31,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-24 10:52:43,480 INFO [train.py:996] (3/4) Epoch 10, batch 12950, loss[loss=0.2626, simple_loss=0.3362, pruned_loss=0.09447, over 21578.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3146, pruned_loss=0.07715, over 4279742.52 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:52:47,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-24 10:52:59,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-24 10:53:04,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724472.0, ans=0.1 2023-06-24 10:53:11,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1724472.0, ans=0.125 2023-06-24 10:54:15,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 8.466e+02 1.346e+03 1.826e+03 3.659e+03, threshold=2.691e+03, percent-clipped=37.0 2023-06-24 10:54:17,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1724652.0, ans=0.2 2023-06-24 10:54:23,653 INFO [train.py:996] (3/4) Epoch 10, batch 13000, loss[loss=0.1572, simple_loss=0.2242, pruned_loss=0.04513, over 21842.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3131, pruned_loss=0.07685, over 4285263.17 frames. ], batch size: 102, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:54:57,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-24 10:55:19,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-24 10:56:01,962 INFO [train.py:996] (3/4) Epoch 10, batch 13050, loss[loss=0.2155, simple_loss=0.2819, pruned_loss=0.07454, over 21589.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3095, pruned_loss=0.07553, over 4281557.16 frames. 
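The Whitening records each compare a measured statistic of some activation's channel covariance against a scheduled limit (e.g. "metric=6.84 vs. limit=15.0"). One statistic with exactly this behaviour, equal to 1.0 for perfectly white features and growing with anisotropy, is the mean squared eigenvalue of the covariance divided by its squared mean eigenvalue, computable from traces alone. The sketch below illustrates that definition; it is an assumption about what "metric" measures, not scaling.py's exact code:

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """mean(eig(cov)^2) / mean(eig(cov))^2, averaged over channel groups.
    1.0 means the channel covariance is a multiple of the identity."""
    x = x.reshape(-1, x.shape[-1])
    c = x.shape[-1] // num_groups
    x = x.reshape(-1, num_groups, c).transpose(0, 1)  # (groups, N, c)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / x.shape[1]          # (groups, c, c)
    # trace identities avoid an eigendecomposition:
    mean_eig_sq = torch.diagonal(cov @ cov, dim1=-2, dim2=-1).sum(-1) / c
    sq_mean_eig = (torch.diagonal(cov, dim1=-2, dim2=-1).sum(-1) / c) ** 2
    return (mean_eig_sq / sq_mean_eig).mean().item()


x = torch.randn(20000, 256)
print(whitening_metric(x))                              # close to 1.0 (white)
print(whitening_metric(x * torch.logspace(-2, 1, 256))) # well above 1.0
```

A num_groups > 1 record such as "whiten_keys, num_groups=4, num_channels=128" would apply the same statistic per group of channels, as the reshape above does; the log only prints a record when the measured metric is notable relative to its limit.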
], batch size: 548, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:56:10,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1725012.0, ans=0.0 2023-06-24 10:56:23,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725072.0, ans=0.1 2023-06-24 10:56:33,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1725072.0, ans=0.125 2023-06-24 10:57:02,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1725132.0, ans=0.125 2023-06-24 10:57:20,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-24 10:57:20,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1725192.0, ans=0.0 2023-06-24 10:57:29,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1725252.0, ans=0.125 2023-06-24 10:57:29,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-24 10:57:33,639 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 5.824e+02 8.081e+02 1.133e+03 2.445e+03, threshold=1.616e+03, percent-clipped=0.0 2023-06-24 10:57:34,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1725252.0, ans=0.0 2023-06-24 10:57:41,719 INFO [train.py:996] (3/4) Epoch 10, batch 13100, loss[loss=0.2182, simple_loss=0.293, pruned_loss=0.07173, over 21452.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3123, pruned_loss=0.07601, over 4284071.28 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:57:43,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1725312.0, ans=0.0 2023-06-24 10:57:45,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1725312.0, ans=0.125 2023-06-24 10:58:16,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1725372.0, ans=0.125 2023-06-24 10:58:45,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1725432.0, ans=0.05 2023-06-24 10:58:56,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1725492.0, ans=0.0 2023-06-24 10:59:08,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1725552.0, ans=10.0 2023-06-24 10:59:12,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-24 10:59:27,988 INFO [train.py:996] (3/4) Epoch 10, batch 13150, loss[loss=0.2381, simple_loss=0.3081, pruned_loss=0.08401, over 21431.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3164, pruned_loss=0.0787, over 4283405.13 frames. 
], batch size: 131, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:59:35,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1725612.0, ans=0.0 2023-06-24 10:59:59,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725672.0, ans=0.125 2023-06-24 10:59:59,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725672.0, ans=0.1 2023-06-24 11:00:25,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1725732.0, ans=0.125 2023-06-24 11:00:31,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1725792.0, ans=0.0 2023-06-24 11:00:53,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1725852.0, ans=0.125 2023-06-24 11:00:59,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 8.375e+02 1.328e+03 1.823e+03 3.736e+03, threshold=2.655e+03, percent-clipped=31.0 2023-06-24 11:01:01,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1725852.0, ans=0.0 2023-06-24 11:01:07,589 INFO [train.py:996] (3/4) Epoch 10, batch 13200, loss[loss=0.2509, simple_loss=0.3158, pruned_loss=0.093, over 21503.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3161, pruned_loss=0.07832, over 4281464.76 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:01:11,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1725912.0, ans=0.125 2023-06-24 11:01:32,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1725972.0, ans=0.0 2023-06-24 11:01:48,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1725972.0, ans=0.0 2023-06-24 11:02:38,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1726152.0, ans=0.0 2023-06-24 11:02:41,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1726152.0, ans=0.125 2023-06-24 11:02:52,527 INFO [train.py:996] (3/4) Epoch 10, batch 13250, loss[loss=0.2225, simple_loss=0.2821, pruned_loss=0.08145, over 21788.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3151, pruned_loss=0.08038, over 4279186.43 frames. ], batch size: 102, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:03:15,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 11:03:23,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=12.0 2023-06-24 11:03:31,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1726272.0, ans=0.2 2023-06-24 11:03:47,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1726332.0, ans=0.05 2023-06-24 11:04:24,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.098e+02 9.382e+02 1.293e+03 1.907e+03 4.949e+03, threshold=2.585e+03, percent-clipped=10.0 2023-06-24 11:04:32,041 INFO [train.py:996] (3/4) Epoch 10, batch 13300, loss[loss=0.2889, simple_loss=0.3667, pruned_loss=0.1056, over 21428.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3177, pruned_loss=0.08119, over 4284073.50 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:05:01,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726572.0, ans=0.1 2023-06-24 11:05:08,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-24 11:05:33,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1726692.0, ans=0.125 2023-06-24 11:05:52,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1726752.0, ans=0.0 2023-06-24 11:06:00,324 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:06:14,230 INFO [train.py:996] (3/4) Epoch 10, batch 13350, loss[loss=0.2262, simple_loss=0.3053, pruned_loss=0.07352, over 21845.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3218, pruned_loss=0.08415, over 4285187.37 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:06:32,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1726872.0, ans=10.0 2023-06-24 11:06:41,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-24 11:07:39,411 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.530e+02 6.336e+02 8.525e+02 1.254e+03 2.418e+03, threshold=1.705e+03, percent-clipped=0.0 2023-06-24 11:07:50,818 INFO [train.py:996] (3/4) Epoch 10, batch 13400, loss[loss=0.2761, simple_loss=0.3379, pruned_loss=0.1071, over 21769.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.324, pruned_loss=0.08573, over 4287776.03 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:07:51,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1727112.0, ans=0.125 2023-06-24 11:07:57,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1727112.0, ans=0.0 2023-06-24 11:07:59,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. 
limit=15.0 2023-06-24 11:08:19,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1727172.0, ans=0.0 2023-06-24 11:08:44,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1727232.0, ans=0.125 2023-06-24 11:09:27,799 INFO [train.py:996] (3/4) Epoch 10, batch 13450, loss[loss=0.2364, simple_loss=0.2977, pruned_loss=0.08757, over 21787.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3248, pruned_loss=0.08646, over 4284396.95 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:09:33,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1727412.0, ans=0.125 2023-06-24 11:10:32,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-24 11:10:34,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-24 11:10:54,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 11:10:59,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.972e+02 7.908e+02 1.192e+03 1.823e+03 3.915e+03, threshold=2.384e+03, percent-clipped=24.0 2023-06-24 11:11:03,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1727652.0, ans=0.0 2023-06-24 11:11:06,201 INFO [train.py:996] (3/4) Epoch 10, batch 13500, loss[loss=0.2373, simple_loss=0.2998, pruned_loss=0.08738, over 21399.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3151, pruned_loss=0.08332, over 4271569.47 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:11:41,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1727772.0, ans=0.0 2023-06-24 11:11:52,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1727832.0, ans=0.5 2023-06-24 11:12:50,302 INFO [train.py:996] (3/4) Epoch 10, batch 13550, loss[loss=0.2618, simple_loss=0.3495, pruned_loss=0.08707, over 21413.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3175, pruned_loss=0.08211, over 4275242.84 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:12:56,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1728012.0, ans=0.125 2023-06-24 11:13:17,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 11:13:58,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1728192.0, ans=10.0 2023-06-24 11:14:14,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 7.999e+02 1.250e+03 1.835e+03 3.854e+03, threshold=2.499e+03, percent-clipped=11.0 2023-06-24 11:14:21,337 INFO [train.py:996] (3/4) Epoch 10, batch 13600, loss[loss=0.2404, simple_loss=0.3054, pruned_loss=0.08774, over 21130.00 frames. 
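The balancer fields in these records (min_positive, max_positive, min_abs, max_abs, prob) suggest a module that keeps per-channel activation statistics inside a target range, applied with a scheduled probability (hence "balancer1.prob ... ans=0.125"). The sketch below only measures such violations as a diagnostic; it is an assumption about what the constraints mean, and the actual Balancer in scaling.py enforces them during training rather than just counting them:

```python
import torch


def balancer_violations(x: torch.Tensor,
                        min_positive: float = 0.05,
                        max_positive: float = 0.95,
                        max_abs: float = 10.0) -> dict:
    """Count channels whose statistics fall outside the target range.
    Parameter names mirror the log fields; defaults are values seen above."""
    x = x.reshape(-1, x.shape[-1])          # (N, num_channels)
    frac_pos = (x > 0).float().mean(dim=0)  # per-channel P(x > 0)
    mean_abs = x.abs().mean(dim=0)          # per-channel E|x|
    return {
        "too_negative": int((frac_pos < min_positive).sum()),  # nearly-dead channels
        "too_positive": int((frac_pos > max_positive).sum()),  # channels that never go negative
        "too_large":    int((mean_abs > max_abs).sum()),       # exploding channels
    }


print(balancer_violations(torch.randn(4096, 256)))  # all zeros for healthy activations
```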
], tot_loss[loss=0.2425, simple_loss=0.3187, pruned_loss=0.08316, over 4280201.08 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:14:58,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-06-24 11:15:35,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-06-24 11:16:02,444 INFO [train.py:996] (3/4) Epoch 10, batch 13650, loss[loss=0.2123, simple_loss=0.275, pruned_loss=0.07476, over 21286.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.314, pruned_loss=0.08011, over 4276633.09 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:16:25,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1728672.0, ans=0.125 2023-06-24 11:17:29,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.339e+02 6.847e+02 1.012e+03 1.771e+03 3.769e+03, threshold=2.024e+03, percent-clipped=10.0 2023-06-24 11:17:29,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1728852.0, ans=0.0 2023-06-24 11:17:38,778 INFO [train.py:996] (3/4) Epoch 10, batch 13700, loss[loss=0.1839, simple_loss=0.2433, pruned_loss=0.06225, over 21819.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3105, pruned_loss=0.07897, over 4276258.48 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:19:16,499 INFO [train.py:996] (3/4) Epoch 10, batch 13750, loss[loss=0.2456, simple_loss=0.3287, pruned_loss=0.08126, over 21626.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3119, pruned_loss=0.07933, over 4277347.22 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:19:26,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1729212.0, ans=0.0 2023-06-24 11:19:29,655 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:20:07,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 11:20:36,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1729392.0, ans=0.04949747468305833 2023-06-24 11:20:56,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 6.962e+02 1.196e+03 1.876e+03 4.514e+03, threshold=2.392e+03, percent-clipped=21.0 2023-06-24 11:21:05,656 INFO [train.py:996] (3/4) Epoch 10, batch 13800, loss[loss=0.2653, simple_loss=0.3686, pruned_loss=0.08101, over 21847.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3152, pruned_loss=0.07747, over 4264461.81 frames. ], batch size: 371, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:21:08,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-24 11:21:36,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1729572.0, ans=0.125 2023-06-24 11:22:49,292 INFO [train.py:996] (3/4) Epoch 10, batch 13850, loss[loss=0.2543, simple_loss=0.3272, pruned_loss=0.09072, over 21382.00 frames. 
], tot_loss[loss=0.2367, simple_loss=0.318, pruned_loss=0.07768, over 4264491.21 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:22:52,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1729812.0, ans=0.2 2023-06-24 11:23:11,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1729872.0, ans=0.2 2023-06-24 11:23:19,036 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:24:21,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.694e+02 1.020e+03 1.467e+03 3.637e+03, threshold=2.040e+03, percent-clipped=4.0 2023-06-24 11:24:25,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-24 11:24:25,888 INFO [train.py:996] (3/4) Epoch 10, batch 13900, loss[loss=0.229, simple_loss=0.3024, pruned_loss=0.07784, over 21824.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3218, pruned_loss=0.08135, over 4272413.04 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:25:29,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1730292.0, ans=0.95 2023-06-24 11:25:42,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-24 11:26:02,519 INFO [train.py:996] (3/4) Epoch 10, batch 13950, loss[loss=0.2466, simple_loss=0.3162, pruned_loss=0.08854, over 21636.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.322, pruned_loss=0.08361, over 4280393.35 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:26:22,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1730472.0, ans=0.125 2023-06-24 11:26:53,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730532.0, ans=0.1 2023-06-24 11:27:33,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.002e+02 6.836e+02 9.588e+02 1.525e+03 4.378e+03, threshold=1.918e+03, percent-clipped=10.0 2023-06-24 11:27:37,556 INFO [train.py:996] (3/4) Epoch 10, batch 14000, loss[loss=0.2123, simple_loss=0.2924, pruned_loss=0.06609, over 21436.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3185, pruned_loss=0.08105, over 4275513.21 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:27:42,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1730712.0, ans=0.125 2023-06-24 11:27:48,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730712.0, ans=0.1 2023-06-24 11:29:10,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1730952.0, ans=0.125 2023-06-24 11:29:12,982 INFO [train.py:996] (3/4) Epoch 10, batch 14050, loss[loss=0.2033, simple_loss=0.3027, pruned_loss=0.05196, over 21680.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3138, pruned_loss=0.07707, over 4261726.02 frames. 
], batch size: 414, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:29:19,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1731012.0, ans=0.125 2023-06-24 11:29:57,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1731132.0, ans=0.125 2023-06-24 11:30:45,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 8.037e+02 1.202e+03 1.798e+03 5.374e+03, threshold=2.404e+03, percent-clipped=21.0 2023-06-24 11:30:48,123 INFO [train.py:996] (3/4) Epoch 10, batch 14100, loss[loss=0.2429, simple_loss=0.3179, pruned_loss=0.08392, over 21717.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3078, pruned_loss=0.07714, over 4262737.88 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:30:50,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-24 11:30:54,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1731312.0, ans=0.0 2023-06-24 11:31:11,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-24 11:31:42,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1731492.0, ans=0.2 2023-06-24 11:31:58,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-24 11:32:23,919 INFO [train.py:996] (3/4) Epoch 10, batch 14150, loss[loss=0.2466, simple_loss=0.3245, pruned_loss=0.08436, over 21368.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3123, pruned_loss=0.07855, over 4258093.69 frames. ], batch size: 160, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:32:28,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1731612.0, ans=0.125 2023-06-24 11:33:09,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1731732.0, ans=0.2 2023-06-24 11:33:12,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1731792.0, ans=0.035 2023-06-24 11:33:29,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-24 11:33:50,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.112e+02 7.492e+02 9.193e+02 1.976e+03, threshold=1.498e+03, percent-clipped=0.0 2023-06-24 11:33:58,954 INFO [train.py:996] (3/4) Epoch 10, batch 14200, loss[loss=0.2197, simple_loss=0.2964, pruned_loss=0.07148, over 21306.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3103, pruned_loss=0.07733, over 4269835.95 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:34:15,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1731972.0, ans=0.0 2023-06-24 11:34:24,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. 
limit=10.0 2023-06-24 11:34:25,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1731972.0, ans=0.125 2023-06-24 11:34:34,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1732032.0, ans=0.125 2023-06-24 11:34:37,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732032.0, ans=0.125 2023-06-24 11:34:37,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1732032.0, ans=0.125 2023-06-24 11:35:10,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-24 11:35:29,035 INFO [train.py:996] (3/4) Epoch 10, batch 14250, loss[loss=0.2252, simple_loss=0.2797, pruned_loss=0.08538, over 21372.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3045, pruned_loss=0.07711, over 4266102.05 frames. ], batch size: 144, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:35:32,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1732212.0, ans=0.125 2023-06-24 11:35:44,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732212.0, ans=0.125 2023-06-24 11:35:53,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1732272.0, ans=0.5 2023-06-24 11:35:55,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1732272.0, ans=0.2 2023-06-24 11:36:03,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1732272.0, ans=10.0 2023-06-24 11:37:05,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.862e+02 7.753e+02 1.347e+03 3.974e+03, threshold=1.551e+03, percent-clipped=20.0 2023-06-24 11:37:08,445 INFO [train.py:996] (3/4) Epoch 10, batch 14300, loss[loss=0.3837, simple_loss=0.4684, pruned_loss=0.1495, over 21529.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3088, pruned_loss=0.07806, over 4247624.40 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:38:14,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732692.0, ans=0.1 2023-06-24 11:38:44,739 INFO [train.py:996] (3/4) Epoch 10, batch 14350, loss[loss=0.2469, simple_loss=0.3197, pruned_loss=0.087, over 21747.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3124, pruned_loss=0.07793, over 4232325.82 frames. 
], batch size: 389, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:39:11,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732872.0, ans=0.1 2023-06-24 11:39:25,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1732932.0, ans=0.125 2023-06-24 11:40:17,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.258e+02 7.078e+02 1.011e+03 1.349e+03 3.463e+03, threshold=2.022e+03, percent-clipped=22.0 2023-06-24 11:40:25,256 INFO [train.py:996] (3/4) Epoch 10, batch 14400, loss[loss=0.1837, simple_loss=0.257, pruned_loss=0.0552, over 21572.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3099, pruned_loss=0.07868, over 4244796.67 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:41:35,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1733352.0, ans=0.125 2023-06-24 11:41:37,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-24 11:41:54,318 INFO [train.py:996] (3/4) Epoch 10, batch 14450, loss[loss=0.2658, simple_loss=0.3179, pruned_loss=0.1069, over 21621.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.305, pruned_loss=0.07915, over 4250538.53 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:42:11,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1733412.0, ans=0.0 2023-06-24 11:42:24,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 11:42:32,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1733532.0, ans=0.1 2023-06-24 11:42:45,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-24 11:43:05,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-24 11:43:23,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 6.420e+02 9.253e+02 1.365e+03 3.104e+03, threshold=1.851e+03, percent-clipped=3.0 2023-06-24 11:43:26,353 INFO [train.py:996] (3/4) Epoch 10, batch 14500, loss[loss=0.2015, simple_loss=0.2792, pruned_loss=0.06185, over 21798.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3024, pruned_loss=0.0791, over 4253778.98 frames. ], batch size: 118, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:44:08,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-24 11:44:46,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-24 11:44:55,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. 
limit=15.0 2023-06-24 11:45:08,968 INFO [train.py:996] (3/4) Epoch 10, batch 14550, loss[loss=0.2755, simple_loss=0.3496, pruned_loss=0.1007, over 21983.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3074, pruned_loss=0.0806, over 4260204.98 frames. ], batch size: 317, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:45:11,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-24 11:46:02,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1734132.0, ans=0.0 2023-06-24 11:46:05,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-24 11:46:11,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1734192.0, ans=0.0 2023-06-24 11:46:21,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1734192.0, ans=0.125 2023-06-24 11:46:34,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1734252.0, ans=0.05 2023-06-24 11:46:40,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1734252.0, ans=0.1 2023-06-24 11:46:42,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.617e+02 6.748e+02 9.842e+02 1.367e+03 3.226e+03, threshold=1.968e+03, percent-clipped=9.0 2023-06-24 11:46:45,768 INFO [train.py:996] (3/4) Epoch 10, batch 14600, loss[loss=0.2889, simple_loss=0.3602, pruned_loss=0.1089, over 21462.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3148, pruned_loss=0.08285, over 4258577.93 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:46:53,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1734312.0, ans=0.2 2023-06-24 11:48:21,369 INFO [train.py:996] (3/4) Epoch 10, batch 14650, loss[loss=0.1739, simple_loss=0.2626, pruned_loss=0.04255, over 21699.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3164, pruned_loss=0.08215, over 4246106.32 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:49:07,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734732.0, ans=0.125 2023-06-24 11:49:39,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1734792.0, ans=0.0 2023-06-24 11:49:50,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1734852.0, ans=0.0 2023-06-24 11:49:54,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 6.829e+02 9.850e+02 1.571e+03 3.523e+03, threshold=1.970e+03, percent-clipped=13.0 2023-06-24 11:49:57,773 INFO [train.py:996] (3/4) Epoch 10, batch 14700, loss[loss=0.2636, simple_loss=0.3629, pruned_loss=0.08213, over 21549.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3088, pruned_loss=0.07629, over 4241602.32 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:49:58,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-24 11:50:05,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-24 11:50:15,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1734972.0, ans=0.0 2023-06-24 11:50:43,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-24 11:51:17,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1735092.0, ans=0.0 2023-06-24 11:51:36,590 INFO [train.py:996] (3/4) Epoch 10, batch 14750, loss[loss=0.25, simple_loss=0.32, pruned_loss=0.08996, over 21455.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.315, pruned_loss=0.07957, over 4255844.97 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:52:18,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1735332.0, ans=0.035 2023-06-24 11:52:18,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1735332.0, ans=0.0 2023-06-24 11:52:52,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1735392.0, ans=0.125 2023-06-24 11:52:53,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1735392.0, ans=0.125 2023-06-24 11:52:56,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-24 11:52:57,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1735452.0, ans=0.125 2023-06-24 11:53:10,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.762e+02 1.072e+03 1.702e+03 3.196e+03, threshold=2.144e+03, percent-clipped=17.0 2023-06-24 11:53:13,756 INFO [train.py:996] (3/4) Epoch 10, batch 14800, loss[loss=0.2591, simple_loss=0.3259, pruned_loss=0.09612, over 21500.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3271, pruned_loss=0.08575, over 4259911.30 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:54:17,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1735632.0, ans=0.0 2023-06-24 11:54:26,252 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:54:31,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-24 11:54:35,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1735752.0, ans=0.125 2023-06-24 11:55:02,598 INFO [train.py:996] (3/4) Epoch 10, batch 14850, loss[loss=0.1777, simple_loss=0.252, pruned_loss=0.05171, over 21483.00 frames. 
], tot_loss[loss=0.2442, simple_loss=0.3196, pruned_loss=0.08445, over 4257301.59 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:55:33,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1735872.0, ans=0.2 2023-06-24 11:56:24,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-24 11:56:35,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1736052.0, ans=15.0 2023-06-24 11:56:37,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.296e+02 7.240e+02 1.054e+03 1.566e+03 3.588e+03, threshold=2.108e+03, percent-clipped=9.0 2023-06-24 11:56:40,614 INFO [train.py:996] (3/4) Epoch 10, batch 14900, loss[loss=0.231, simple_loss=0.3129, pruned_loss=0.07458, over 20012.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3223, pruned_loss=0.08584, over 4265326.57 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:57:22,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1736232.0, ans=0.0 2023-06-24 11:57:30,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-24 11:57:46,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736292.0, ans=0.1 2023-06-24 11:58:01,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1736352.0, ans=0.125 2023-06-24 11:58:28,352 INFO [train.py:996] (3/4) Epoch 10, batch 14950, loss[loss=0.2491, simple_loss=0.3361, pruned_loss=0.08105, over 21639.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3225, pruned_loss=0.08481, over 4260694.67 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:59:10,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1736532.0, ans=0.125 2023-06-24 11:59:23,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1736592.0, ans=0.125 2023-06-24 11:59:35,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1736592.0, ans=0.04949747468305833 2023-06-24 11:59:49,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736652.0, ans=0.1 2023-06-24 12:00:04,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.178e+02 7.042e+02 9.483e+02 1.424e+03 2.881e+03, threshold=1.897e+03, percent-clipped=9.0 2023-06-24 12:00:06,677 INFO [train.py:996] (3/4) Epoch 10, batch 15000, loss[loss=0.2554, simple_loss=0.3043, pruned_loss=0.1033, over 16085.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3235, pruned_loss=0.08592, over 4252259.70 frames. ], batch size: 60, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:00:06,678 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 12:00:22,750 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2522, simple_loss=0.3488, pruned_loss=0.07776, over 1796401.00 frames. 
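Aside: the entry above and the one that follows record the periodic validation pass — train.py switches to the dev set, accumulates a frame-weighted loss over all dev cuts, then reports the peak GPU memory seen so far. Below is a minimal sketch of that pattern under a generic PyTorch setup; run_validation, valid_dl, and compute_loss are illustrative names, not icefall's actual interfaces.

import torch

def run_validation(model, valid_dl, compute_loss, device):
    # Frame-weighted dev loss, matching "validation: loss=..., over N frames."
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            # compute_loss is a hypothetical helper returning (loss, num_frames)
            loss, num_frames = compute_loss(model, batch, device)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    # Matches "Maximum memory allocated so far is ...MB" in the next entry.
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return tot_loss / tot_frames, max_mb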
2023-06-24 12:00:22,751 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 12:00:24,804 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:00:54,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1736772.0, ans=0.125 2023-06-24 12:01:05,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1736832.0, ans=0.0 2023-06-24 12:01:22,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1736892.0, ans=0.0 2023-06-24 12:01:32,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-24 12:01:33,924 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:02:00,955 INFO [train.py:996] (3/4) Epoch 10, batch 15050, loss[loss=0.22, simple_loss=0.3, pruned_loss=0.06997, over 21775.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3261, pruned_loss=0.08739, over 4257547.46 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:02:41,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1737132.0, ans=0.125 2023-06-24 12:02:44,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1737132.0, ans=0.125 2023-06-24 12:02:49,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1737132.0, ans=0.0 2023-06-24 12:03:36,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.670e+02 1.430e+03 2.221e+03 3.965e+03, threshold=2.861e+03, percent-clipped=33.0 2023-06-24 12:03:38,474 INFO [train.py:996] (3/4) Epoch 10, batch 15100, loss[loss=0.2442, simple_loss=0.3177, pruned_loss=0.08533, over 21445.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3286, pruned_loss=0.08662, over 4256511.26 frames. ], batch size: 211, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:03:51,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1737312.0, ans=0.0 2023-06-24 12:03:53,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.70 vs. limit=6.0 2023-06-24 12:04:27,500 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.604e-03 2023-06-24 12:04:29,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-24 12:04:42,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-24 12:05:11,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737552.0, ans=0.1 2023-06-24 12:05:15,746 INFO [train.py:996] (3/4) Epoch 10, batch 15150, loss[loss=0.224, simple_loss=0.2862, pruned_loss=0.08093, over 21149.00 frames. 
], tot_loss[loss=0.2485, simple_loss=0.3244, pruned_loss=0.08632, over 4253498.09 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:05:33,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1737612.0, ans=0.125 2023-06-24 12:06:06,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1737732.0, ans=0.0 2023-06-24 12:06:07,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-24 12:06:18,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-24 12:06:55,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.279e+02 6.808e+02 1.091e+03 1.698e+03 5.270e+03, threshold=2.181e+03, percent-clipped=2.0 2023-06-24 12:07:02,090 INFO [train.py:996] (3/4) Epoch 10, batch 15200, loss[loss=0.2222, simple_loss=0.3179, pruned_loss=0.06329, over 21546.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3139, pruned_loss=0.08206, over 4257310.83 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 12:08:32,601 INFO [train.py:996] (3/4) Epoch 10, batch 15250, loss[loss=0.2108, simple_loss=0.2787, pruned_loss=0.07142, over 21474.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3083, pruned_loss=0.08055, over 4261926.76 frames. ], batch size: 212, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:09:51,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1738452.0, ans=0.0 2023-06-24 12:09:51,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1738452.0, ans=0.125 2023-06-24 12:09:55,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1738452.0, ans=0.125 2023-06-24 12:09:58,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1738452.0, ans=0.2 2023-06-24 12:10:00,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1738452.0, ans=0.2 2023-06-24 12:10:17,222 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 9.550e+02 1.644e+03 2.423e+03 4.460e+03, threshold=3.287e+03, percent-clipped=35.0 2023-06-24 12:10:17,242 INFO [train.py:996] (3/4) Epoch 10, batch 15300, loss[loss=0.291, simple_loss=0.3491, pruned_loss=0.1165, over 21447.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3123, pruned_loss=0.08391, over 4263909.20 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:10:56,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1738572.0, ans=0.125 2023-06-24 12:10:57,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1738632.0, ans=0.025 2023-06-24 12:11:22,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=10.0 2023-06-24 12:11:54,378 INFO [train.py:996] (3/4) Epoch 10, batch 15350, loss[loss=0.2309, simple_loss=0.3133, pruned_loss=0.07427, over 21449.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.319, pruned_loss=0.08645, over 4258004.23 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:12:36,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1738932.0, ans=0.125 2023-06-24 12:12:37,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1738932.0, ans=0.1 2023-06-24 12:13:24,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.880e+02 1.088e+03 1.633e+03 3.514e+03, threshold=2.175e+03, percent-clipped=1.0 2023-06-24 12:13:24,438 INFO [train.py:996] (3/4) Epoch 10, batch 15400, loss[loss=0.2385, simple_loss=0.303, pruned_loss=0.08702, over 21311.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3193, pruned_loss=0.08472, over 4273029.87 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:14:21,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1739232.0, ans=0.125 2023-06-24 12:14:40,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1739352.0, ans=0.0 2023-06-24 12:15:04,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1739412.0, ans=0.125 2023-06-24 12:15:05,361 INFO [train.py:996] (3/4) Epoch 10, batch 15450, loss[loss=0.2312, simple_loss=0.3243, pruned_loss=0.06906, over 21788.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3163, pruned_loss=0.08367, over 4278997.96 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:15:07,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 12:15:26,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1739472.0, ans=0.09899494936611666 2023-06-24 12:16:40,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1739652.0, ans=0.125 2023-06-24 12:16:43,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.285e+02 1.027e+03 1.632e+03 3.153e+03, threshold=2.054e+03, percent-clipped=10.0 2023-06-24 12:16:43,418 INFO [train.py:996] (3/4) Epoch 10, batch 15500, loss[loss=0.2855, simple_loss=0.356, pruned_loss=0.1075, over 21826.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3181, pruned_loss=0.08385, over 4259629.20 frames. 
], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:16:52,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1739712.0, ans=0.0 2023-06-24 12:17:24,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1739832.0, ans=0.125 2023-06-24 12:17:32,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739832.0, ans=0.1 2023-06-24 12:17:36,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1739832.0, ans=0.125 2023-06-24 12:18:21,657 INFO [train.py:996] (3/4) Epoch 10, batch 15550, loss[loss=0.224, simple_loss=0.3134, pruned_loss=0.06733, over 21643.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3154, pruned_loss=0.08122, over 4263873.20 frames. ], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:19:14,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1740132.0, ans=0.125 2023-06-24 12:19:42,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-24 12:19:58,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.243e+02 5.866e+02 9.218e+02 1.616e+03 3.082e+03, threshold=1.844e+03, percent-clipped=8.0 2023-06-24 12:19:58,938 INFO [train.py:996] (3/4) Epoch 10, batch 15600, loss[loss=0.2094, simple_loss=0.266, pruned_loss=0.07635, over 21510.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3092, pruned_loss=0.08051, over 4255601.08 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:19:59,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1740312.0, ans=0.125 2023-06-24 12:20:02,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1740312.0, ans=0.125 2023-06-24 12:20:28,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1740372.0, ans=0.0 2023-06-24 12:21:04,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1740492.0, ans=0.0 2023-06-24 12:21:30,695 INFO [train.py:996] (3/4) Epoch 10, batch 15650, loss[loss=0.2307, simple_loss=0.2886, pruned_loss=0.08645, over 21300.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3081, pruned_loss=0.0802, over 4263617.92 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:21:53,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=12.0 2023-06-24 12:22:00,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740672.0, ans=0.1 2023-06-24 12:22:13,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1740732.0, ans=0.125 2023-06-24 12:22:42,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1740792.0, ans=0.2 2023-06-24 12:23:07,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 7.365e+02 1.096e+03 1.379e+03 2.536e+03, threshold=2.192e+03, percent-clipped=6.0 2023-06-24 12:23:07,116 INFO [train.py:996] (3/4) Epoch 10, batch 15700, loss[loss=0.2153, simple_loss=0.3142, pruned_loss=0.05819, over 21190.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3038, pruned_loss=0.07853, over 4255824.58 frames. ], batch size: 549, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:23:29,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1740912.0, ans=0.125 2023-06-24 12:23:46,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1741032.0, ans=0.125 2023-06-24 12:23:47,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.83 vs. limit=6.0 2023-06-24 12:24:17,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1741092.0, ans=0.125 2023-06-24 12:24:20,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1741092.0, ans=0.0 2023-06-24 12:24:43,537 INFO [train.py:996] (3/4) Epoch 10, batch 15750, loss[loss=0.2552, simple_loss=0.3159, pruned_loss=0.09731, over 21858.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3006, pruned_loss=0.07861, over 4251320.77 frames. ], batch size: 98, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:25:26,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1741332.0, ans=0.125 2023-06-24 12:25:34,390 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:26:13,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.443e+02 6.609e+02 9.142e+02 1.184e+03 2.398e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 12:26:13,792 INFO [train.py:996] (3/4) Epoch 10, batch 15800, loss[loss=0.1816, simple_loss=0.2535, pruned_loss=0.05492, over 21373.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2958, pruned_loss=0.07762, over 4251487.63 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:26:59,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1741632.0, ans=0.1 2023-06-24 12:27:28,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1741692.0, ans=0.2 2023-06-24 12:27:41,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1741752.0, ans=0.125 2023-06-24 12:27:49,622 INFO [train.py:996] (3/4) Epoch 10, batch 15850, loss[loss=0.2372, simple_loss=0.3036, pruned_loss=0.08537, over 21815.00 frames. 
], tot_loss[loss=0.2302, simple_loss=0.3003, pruned_loss=0.08009, over 4248020.74 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:28:12,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1741812.0, ans=0.125 2023-06-24 12:29:11,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742052.0, ans=0.1 2023-06-24 12:29:16,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1742052.0, ans=0.125 2023-06-24 12:29:26,874 INFO [train.py:996] (3/4) Epoch 10, batch 15900, loss[loss=0.2159, simple_loss=0.2933, pruned_loss=0.06923, over 21750.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3022, pruned_loss=0.08098, over 4247183.34 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:29:28,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 8.317e+02 1.237e+03 1.605e+03 4.098e+03, threshold=2.474e+03, percent-clipped=15.0 2023-06-24 12:30:27,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1742292.0, ans=0.2 2023-06-24 12:30:29,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-24 12:30:36,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742292.0, ans=0.1 2023-06-24 12:30:38,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2023-06-24 12:30:57,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742352.0, ans=0.1 2023-06-24 12:31:04,684 INFO [train.py:996] (3/4) Epoch 10, batch 15950, loss[loss=0.2952, simple_loss=0.3715, pruned_loss=0.1095, over 21685.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3027, pruned_loss=0.07737, over 4241785.42 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:31:16,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1742412.0, ans=0.2 2023-06-24 12:31:34,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1742472.0, ans=0.125 2023-06-24 12:31:38,992 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:31:48,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1742532.0, ans=0.125 2023-06-24 12:32:34,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-24 12:32:43,154 INFO [train.py:996] (3/4) Epoch 10, batch 16000, loss[loss=0.2276, simple_loss=0.3227, pruned_loss=0.06626, over 21788.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3009, pruned_loss=0.07438, over 4253607.62 frames. 
], batch size: 351, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:32:44,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 6.384e+02 8.996e+02 1.327e+03 2.604e+03, threshold=1.799e+03, percent-clipped=2.0 2023-06-24 12:33:15,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1742772.0, ans=0.125 2023-06-24 12:33:42,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1742892.0, ans=0.125 2023-06-24 12:33:56,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742892.0, ans=0.1 2023-06-24 12:34:20,836 INFO [train.py:996] (3/4) Epoch 10, batch 16050, loss[loss=0.2456, simple_loss=0.3384, pruned_loss=0.07644, over 21651.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3015, pruned_loss=0.07207, over 4264186.10 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:41,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1743252.0, ans=15.0 2023-06-24 12:35:42,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1743252.0, ans=0.125 2023-06-24 12:35:51,302 INFO [train.py:996] (3/4) Epoch 10, batch 16100, loss[loss=0.2736, simple_loss=0.324, pruned_loss=0.1116, over 21823.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3054, pruned_loss=0.0744, over 4267936.32 frames. ], batch size: 508, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:54,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 6.034e+02 8.242e+02 1.333e+03 2.832e+03, threshold=1.648e+03, percent-clipped=8.0 2023-06-24 12:36:31,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1743432.0, ans=0.125 2023-06-24 12:36:40,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743432.0, ans=0.1 2023-06-24 12:36:45,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1743492.0, ans=0.125 2023-06-24 12:37:26,916 INFO [train.py:996] (3/4) Epoch 10, batch 16150, loss[loss=0.233, simple_loss=0.3119, pruned_loss=0.07704, over 21827.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.305, pruned_loss=0.07728, over 4281060.68 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:37:41,015 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:38:11,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. 
limit=22.5 2023-06-24 12:38:52,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1743852.0, ans=0.04949747468305833 2023-06-24 12:38:57,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1743852.0, ans=0.2 2023-06-24 12:38:57,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1743852.0, ans=6.0 2023-06-24 12:38:59,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-24 12:39:05,193 INFO [train.py:996] (3/4) Epoch 10, batch 16200, loss[loss=0.2675, simple_loss=0.3424, pruned_loss=0.09632, over 21449.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3105, pruned_loss=0.07971, over 4282413.59 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:39:08,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 7.202e+02 1.055e+03 1.408e+03 3.192e+03, threshold=2.110e+03, percent-clipped=15.0 2023-06-24 12:39:26,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1743972.0, ans=0.125 2023-06-24 12:39:29,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1743972.0, ans=0.125 2023-06-24 12:39:39,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1743972.0, ans=0.025 2023-06-24 12:40:21,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.72 vs. limit=22.5 2023-06-24 12:40:22,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1744152.0, ans=0.125 2023-06-24 12:40:37,332 INFO [train.py:996] (3/4) Epoch 10, batch 16250, loss[loss=0.1688, simple_loss=0.2433, pruned_loss=0.04712, over 21185.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3125, pruned_loss=0.08034, over 4282127.16 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:40:37,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1744212.0, ans=0.125 2023-06-24 12:41:11,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1744272.0, ans=0.035 2023-06-24 12:41:12,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1744272.0, ans=0.125 2023-06-24 12:41:34,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744332.0, ans=0.1 2023-06-24 12:42:13,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1744452.0, ans=0.125 2023-06-24 12:42:18,477 INFO [train.py:996] (3/4) Epoch 10, batch 16300, loss[loss=0.2111, simple_loss=0.2768, pruned_loss=0.07267, over 21764.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3056, pruned_loss=0.07572, over 4266507.67 frames. 
], batch size: 112, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:42:19,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1744512.0, ans=0.125 2023-06-24 12:42:19,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1744512.0, ans=0.04949747468305833 2023-06-24 12:42:27,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.511e+02 9.054e+02 1.473e+03 4.161e+03, threshold=1.811e+03, percent-clipped=10.0 2023-06-24 12:42:30,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744512.0, ans=0.1 2023-06-24 12:42:33,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744512.0, ans=0.1 2023-06-24 12:42:54,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-24 12:43:09,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1744632.0, ans=0.035 2023-06-24 12:43:36,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1744752.0, ans=0.1 2023-06-24 12:44:01,330 INFO [train.py:996] (3/4) Epoch 10, batch 16350, loss[loss=0.2584, simple_loss=0.3284, pruned_loss=0.09419, over 21651.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3049, pruned_loss=0.07573, over 4257092.83 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 8.0 2023-06-24 12:44:09,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-24 12:44:25,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1744872.0, ans=0.0 2023-06-24 12:44:34,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-24 12:44:50,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1744932.0, ans=0.0 2023-06-24 12:45:07,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-24 12:45:16,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1745052.0, ans=0.2 2023-06-24 12:45:38,319 INFO [train.py:996] (3/4) Epoch 10, batch 16400, loss[loss=0.2289, simple_loss=0.3039, pruned_loss=0.07692, over 21820.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3118, pruned_loss=0.07853, over 4264180.77 frames. 
], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:45:40,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1745112.0, ans=0.0 2023-06-24 12:45:42,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.368e+02 7.745e+02 1.144e+03 1.661e+03 2.943e+03, threshold=2.288e+03, percent-clipped=22.0 2023-06-24 12:45:52,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1745172.0, ans=0.125 2023-06-24 12:46:42,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1745292.0, ans=0.125 2023-06-24 12:46:50,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-24 12:46:53,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-24 12:47:16,240 INFO [train.py:996] (3/4) Epoch 10, batch 16450, loss[loss=0.2712, simple_loss=0.3279, pruned_loss=0.1073, over 21625.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3113, pruned_loss=0.07958, over 4266520.34 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:48:14,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1745592.0, ans=0.0 2023-06-24 12:48:46,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-24 12:48:53,689 INFO [train.py:996] (3/4) Epoch 10, batch 16500, loss[loss=0.1701, simple_loss=0.2317, pruned_loss=0.05428, over 21340.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3095, pruned_loss=0.07962, over 4263797.70 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:48:57,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1745712.0, ans=0.125 2023-06-24 12:48:58,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.573e+02 7.722e+02 1.056e+03 1.682e+03 4.861e+03, threshold=2.112e+03, percent-clipped=4.0 2023-06-24 12:49:14,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1745772.0, ans=0.1 2023-06-24 12:49:28,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-24 12:49:42,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1745832.0, ans=0.125 2023-06-24 12:49:55,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1745892.0, ans=0.0 2023-06-24 12:50:19,553 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-24 12:50:31,379 INFO [train.py:996] (3/4) Epoch 10, batch 16550, loss[loss=0.2897, simple_loss=0.3694, pruned_loss=0.105, over 21462.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3096, pruned_loss=0.07758, over 4266191.09 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:51:23,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1746132.0, ans=0.125 2023-06-24 12:51:31,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1746132.0, ans=0.025 2023-06-24 12:51:51,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1746192.0, ans=0.125 2023-06-24 12:51:52,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0 2023-06-24 12:52:01,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1746252.0, ans=0.125 2023-06-24 12:52:07,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746252.0, ans=0.1 2023-06-24 12:52:16,511 INFO [train.py:996] (3/4) Epoch 10, batch 16600, loss[loss=0.2536, simple_loss=0.3558, pruned_loss=0.0757, over 21367.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3187, pruned_loss=0.08128, over 4271584.82 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:52:21,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 8.131e+02 1.242e+03 1.757e+03 3.477e+03, threshold=2.484e+03, percent-clipped=12.0 2023-06-24 12:53:20,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. limit=10.0 2023-06-24 12:53:22,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-24 12:53:45,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1746552.0, ans=0.125 2023-06-24 12:54:00,369 INFO [train.py:996] (3/4) Epoch 10, batch 16650, loss[loss=0.2676, simple_loss=0.3389, pruned_loss=0.09812, over 21469.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3246, pruned_loss=0.08323, over 4267157.85 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:54:10,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1746612.0, ans=0.0 2023-06-24 12:54:20,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-24 12:54:20,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. 
limit=6.0 2023-06-24 12:54:26,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1746672.0, ans=0.05 2023-06-24 12:54:33,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1746672.0, ans=0.015 2023-06-24 12:54:37,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1746672.0, ans=0.125 2023-06-24 12:55:42,450 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-24 12:55:44,440 INFO [train.py:996] (3/4) Epoch 10, batch 16700, loss[loss=0.2571, simple_loss=0.3375, pruned_loss=0.08834, over 21911.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3243, pruned_loss=0.08284, over 4270534.97 frames. ], batch size: 372, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:55:49,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.755e+02 7.026e+02 1.004e+03 1.401e+03 2.239e+03, threshold=2.007e+03, percent-clipped=0.0 2023-06-24 12:55:59,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1746912.0, ans=0.015 2023-06-24 12:56:37,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1747032.0, ans=0.0 2023-06-24 12:57:09,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.67 vs. limit=10.0 2023-06-24 12:57:10,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1747152.0, ans=15.0 2023-06-24 12:57:30,021 INFO [train.py:996] (3/4) Epoch 10, batch 16750, loss[loss=0.2097, simple_loss=0.2623, pruned_loss=0.07861, over 20052.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3255, pruned_loss=0.0849, over 4269822.37 frames. ], batch size: 704, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:58:35,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1747392.0, ans=0.0 2023-06-24 12:58:51,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1747452.0, ans=0.125 2023-06-24 12:59:08,083 INFO [train.py:996] (3/4) Epoch 10, batch 16800, loss[loss=0.2251, simple_loss=0.2785, pruned_loss=0.08584, over 20044.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3276, pruned_loss=0.08428, over 4256032.60 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:59:12,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.959e+02 1.077e+03 1.675e+03 3.931e+03, threshold=2.154e+03, percent-clipped=17.0 2023-06-24 12:59:22,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1747512.0, ans=0.0 2023-06-24 12:59:26,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-24 12:59:49,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1747572.0, ans=0.07 2023-06-24 13:00:04,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-24 13:00:13,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1747692.0, ans=0.0 2023-06-24 13:00:22,656 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:00:37,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747752.0, ans=0.125 2023-06-24 13:00:43,536 INFO [train.py:996] (3/4) Epoch 10, batch 16850, loss[loss=0.2389, simple_loss=0.31, pruned_loss=0.08393, over 21372.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3241, pruned_loss=0.08494, over 4269367.55 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:00:49,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.22 vs. limit=10.0 2023-06-24 13:01:34,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747932.0, ans=0.1 2023-06-24 13:01:35,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.61 vs. limit=5.0 2023-06-24 13:01:59,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747992.0, ans=0.0 2023-06-24 13:02:19,540 INFO [train.py:996] (3/4) Epoch 10, batch 16900, loss[loss=0.2443, simple_loss=0.3624, pruned_loss=0.06312, over 20789.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3204, pruned_loss=0.08405, over 4271997.23 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:02:30,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 7.003e+02 1.142e+03 1.621e+03 3.220e+03, threshold=2.284e+03, percent-clipped=11.0 2023-06-24 13:03:01,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748172.0, ans=0.1 2023-06-24 13:03:34,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1748292.0, ans=0.05 2023-06-24 13:03:39,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1748352.0, ans=0.125 2023-06-24 13:03:44,896 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.28 vs. limit=10.0 2023-06-24 13:03:56,465 INFO [train.py:996] (3/4) Epoch 10, batch 16950, loss[loss=0.2286, simple_loss=0.2954, pruned_loss=0.08096, over 21258.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3141, pruned_loss=0.08232, over 4271461.87 frames. 
], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:04:17,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1748472.0, ans=0.125 2023-06-24 13:04:26,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1748472.0, ans=0.125 2023-06-24 13:04:49,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1748532.0, ans=0.2 2023-06-24 13:05:28,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1748652.0, ans=10.0 2023-06-24 13:05:33,930 INFO [train.py:996] (3/4) Epoch 10, batch 17000, loss[loss=0.2401, simple_loss=0.3073, pruned_loss=0.08647, over 21956.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3122, pruned_loss=0.08265, over 4273589.47 frames. ], batch size: 316, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:05:37,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-24 13:05:44,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 6.732e+02 9.381e+02 1.306e+03 2.679e+03, threshold=1.876e+03, percent-clipped=4.0 2023-06-24 13:05:46,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1748712.0, ans=0.125 2023-06-24 13:06:13,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-24 13:06:14,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1748772.0, ans=0.0 2023-06-24 13:06:14,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1748772.0, ans=0.0 2023-06-24 13:06:25,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1748832.0, ans=0.125 2023-06-24 13:06:57,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1748952.0, ans=0.0 2023-06-24 13:07:15,580 INFO [train.py:996] (3/4) Epoch 10, batch 17050, loss[loss=0.3177, simple_loss=0.412, pruned_loss=0.1117, over 21680.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3209, pruned_loss=0.08491, over 4270286.19 frames. ], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:07:40,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749072.0, ans=0.1 2023-06-24 13:07:48,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1749072.0, ans=0.125 2023-06-24 13:08:46,009 INFO [train.py:996] (3/4) Epoch 10, batch 17100, loss[loss=0.2542, simple_loss=0.3189, pruned_loss=0.0948, over 21314.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.322, pruned_loss=0.08642, over 4277008.12 frames. 
], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:08:56,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.820e+02 8.115e+02 1.148e+03 1.810e+03 4.142e+03, threshold=2.296e+03, percent-clipped=21.0 2023-06-24 13:08:58,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1749312.0, ans=0.0 2023-06-24 13:09:24,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1749372.0, ans=0.015 2023-06-24 13:09:59,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-24 13:10:26,352 INFO [train.py:996] (3/4) Epoch 10, batch 17150, loss[loss=0.2384, simple_loss=0.3013, pruned_loss=0.0877, over 21812.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3171, pruned_loss=0.08584, over 4287987.52 frames. ], batch size: 112, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:10:40,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1749612.0, ans=0.07 2023-06-24 13:12:07,582 INFO [train.py:996] (3/4) Epoch 10, batch 17200, loss[loss=0.3156, simple_loss=0.3671, pruned_loss=0.132, over 21310.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3164, pruned_loss=0.08549, over 4288595.07 frames. ], batch size: 507, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:12:18,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.490e+02 5.928e+02 7.581e+02 1.081e+03 2.493e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-24 13:12:27,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-24 13:12:47,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1749972.0, ans=0.0 2023-06-24 13:13:33,766 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:13:49,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750152.0, ans=0.1 2023-06-24 13:13:49,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750152.0, ans=0.1 2023-06-24 13:13:52,376 INFO [train.py:996] (3/4) Epoch 10, batch 17250, loss[loss=0.2892, simple_loss=0.3606, pruned_loss=0.1089, over 21929.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3178, pruned_loss=0.08631, over 4284538.88 frames. ], batch size: 317, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:15:02,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-24 13:15:34,679 INFO [train.py:996] (3/4) Epoch 10, batch 17300, loss[loss=0.2848, simple_loss=0.3496, pruned_loss=0.11, over 21427.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3257, pruned_loss=0.08977, over 4287265.05 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:15:42,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.635e+02 9.609e+02 1.379e+03 2.737e+03, threshold=1.922e+03, percent-clipped=17.0 2023-06-24 13:15:45,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1750512.0, ans=0.0 2023-06-24 13:15:48,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-24 13:16:26,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1750632.0, ans=0.125 2023-06-24 13:16:40,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1750692.0, ans=0.0 2023-06-24 13:16:48,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1750692.0, ans=0.0 2023-06-24 13:17:04,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1750752.0, ans=0.125 2023-06-24 13:17:07,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750752.0, ans=0.1 2023-06-24 13:17:11,816 INFO [train.py:996] (3/4) Epoch 10, batch 17350, loss[loss=0.2159, simple_loss=0.2958, pruned_loss=0.06799, over 21375.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3267, pruned_loss=0.08863, over 4287688.55 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:17:13,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1750812.0, ans=0.125 2023-06-24 13:17:15,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750812.0, ans=0.125 2023-06-24 13:17:38,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1750872.0, ans=0.2 2023-06-24 13:18:23,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1750992.0, ans=0.125 2023-06-24 13:18:49,840 INFO [train.py:996] (3/4) Epoch 10, batch 17400, loss[loss=0.2677, simple_loss=0.3529, pruned_loss=0.09132, over 21626.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3221, pruned_loss=0.08423, over 4280033.44 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:18:53,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-24 13:18:57,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 6.083e+02 9.753e+02 1.322e+03 2.899e+03, threshold=1.951e+03, percent-clipped=8.0 2023-06-24 13:19:25,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1751172.0, ans=0.1 2023-06-24 13:19:28,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1751232.0, ans=0.125 2023-06-24 13:19:38,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751232.0, ans=0.1 2023-06-24 13:19:43,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1751232.0, ans=0.2 2023-06-24 13:19:59,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1751292.0, ans=0.0 2023-06-24 13:20:06,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751292.0, ans=0.1 2023-06-24 13:20:14,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1751352.0, ans=0.0 2023-06-24 13:20:26,378 INFO [train.py:996] (3/4) Epoch 10, batch 17450, loss[loss=0.2371, simple_loss=0.3142, pruned_loss=0.07998, over 20572.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3168, pruned_loss=0.08196, over 4272561.69 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:20:28,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1751412.0, ans=0.125 2023-06-24 13:20:38,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1751412.0, ans=0.0 2023-06-24 13:21:47,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1751652.0, ans=0.0 2023-06-24 13:21:51,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1751652.0, ans=15.0 2023-06-24 13:21:53,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751652.0, ans=0.1 2023-06-24 13:22:02,694 INFO [train.py:996] (3/4) Epoch 10, batch 17500, loss[loss=0.1725, simple_loss=0.2479, pruned_loss=0.04861, over 16550.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3121, pruned_loss=0.07978, over 4270437.11 frames. ], batch size: 60, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:22:16,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.907e+02 8.163e+02 1.225e+03 3.069e+03, threshold=1.633e+03, percent-clipped=7.0 2023-06-24 13:23:39,105 INFO [train.py:996] (3/4) Epoch 10, batch 17550, loss[loss=0.2134, simple_loss=0.3113, pruned_loss=0.0577, over 21826.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3126, pruned_loss=0.07864, over 4260596.92 frames. ], batch size: 316, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:23:47,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=15.0 2023-06-24 13:24:27,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752132.0, ans=0.1 2023-06-24 13:24:38,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1752192.0, ans=0.2 2023-06-24 13:25:07,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1752252.0, ans=0.125 2023-06-24 13:25:09,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1752312.0, ans=0.125 2023-06-24 13:25:10,556 INFO [train.py:996] (3/4) Epoch 10, batch 17600, loss[loss=0.2457, simple_loss=0.3235, pruned_loss=0.08396, over 21712.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3158, pruned_loss=0.07978, over 4264950.78 frames. ], batch size: 332, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:25:24,366 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.468e+02 8.117e+02 1.176e+03 4.887e+03, threshold=1.623e+03, percent-clipped=13.0 2023-06-24 13:25:51,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752372.0, ans=0.1 2023-06-24 13:26:39,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1752552.0, ans=0.125 2023-06-24 13:26:51,771 INFO [train.py:996] (3/4) Epoch 10, batch 17650, loss[loss=0.2354, simple_loss=0.3248, pruned_loss=0.07299, over 21254.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3129, pruned_loss=0.07905, over 4250229.67 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:26:56,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=12.0 2023-06-24 13:27:22,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-24 13:27:35,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1752732.0, ans=0.125 2023-06-24 13:27:35,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1752732.0, ans=0.0 2023-06-24 13:27:42,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1752732.0, ans=0.125 2023-06-24 13:27:42,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1752732.0, ans=0.125 2023-06-24 13:28:05,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-24 13:28:06,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1752792.0, ans=0.0 2023-06-24 13:28:33,523 INFO [train.py:996] (3/4) Epoch 10, batch 17700, loss[loss=0.2663, simple_loss=0.3453, pruned_loss=0.09363, over 21752.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3099, pruned_loss=0.07687, over 4256680.19 frames. 
], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:28:43,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1752912.0, ans=0.0 2023-06-24 13:28:47,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1752912.0, ans=0.125 2023-06-24 13:28:48,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 6.330e+02 1.013e+03 1.605e+03 3.260e+03, threshold=2.027e+03, percent-clipped=24.0 2023-06-24 13:29:48,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-24 13:30:16,871 INFO [train.py:996] (3/4) Epoch 10, batch 17750, loss[loss=0.2604, simple_loss=0.3347, pruned_loss=0.09307, over 21396.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3161, pruned_loss=0.08008, over 4260752.47 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:31:30,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753392.0, ans=0.125 2023-06-24 13:31:32,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1753392.0, ans=0.05 2023-06-24 13:31:51,452 INFO [train.py:996] (3/4) Epoch 10, batch 17800, loss[loss=0.2505, simple_loss=0.3404, pruned_loss=0.08027, over 21306.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3144, pruned_loss=0.07899, over 4263565.64 frames. ], batch size: 549, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:32:07,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.377e+02 6.218e+02 8.448e+02 1.392e+03 2.915e+03, threshold=1.690e+03, percent-clipped=12.0 2023-06-24 13:32:07,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1753512.0, ans=0.5 2023-06-24 13:32:51,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1753692.0, ans=0.125 2023-06-24 13:33:16,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1753752.0, ans=0.0 2023-06-24 13:33:19,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1753752.0, ans=0.2 2023-06-24 13:33:31,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753752.0, ans=0.125 2023-06-24 13:33:34,292 INFO [train.py:996] (3/4) Epoch 10, batch 17850, loss[loss=0.246, simple_loss=0.312, pruned_loss=0.09003, over 21243.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3151, pruned_loss=0.08016, over 4265013.56 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:33:46,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-06-24 13:34:02,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1753872.0, ans=0.125 2023-06-24 13:35:12,038 INFO [train.py:996] (3/4) Epoch 10, batch 17900, loss[loss=0.2356, simple_loss=0.3146, pruned_loss=0.07826, over 19972.00 frames. 
], tot_loss[loss=0.2432, simple_loss=0.3211, pruned_loss=0.08266, over 4272833.91 frames. ], batch size: 703, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:35:23,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 6.124e+02 9.329e+02 1.248e+03 3.216e+03, threshold=1.866e+03, percent-clipped=9.0 2023-06-24 13:35:36,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1754172.0, ans=0.125 2023-06-24 13:35:46,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1754172.0, ans=0.125 2023-06-24 13:36:16,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-24 13:36:50,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1754412.0, ans=0.05 2023-06-24 13:36:52,247 INFO [train.py:996] (3/4) Epoch 10, batch 17950, loss[loss=0.2131, simple_loss=0.3162, pruned_loss=0.05503, over 21674.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3224, pruned_loss=0.08027, over 4266884.49 frames. ], batch size: 414, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:37:09,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1754412.0, ans=0.1 2023-06-24 13:37:45,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1754532.0, ans=0.125 2023-06-24 13:38:02,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1754592.0, ans=0.0 2023-06-24 13:38:28,325 INFO [train.py:996] (3/4) Epoch 10, batch 18000, loss[loss=0.2722, simple_loss=0.3771, pruned_loss=0.08367, over 20769.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3168, pruned_loss=0.07889, over 4261068.86 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:38:28,326 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 13:38:47,231 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2575, simple_loss=0.3533, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-24 13:38:47,232 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 13:39:02,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 8.313e+02 1.378e+03 2.030e+03 3.547e+03, threshold=2.755e+03, percent-clipped=28.0 2023-06-24 13:39:34,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1754832.0, ans=0.125 2023-06-24 13:39:37,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1754832.0, ans=0.125 2023-06-24 13:40:07,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1754952.0, ans=0.125 2023-06-24 13:40:19,199 INFO [train.py:996] (3/4) Epoch 10, batch 18050, loss[loss=0.1936, simple_loss=0.2784, pruned_loss=0.05442, over 21707.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3109, pruned_loss=0.0782, over 4266085.12 frames. 
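
The batch 18000 block above shows the periodic validation pass: training pauses, the dev loss is recomputed over the same 1796401-frame validation set each time (so successive validation losses are directly comparable), and the peak CUDA memory is printed. The memory figure comes from PyTorch's standard allocator accounting; below is a sketch of the two reporting pieces, with the loss aggregation assumed to be frame-weighted:

import logging

import torch

def report_peak_memory(device: str = "cuda") -> None:
    # max_memory_allocated returns bytes since startup (or since the last
    # reset_peak_memory_stats call), hence "so far" in the log message.
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {mb}MB")

def frame_weighted_loss(batch_losses) -> float:
    """Aggregate (per-frame loss, num_frames) pairs into a single
    'over N frames' figure -- the frame weighting is an assumption."""
    total = sum(loss * n for loss, n in batch_losses)
    frames = sum(n for _, n in batch_losses)
    return total / frames

print(frame_weighted_loss([(0.25, 1000), (0.27, 3000)]))  # 0.265
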
], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:40:40,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1755012.0, ans=0.125 2023-06-24 13:40:53,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1755072.0, ans=0.2 2023-06-24 13:41:40,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1755252.0, ans=0.2 2023-06-24 13:41:42,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1755252.0, ans=0.0 2023-06-24 13:41:52,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1755252.0, ans=0.125 2023-06-24 13:42:02,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1755312.0, ans=0.125 2023-06-24 13:42:03,155 INFO [train.py:996] (3/4) Epoch 10, batch 18100, loss[loss=0.2413, simple_loss=0.3319, pruned_loss=0.07536, over 21622.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3142, pruned_loss=0.07986, over 4265624.08 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:42:19,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.442e+02 6.227e+02 8.455e+02 1.236e+03 2.629e+03, threshold=1.691e+03, percent-clipped=0.0 2023-06-24 13:42:58,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-24 13:43:03,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1755492.0, ans=0.125 2023-06-24 13:43:26,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755552.0, ans=0.1 2023-06-24 13:43:44,537 INFO [train.py:996] (3/4) Epoch 10, batch 18150, loss[loss=0.2751, simple_loss=0.3945, pruned_loss=0.0779, over 19750.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3166, pruned_loss=0.07889, over 4269343.55 frames. ], batch size: 702, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:44:31,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1755732.0, ans=0.125 2023-06-24 13:44:42,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755792.0, ans=0.1 2023-06-24 13:44:47,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-24 13:45:10,525 INFO [train.py:996] (3/4) Epoch 10, batch 18200, loss[loss=0.2369, simple_loss=0.2886, pruned_loss=0.09267, over 21333.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3098, pruned_loss=0.07845, over 4261935.83 frames. 
], batch size: 473, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:45:24,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1755912.0, ans=0.125 2023-06-24 13:45:27,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755912.0, ans=0.1 2023-06-24 13:45:30,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.816e+02 9.910e+02 1.570e+03 3.771e+03, threshold=1.982e+03, percent-clipped=24.0 2023-06-24 13:45:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1756032.0, ans=0.1 2023-06-24 13:46:15,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1756092.0, ans=0.125 2023-06-24 13:46:24,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1756152.0, ans=0.2 2023-06-24 13:46:41,619 INFO [train.py:996] (3/4) Epoch 10, batch 18250, loss[loss=0.2169, simple_loss=0.2965, pruned_loss=0.06866, over 21718.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3025, pruned_loss=0.0767, over 4270118.35 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:47:20,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1756332.0, ans=0.125 2023-06-24 13:47:29,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1756332.0, ans=0.125 2023-06-24 13:47:32,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1756332.0, ans=0.0 2023-06-24 13:47:44,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-24 13:47:48,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1756392.0, ans=0.04949747468305833 2023-06-24 13:48:12,410 INFO [train.py:996] (3/4) Epoch 10, batch 18300, loss[loss=0.2486, simple_loss=0.3129, pruned_loss=0.09209, over 21462.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3013, pruned_loss=0.07656, over 4278150.47 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:48:14,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=22.5 2023-06-24 13:48:23,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.079e+02 7.788e+02 1.352e+03 4.344e+03, threshold=1.558e+03, percent-clipped=12.0 2023-06-24 13:48:54,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756572.0, ans=0.1 2023-06-24 13:49:21,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1756692.0, ans=0.0 2023-06-24 13:49:23,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1756692.0, ans=22.5 2023-06-24 13:49:35,347 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:49:49,370 INFO [train.py:996] (3/4) Epoch 10, batch 18350, loss[loss=0.2395, simple_loss=0.3123, pruned_loss=0.0833, over 21468.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3067, pruned_loss=0.07594, over 4267608.65 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:50:24,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 13:50:36,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1756932.0, ans=0.2 2023-06-24 13:50:48,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1756932.0, ans=0.0 2023-06-24 13:50:51,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1756992.0, ans=0.0 2023-06-24 13:50:59,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1756992.0, ans=0.025 2023-06-24 13:51:27,720 INFO [train.py:996] (3/4) Epoch 10, batch 18400, loss[loss=0.1836, simple_loss=0.2579, pruned_loss=0.05465, over 21358.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3025, pruned_loss=0.07514, over 4270494.29 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:51:39,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1757112.0, ans=0.0 2023-06-24 13:51:43,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 6.378e+02 8.856e+02 1.210e+03 2.743e+03, threshold=1.771e+03, percent-clipped=10.0 2023-06-24 13:51:55,247 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:52:05,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1757232.0, ans=0.07 2023-06-24 13:52:13,963 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:52:59,692 INFO [train.py:996] (3/4) Epoch 10, batch 18450, loss[loss=0.1954, simple_loss=0.2934, pruned_loss=0.04866, over 21730.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3001, pruned_loss=0.07265, over 4271982.51 frames. 
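
The scaling.py:962 Whitening entries compare a per-module whiteness statistic against a limit (metric=X vs. limit=Y); the metric stays near 1.0 when a module's output channels are decorrelated with equal variance, and grows toward the channel count as the energy collapses into a few directions. A covariance-based statistic with exactly that behaviour is sketched below; the formula is an assumption chosen to reproduce the logged ranges, not necessarily the exact expression in scaling.py:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Channels are split into num_groups
    groups (mirroring the num_groups values logged here) and the metric
    is averaged over groups -- the grouping behaviour is assumed."""
    n, c = x.shape
    cpg = c // num_groups
    vals = []
    for g in range(num_groups):
        xg = x[:, g * cpg : (g + 1) * cpg]
        cov = xg.t() @ xg / n              # (cpg, cpg) channel covariance
        frob2 = (cov ** 2).sum()           # = sum of squared eigenvalues
        tr = torch.diagonal(cov).sum()     # = sum of eigenvalues
        vals.append(cpg * frob2 / (tr ** 2 + 1e-20))
    return torch.stack(vals).mean().item()

x = torch.randn(10000, 256)
print(whitening_metric(x))                         # ~1.0: white features
print(whitening_metric(x @ torch.ones(256, 256)))  # ~256: rank-1 features
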
], batch size: 415, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:53:42,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1757532.0, ans=0.2 2023-06-24 13:53:50,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1757532.0, ans=0.1 2023-06-24 13:54:14,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1757652.0, ans=0.0 2023-06-24 13:54:23,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1757652.0, ans=0.125 2023-06-24 13:54:35,861 INFO [train.py:996] (3/4) Epoch 10, batch 18500, loss[loss=0.1852, simple_loss=0.2613, pruned_loss=0.05455, over 21375.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2933, pruned_loss=0.07052, over 4265129.55 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:54:57,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.156e+02 5.768e+02 9.363e+02 1.391e+03 2.603e+03, threshold=1.873e+03, percent-clipped=9.0 2023-06-24 13:55:27,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1757832.0, ans=0.2 2023-06-24 13:55:37,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1757892.0, ans=0.125 2023-06-24 13:56:12,621 INFO [train.py:996] (3/4) Epoch 10, batch 18550, loss[loss=0.2354, simple_loss=0.2962, pruned_loss=0.08732, over 21320.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2921, pruned_loss=0.07024, over 4264941.32 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:56:19,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1758012.0, ans=0.125 2023-06-24 13:57:34,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0 2023-06-24 13:57:36,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1758252.0, ans=0.0 2023-06-24 13:57:49,278 INFO [train.py:996] (3/4) Epoch 10, batch 18600, loss[loss=0.197, simple_loss=0.2669, pruned_loss=0.06357, over 21204.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2899, pruned_loss=0.07046, over 4270252.27 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:57:50,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=22.5 2023-06-24 13:58:12,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 6.384e+02 9.662e+02 1.486e+03 4.666e+03, threshold=1.932e+03, percent-clipped=18.0 2023-06-24 13:58:25,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1758372.0, ans=0.125 2023-06-24 13:58:43,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-06-24 13:59:24,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1758612.0, ans=0.125 2023-06-24 13:59:25,722 INFO [train.py:996] (3/4) Epoch 10, batch 18650, loss[loss=0.2416, simple_loss=0.3267, pruned_loss=0.07822, over 21754.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2901, pruned_loss=0.07071, over 4266434.38 frames. ], batch size: 415, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:59:58,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1758672.0, ans=0.125 2023-06-24 14:00:13,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1758732.0, ans=0.2 2023-06-24 14:00:55,937 INFO [train.py:996] (3/4) Epoch 10, batch 18700, loss[loss=0.2413, simple_loss=0.3149, pruned_loss=0.08384, over 22038.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2891, pruned_loss=0.07276, over 4265041.72 frames. ], batch size: 113, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:01:04,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1758912.0, ans=0.05 2023-06-24 14:01:12,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.606e+02 6.882e+02 9.841e+02 1.662e+03 3.485e+03, threshold=1.968e+03, percent-clipped=16.0 2023-06-24 14:01:29,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.48 vs. limit=10.0 2023-06-24 14:02:03,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1759092.0, ans=0.125 2023-06-24 14:02:16,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-24 14:02:28,405 INFO [train.py:996] (3/4) Epoch 10, batch 18750, loss[loss=0.2637, simple_loss=0.3332, pruned_loss=0.09717, over 21193.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2907, pruned_loss=0.07456, over 4266357.86 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:03:36,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1759392.0, ans=0.0 2023-06-24 14:03:39,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1759392.0, ans=0.125 2023-06-24 14:03:47,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1759452.0, ans=0.125 2023-06-24 14:04:00,661 INFO [train.py:996] (3/4) Epoch 10, batch 18800, loss[loss=0.1808, simple_loss=0.2598, pruned_loss=0.05093, over 21469.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2972, pruned_loss=0.07488, over 4269824.40 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:04:09,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.08 vs. 
limit=22.5 2023-06-24 14:04:22,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 6.431e+02 9.700e+02 1.560e+03 3.348e+03, threshold=1.940e+03, percent-clipped=15.0 2023-06-24 14:05:09,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.14 vs. limit=10.0 2023-06-24 14:05:15,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759752.0, ans=0.1 2023-06-24 14:05:20,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 14:05:31,409 INFO [train.py:996] (3/4) Epoch 10, batch 18850, loss[loss=0.2177, simple_loss=0.2913, pruned_loss=0.07207, over 21692.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2934, pruned_loss=0.07034, over 4267757.18 frames. ], batch size: 333, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:06:38,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1759992.0, ans=0.1 2023-06-24 14:06:53,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-24 14:06:53,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-24 14:06:54,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-24 14:06:57,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1760052.0, ans=0.0 2023-06-24 14:07:07,594 INFO [train.py:996] (3/4) Epoch 10, batch 18900, loss[loss=0.256, simple_loss=0.31, pruned_loss=0.101, over 21778.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2898, pruned_loss=0.07061, over 4269934.21 frames. ], batch size: 316, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:07:24,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.602e+02 8.408e+02 1.056e+03 2.556e+03, threshold=1.682e+03, percent-clipped=3.0 2023-06-24 14:07:49,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-24 14:07:57,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1760232.0, ans=0.95 2023-06-24 14:08:05,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1760232.0, ans=0.0 2023-06-24 14:08:29,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760352.0, ans=0.1 2023-06-24 14:08:29,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-06-24 14:08:32,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760352.0, ans=0.1 2023-06-24 14:08:38,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1760352.0, ans=0.125 2023-06-24 14:08:40,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-24 14:08:44,278 INFO [train.py:996] (3/4) Epoch 10, batch 18950, loss[loss=0.2356, simple_loss=0.3427, pruned_loss=0.06422, over 21280.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2938, pruned_loss=0.07309, over 4267433.70 frames. ], batch size: 548, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:09:16,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760472.0, ans=0.1 2023-06-24 14:09:22,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1760472.0, ans=0.0 2023-06-24 14:10:26,135 INFO [train.py:996] (3/4) Epoch 10, batch 19000, loss[loss=0.2431, simple_loss=0.3246, pruned_loss=0.08079, over 21510.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3046, pruned_loss=0.07659, over 4275088.49 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:10:49,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 8.812e+02 1.283e+03 1.934e+03 4.893e+03, threshold=2.566e+03, percent-clipped=32.0 2023-06-24 14:10:51,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1760772.0, ans=0.125 2023-06-24 14:11:40,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760952.0, ans=0.1 2023-06-24 14:11:46,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1760952.0, ans=0.0 2023-06-24 14:11:46,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1760952.0, ans=0.125 2023-06-24 14:12:02,048 INFO [train.py:996] (3/4) Epoch 10, batch 19050, loss[loss=0.2086, simple_loss=0.285, pruned_loss=0.06607, over 21817.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3086, pruned_loss=0.08019, over 4275200.15 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:12:03,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-24 14:12:24,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1761012.0, ans=0.125 2023-06-24 14:12:38,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1761072.0, ans=0.09899494936611666 2023-06-24 14:12:54,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.76 vs. 
limit=6.0 2023-06-24 14:13:04,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761192.0, ans=0.1 2023-06-24 14:13:28,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-24 14:13:42,730 INFO [train.py:996] (3/4) Epoch 10, batch 19100, loss[loss=0.2006, simple_loss=0.2612, pruned_loss=0.06999, over 21319.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3061, pruned_loss=0.08056, over 4274914.70 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:13:44,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1761312.0, ans=0.125 2023-06-24 14:13:46,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-24 14:14:01,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.691e+02 6.507e+02 8.205e+02 1.177e+03 2.298e+03, threshold=1.641e+03, percent-clipped=0.0 2023-06-24 14:14:54,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1761492.0, ans=0.125 2023-06-24 14:15:03,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1761552.0, ans=0.125 2023-06-24 14:15:25,864 INFO [train.py:996] (3/4) Epoch 10, batch 19150, loss[loss=0.2465, simple_loss=0.3403, pruned_loss=0.07633, over 21704.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3082, pruned_loss=0.08059, over 4265350.94 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:16:05,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1761732.0, ans=0.125 2023-06-24 14:16:13,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1761732.0, ans=0.125 2023-06-24 14:16:27,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1761792.0, ans=0.125 2023-06-24 14:16:51,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-24 14:17:06,218 INFO [train.py:996] (3/4) Epoch 10, batch 19200, loss[loss=0.2354, simple_loss=0.3374, pruned_loss=0.06668, over 21573.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3177, pruned_loss=0.08049, over 4260752.27 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:17:21,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.091e+02 1.026e+03 1.602e+03 3.229e+03, threshold=2.053e+03, percent-clipped=24.0 2023-06-24 14:17:25,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1761972.0, ans=0.2 2023-06-24 14:18:21,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1762092.0, ans=0.0 2023-06-24 14:18:44,960 INFO [train.py:996] (3/4) Epoch 10, batch 19250, loss[loss=0.191, simple_loss=0.2859, pruned_loss=0.04801, over 21787.00 frames. 
], tot_loss[loss=0.2349, simple_loss=0.3177, pruned_loss=0.07603, over 4262050.29 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:18:49,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1762212.0, ans=0.0 2023-06-24 14:20:20,861 INFO [train.py:996] (3/4) Epoch 10, batch 19300, loss[loss=0.2419, simple_loss=0.3131, pruned_loss=0.08533, over 21864.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3148, pruned_loss=0.07627, over 4256097.57 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:20:36,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.942e+02 6.247e+02 8.864e+02 1.177e+03 3.202e+03, threshold=1.773e+03, percent-clipped=6.0 2023-06-24 14:20:58,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1762632.0, ans=0.025 2023-06-24 14:21:41,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.83 vs. limit=5.0 2023-06-24 14:21:46,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1762752.0, ans=10.0 2023-06-24 14:21:47,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1762752.0, ans=0.0 2023-06-24 14:21:57,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1762752.0, ans=0.0 2023-06-24 14:22:00,310 INFO [train.py:996] (3/4) Epoch 10, batch 19350, loss[loss=0.1796, simple_loss=0.2618, pruned_loss=0.04868, over 21544.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.308, pruned_loss=0.07284, over 4261405.10 frames. ], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:22:19,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=15.0 2023-06-24 14:22:21,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1762872.0, ans=0.0 2023-06-24 14:22:43,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1762932.0, ans=0.2 2023-06-24 14:23:36,211 INFO [train.py:996] (3/4) Epoch 10, batch 19400, loss[loss=0.1873, simple_loss=0.2709, pruned_loss=0.05183, over 21797.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3058, pruned_loss=0.07194, over 4264262.32 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:23:56,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1763172.0, ans=0.0 2023-06-24 14:23:58,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.962e+02 7.006e+02 1.136e+03 1.736e+03 4.231e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 14:24:26,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-24 14:24:49,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.14 vs. 
limit=15.0 2023-06-24 14:25:11,446 INFO [train.py:996] (3/4) Epoch 10, batch 19450, loss[loss=0.2276, simple_loss=0.285, pruned_loss=0.0851, over 21639.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3033, pruned_loss=0.07308, over 4276654.80 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:25:21,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1763412.0, ans=0.125 2023-06-24 14:26:09,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1763532.0, ans=0.125 2023-06-24 14:26:48,803 INFO [train.py:996] (3/4) Epoch 10, batch 19500, loss[loss=0.1967, simple_loss=0.2562, pruned_loss=0.0686, over 21813.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3014, pruned_loss=0.07506, over 4274952.56 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:27:11,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 6.380e+02 1.047e+03 1.511e+03 3.799e+03, threshold=2.095e+03, percent-clipped=7.0 2023-06-24 14:27:25,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1763772.0, ans=0.0 2023-06-24 14:27:36,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763832.0, ans=0.1 2023-06-24 14:27:38,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-24 14:28:01,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1763892.0, ans=0.0 2023-06-24 14:28:25,358 INFO [train.py:996] (3/4) Epoch 10, batch 19550, loss[loss=0.1984, simple_loss=0.2958, pruned_loss=0.05046, over 21464.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2967, pruned_loss=0.07302, over 4279576.46 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:28:30,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-24 14:29:15,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-24 14:29:41,592 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:29:46,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1764252.0, ans=0.04949747468305833 2023-06-24 14:29:53,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764252.0, ans=0.1 2023-06-24 14:30:01,056 INFO [train.py:996] (3/4) Epoch 10, batch 19600, loss[loss=0.2848, simple_loss=0.3402, pruned_loss=0.1147, over 21563.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2993, pruned_loss=0.07434, over 4287682.71 frames. 
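
Note how slowly the learning rate moves through this stretch: it sits at 2.92e-03 for thousands of batches and has only reached 2.91e-03 by batch 19300. That is consistent with a schedule decaying as a small inverse power of both the batch index and the epoch, for instance the inverse-fourth-root form sketched below (the constants are placeholder assumptions, not the recipe's actual scheduler settings):

def scheduled_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    # Each factor is ~1 early on and decays like t ** -0.5 once t is far
    # past its time constant, giving a very flat late-training schedule.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(scheduled_lr(0.045, 400_000, 10.0))  # ~2.4e-03, the same order as above

With both factors deep in their power-law regime, moving tens of thousands of batches changes the LR by only a few percent, matching the 2.93e-03 to 2.91e-03 drift over this whole section.
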
], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:30:28,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.111e+02 6.531e+02 1.025e+03 1.412e+03 3.718e+03, threshold=2.049e+03, percent-clipped=12.0 2023-06-24 14:30:31,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1764372.0, ans=0.0 2023-06-24 14:30:48,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1764432.0, ans=0.125 2023-06-24 14:30:48,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1764432.0, ans=0.125 2023-06-24 14:31:08,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1764492.0, ans=0.2 2023-06-24 14:31:38,431 INFO [train.py:996] (3/4) Epoch 10, batch 19650, loss[loss=0.2278, simple_loss=0.3107, pruned_loss=0.07246, over 21419.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3031, pruned_loss=0.07832, over 4289387.57 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:32:03,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1764672.0, ans=0.125 2023-06-24 14:32:24,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1764732.0, ans=0.0 2023-06-24 14:33:13,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1764852.0, ans=22.5 2023-06-24 14:33:23,792 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:33:27,991 INFO [train.py:996] (3/4) Epoch 10, batch 19700, loss[loss=0.2122, simple_loss=0.315, pruned_loss=0.05473, over 20841.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3074, pruned_loss=0.07931, over 4283907.98 frames. ], batch size: 609, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:33:31,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1764912.0, ans=0.125 2023-06-24 14:33:44,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1764912.0, ans=0.0 2023-06-24 14:33:54,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 9.383e+02 1.272e+03 2.018e+03 4.455e+03, threshold=2.544e+03, percent-clipped=24.0 2023-06-24 14:34:00,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1764972.0, ans=0.0 2023-06-24 14:34:10,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-24 14:34:29,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=12.0 2023-06-24 14:35:08,166 INFO [train.py:996] (3/4) Epoch 10, batch 19750, loss[loss=0.3643, simple_loss=0.4421, pruned_loss=0.1433, over 21497.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3164, pruned_loss=0.08069, over 4274632.35 frames. 
], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:35:18,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1765212.0, ans=0.125 2023-06-24 14:35:24,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1765212.0, ans=0.2 2023-06-24 14:35:44,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-24 14:36:01,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1765332.0, ans=0.125 2023-06-24 14:36:49,173 INFO [train.py:996] (3/4) Epoch 10, batch 19800, loss[loss=0.2193, simple_loss=0.3008, pruned_loss=0.06893, over 21788.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3159, pruned_loss=0.08062, over 4271787.24 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:37:11,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.026e+02 9.250e+02 1.586e+03 2.402e+03 4.902e+03, threshold=3.172e+03, percent-clipped=21.0 2023-06-24 14:37:48,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1765692.0, ans=0.0 2023-06-24 14:38:13,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1765752.0, ans=0.0 2023-06-24 14:38:27,738 INFO [train.py:996] (3/4) Epoch 10, batch 19850, loss[loss=0.2217, simple_loss=0.3172, pruned_loss=0.06307, over 21716.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3081, pruned_loss=0.07641, over 4271631.36 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:38:54,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1765872.0, ans=0.125 2023-06-24 14:39:24,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1765992.0, ans=0.0 2023-06-24 14:39:25,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765992.0, ans=0.1 2023-06-24 14:39:37,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1766052.0, ans=0.09899494936611666 2023-06-24 14:40:03,674 INFO [train.py:996] (3/4) Epoch 10, batch 19900, loss[loss=0.1935, simple_loss=0.2982, pruned_loss=0.04433, over 21597.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.307, pruned_loss=0.07387, over 4262682.15 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:40:20,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.106e+02 8.584e+02 1.585e+03 3.373e+03, threshold=1.717e+03, percent-clipped=1.0 2023-06-24 14:40:24,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1766172.0, ans=0.125 2023-06-24 14:40:40,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.26 vs. 
limit=15.0 2023-06-24 14:40:53,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1766232.0, ans=0.125 2023-06-24 14:40:59,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1766292.0, ans=0.0 2023-06-24 14:41:04,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-24 14:41:09,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1766352.0, ans=0.0 2023-06-24 14:41:29,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1766352.0, ans=0.125 2023-06-24 14:41:36,296 INFO [train.py:996] (3/4) Epoch 10, batch 19950, loss[loss=0.2065, simple_loss=0.2786, pruned_loss=0.06718, over 21757.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3016, pruned_loss=0.07339, over 4257146.30 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:41:39,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1766412.0, ans=0.0 2023-06-24 14:42:03,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1766472.0, ans=0.0 2023-06-24 14:42:18,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1766532.0, ans=0.04949747468305833 2023-06-24 14:42:29,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1766532.0, ans=0.125 2023-06-24 14:42:33,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1766592.0, ans=0.0 2023-06-24 14:42:38,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1766592.0, ans=0.0 2023-06-24 14:43:12,263 INFO [train.py:996] (3/4) Epoch 10, batch 20000, loss[loss=0.2078, simple_loss=0.2847, pruned_loss=0.06544, over 21324.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3031, pruned_loss=0.07381, over 4260599.41 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:43:14,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1766712.0, ans=0.0 2023-06-24 14:43:29,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.593e+02 7.271e+02 1.092e+03 1.631e+03 3.154e+03, threshold=2.184e+03, percent-clipped=18.0 2023-06-24 14:44:47,354 INFO [train.py:996] (3/4) Epoch 10, batch 20050, loss[loss=0.2001, simple_loss=0.2521, pruned_loss=0.07406, over 20273.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3051, pruned_loss=0.07639, over 4273890.66 frames. 
], batch size: 703, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:45:14,314 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:45:26,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1767132.0, ans=0.0 2023-06-24 14:46:17,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1767252.0, ans=0.0 2023-06-24 14:46:26,755 INFO [train.py:996] (3/4) Epoch 10, batch 20100, loss[loss=0.2537, simple_loss=0.348, pruned_loss=0.07973, over 21846.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3075, pruned_loss=0.07801, over 4278695.10 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:46:51,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.584e+02 6.086e+02 7.806e+02 1.176e+03 2.985e+03, threshold=1.561e+03, percent-clipped=5.0 2023-06-24 14:47:45,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1767552.0, ans=0.125 2023-06-24 14:47:59,998 INFO [train.py:996] (3/4) Epoch 10, batch 20150, loss[loss=0.2532, simple_loss=0.3309, pruned_loss=0.08779, over 21465.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3133, pruned_loss=0.08039, over 4273118.71 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:48:31,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-24 14:49:07,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-24 14:49:49,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-24 14:49:51,304 INFO [train.py:996] (3/4) Epoch 10, batch 20200, loss[loss=0.2195, simple_loss=0.298, pruned_loss=0.07054, over 21699.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3203, pruned_loss=0.08227, over 4271973.66 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:50:10,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.305e+02 1.166e+03 1.860e+03 3.941e+03, threshold=2.331e+03, percent-clipped=33.0 2023-06-24 14:50:17,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1767972.0, ans=0.125 2023-06-24 14:50:39,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1768032.0, ans=0.2 2023-06-24 14:51:06,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1768152.0, ans=0.0 2023-06-24 14:51:29,550 INFO [train.py:996] (3/4) Epoch 10, batch 20250, loss[loss=0.248, simple_loss=0.3165, pruned_loss=0.08972, over 21863.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3206, pruned_loss=0.08096, over 4276436.35 frames. 
], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:52:17,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1768332.0, ans=0.0 2023-06-24 14:52:30,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1768392.0, ans=0.025 2023-06-24 14:53:05,405 INFO [train.py:996] (3/4) Epoch 10, batch 20300, loss[loss=0.2104, simple_loss=0.2961, pruned_loss=0.06234, over 21450.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3207, pruned_loss=0.07911, over 4260405.11 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:53:08,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1768512.0, ans=0.0 2023-06-24 14:53:11,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1768512.0, ans=0.125 2023-06-24 14:53:28,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 6.165e+02 8.569e+02 1.423e+03 2.886e+03, threshold=1.714e+03, percent-clipped=5.0 2023-06-24 14:54:30,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1768752.0, ans=0.1 2023-06-24 14:54:40,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.90 vs. limit=15.0 2023-06-24 14:54:41,362 INFO [train.py:996] (3/4) Epoch 10, batch 20350, loss[loss=0.2026, simple_loss=0.2788, pruned_loss=0.06315, over 15978.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3198, pruned_loss=0.07936, over 4252668.73 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:54:43,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768812.0, ans=0.1 2023-06-24 14:54:47,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-24 14:55:01,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1768872.0, ans=0.2 2023-06-24 14:55:17,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-24 14:55:27,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1768932.0, ans=0.015 2023-06-24 14:55:46,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1768992.0, ans=0.125 2023-06-24 14:56:19,295 INFO [train.py:996] (3/4) Epoch 10, batch 20400, loss[loss=0.2901, simple_loss=0.3556, pruned_loss=0.1123, over 21400.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.323, pruned_loss=0.08295, over 4260752.26 frames. 
], batch size: 131, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:56:21,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1769112.0, ans=0.0 2023-06-24 14:56:26,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1769112.0, ans=0.04949747468305833 2023-06-24 14:56:42,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 7.746e+02 1.148e+03 1.668e+03 3.679e+03, threshold=2.297e+03, percent-clipped=22.0 2023-06-24 14:56:44,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1769172.0, ans=0.125 2023-06-24 14:56:53,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1769172.0, ans=0.2 2023-06-24 14:57:09,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769232.0, ans=0.125 2023-06-24 14:57:10,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-24 14:57:51,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-24 14:57:51,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-24 14:57:55,667 INFO [train.py:996] (3/4) Epoch 10, batch 20450, loss[loss=0.2397, simple_loss=0.3141, pruned_loss=0.08261, over 21850.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3258, pruned_loss=0.08629, over 4260913.64 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:57:57,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1769412.0, ans=0.015 2023-06-24 14:58:11,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1769412.0, ans=0.0 2023-06-24 14:58:59,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1769592.0, ans=0.0 2023-06-24 14:59:20,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1769652.0, ans=0.125 2023-06-24 14:59:23,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1769652.0, ans=0.0 2023-06-24 14:59:32,110 INFO [train.py:996] (3/4) Epoch 10, batch 20500, loss[loss=0.2179, simple_loss=0.2784, pruned_loss=0.07872, over 21808.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3212, pruned_loss=0.08606, over 4264777.60 frames. 
], batch size: 107, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:59:32,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1769712.0, ans=0.125 2023-06-24 14:59:52,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1769772.0, ans=0.2 2023-06-24 15:00:01,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 6.913e+02 8.863e+02 1.328e+03 2.262e+03, threshold=1.773e+03, percent-clipped=0.0 2023-06-24 15:00:05,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1769772.0, ans=0.0 2023-06-24 15:01:08,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1770012.0, ans=0.0 2023-06-24 15:01:09,470 INFO [train.py:996] (3/4) Epoch 10, batch 20550, loss[loss=0.1932, simple_loss=0.274, pruned_loss=0.05613, over 21315.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3128, pruned_loss=0.08376, over 4260109.45 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:01:09,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1770012.0, ans=0.125 2023-06-24 15:02:05,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770192.0, ans=0.1 2023-06-24 15:02:08,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1770192.0, ans=0.125 2023-06-24 15:02:21,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1770192.0, ans=0.0 2023-06-24 15:02:46,090 INFO [train.py:996] (3/4) Epoch 10, batch 20600, loss[loss=0.2549, simple_loss=0.3731, pruned_loss=0.06832, over 20040.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3154, pruned_loss=0.08345, over 4254013.32 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:02:59,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1770312.0, ans=0.125 2023-06-24 15:03:15,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 6.737e+02 1.120e+03 2.042e+03 4.837e+03, threshold=2.240e+03, percent-clipped=29.0 2023-06-24 15:03:33,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1770432.0, ans=0.125 2023-06-24 15:04:21,167 INFO [train.py:996] (3/4) Epoch 10, batch 20650, loss[loss=0.2587, simple_loss=0.3524, pruned_loss=0.08249, over 17040.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3107, pruned_loss=0.08275, over 4245280.94 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:05:52,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1770852.0, ans=0.125 2023-06-24 15:05:58,598 INFO [train.py:996] (3/4) Epoch 10, batch 20700, loss[loss=0.3127, simple_loss=0.3741, pruned_loss=0.1257, over 21400.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3055, pruned_loss=0.07992, over 4251234.84 frames. 
], batch size: 507, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:06:23,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 6.389e+02 9.253e+02 1.399e+03 2.647e+03, threshold=1.851e+03, percent-clipped=4.0 2023-06-24 15:07:42,268 INFO [train.py:996] (3/4) Epoch 10, batch 20750, loss[loss=0.2684, simple_loss=0.3621, pruned_loss=0.08733, over 21756.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.308, pruned_loss=0.07982, over 4247393.17 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:07:45,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1771212.0, ans=0.125 2023-06-24 15:08:46,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1771392.0, ans=0.125 2023-06-24 15:08:50,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1771392.0, ans=0.125 2023-06-24 15:09:20,730 INFO [train.py:996] (3/4) Epoch 10, batch 20800, loss[loss=0.2042, simple_loss=0.2786, pruned_loss=0.0649, over 21623.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3106, pruned_loss=0.07947, over 4244556.70 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 15:09:40,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-24 15:09:47,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.552e+02 1.010e+03 1.567e+03 2.337e+03 4.966e+03, threshold=3.135e+03, percent-clipped=39.0 2023-06-24 15:09:56,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-24 15:10:57,046 INFO [train.py:996] (3/4) Epoch 10, batch 20850, loss[loss=0.2525, simple_loss=0.3152, pruned_loss=0.09493, over 21715.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3052, pruned_loss=0.07799, over 4244892.63 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:11:14,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1771812.0, ans=0.2 2023-06-24 15:11:14,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1771812.0, ans=0.2 2023-06-24 15:11:21,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1771872.0, ans=0.0 2023-06-24 15:12:33,601 INFO [train.py:996] (3/4) Epoch 10, batch 20900, loss[loss=0.222, simple_loss=0.2887, pruned_loss=0.07764, over 21812.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.304, pruned_loss=0.07788, over 4255449.56 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:12:59,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 6.808e+02 1.169e+03 1.577e+03 3.825e+03, threshold=2.338e+03, percent-clipped=3.0 2023-06-24 15:13:28,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=22.5 2023-06-24 15:13:40,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1772292.0, ans=0.04949747468305833 2023-06-24 15:13:53,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.47 vs. limit=15.0 2023-06-24 15:14:02,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1772352.0, ans=0.0 2023-06-24 15:14:09,061 INFO [train.py:996] (3/4) Epoch 10, batch 20950, loss[loss=0.1933, simple_loss=0.2789, pruned_loss=0.05383, over 21771.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2983, pruned_loss=0.07432, over 4257441.28 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:14:09,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1772412.0, ans=0.0 2023-06-24 15:15:44,302 INFO [train.py:996] (3/4) Epoch 10, batch 21000, loss[loss=0.2347, simple_loss=0.3451, pruned_loss=0.06215, over 19799.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2964, pruned_loss=0.07422, over 4252029.90 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:15:44,303 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 15:16:03,223 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2634, simple_loss=0.3598, pruned_loss=0.08347, over 1796401.00 frames. 2023-06-24 15:16:03,223 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 15:16:20,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1772772.0, ans=0.125 2023-06-24 15:16:24,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.763e+02 8.645e+02 1.170e+03 2.024e+03, threshold=1.729e+03, percent-clipped=0.0 2023-06-24 15:17:33,514 INFO [train.py:996] (3/4) Epoch 10, batch 21050, loss[loss=0.2353, simple_loss=0.2913, pruned_loss=0.08963, over 21447.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2955, pruned_loss=0.0746, over 4245024.92 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:17:39,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1773012.0, ans=0.125 2023-06-24 15:17:42,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1773012.0, ans=0.125 2023-06-24 15:17:43,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-24 15:17:56,870 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:18:16,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1773132.0, ans=0.0 2023-06-24 15:19:08,746 INFO [train.py:996] (3/4) Epoch 10, batch 21100, loss[loss=0.2255, simple_loss=0.2724, pruned_loss=0.0893, over 20212.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2927, pruned_loss=0.07503, over 4244007.07 frames. 
], batch size: 703, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:19:36,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.103e+02 5.905e+02 7.931e+02 1.116e+03 2.788e+03, threshold=1.586e+03, percent-clipped=2.0 2023-06-24 15:20:00,528 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:20:14,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773492.0, ans=0.1 2023-06-24 15:20:20,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1773492.0, ans=0.0 2023-06-24 15:20:24,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-24 15:20:45,195 INFO [train.py:996] (3/4) Epoch 10, batch 21150, loss[loss=0.2081, simple_loss=0.2677, pruned_loss=0.07425, over 21776.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2892, pruned_loss=0.07608, over 4241781.41 frames. ], batch size: 352, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:21:04,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1773672.0, ans=0.125 2023-06-24 15:21:17,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1773672.0, ans=0.0 2023-06-24 15:21:29,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1773732.0, ans=0.125 2023-06-24 15:21:57,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1773792.0, ans=0.2 2023-06-24 15:22:14,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1773852.0, ans=0.125 2023-06-24 15:22:21,486 INFO [train.py:996] (3/4) Epoch 10, batch 21200, loss[loss=0.1997, simple_loss=0.2708, pruned_loss=0.06423, over 21747.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2857, pruned_loss=0.0747, over 4253961.96 frames. 
], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:22:23,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1773912.0, ans=0.125 2023-06-24 15:22:49,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 6.353e+02 8.503e+02 1.111e+03 2.488e+03, threshold=1.701e+03, percent-clipped=3.0 2023-06-24 15:23:12,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1774032.0, ans=0.125 2023-06-24 15:23:29,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1774092.0, ans=0.0 2023-06-24 15:23:33,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1774092.0, ans=0.125 2023-06-24 15:23:44,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1774152.0, ans=0.0 2023-06-24 15:23:44,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1774152.0, ans=0.125 2023-06-24 15:23:46,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 15:23:57,718 INFO [train.py:996] (3/4) Epoch 10, batch 21250, loss[loss=0.2106, simple_loss=0.2829, pruned_loss=0.06913, over 21827.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2846, pruned_loss=0.07497, over 4254362.09 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:23:58,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1774212.0, ans=0.125 2023-06-24 15:24:29,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-06-24 15:24:58,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1774392.0, ans=0.025 2023-06-24 15:25:07,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1774392.0, ans=0.0 2023-06-24 15:25:24,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1774452.0, ans=0.0 2023-06-24 15:25:28,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1774452.0, ans=0.125 2023-06-24 15:25:32,738 INFO [train.py:996] (3/4) Epoch 10, batch 21300, loss[loss=0.2547, simple_loss=0.3223, pruned_loss=0.09357, over 21782.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2897, pruned_loss=0.07618, over 4262296.38 frames. 
], batch size: 441, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:25:55,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1774572.0, ans=0.5 2023-06-24 15:26:02,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 7.124e+02 9.830e+02 1.357e+03 3.184e+03, threshold=1.966e+03, percent-clipped=15.0 2023-06-24 15:26:23,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1774632.0, ans=0.2 2023-06-24 15:26:34,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1774692.0, ans=0.125 2023-06-24 15:26:42,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1774692.0, ans=0.0 2023-06-24 15:27:10,393 INFO [train.py:996] (3/4) Epoch 10, batch 21350, loss[loss=0.2442, simple_loss=0.3304, pruned_loss=0.07902, over 21601.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2954, pruned_loss=0.07812, over 4273589.08 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:27:31,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1774872.0, ans=0.2 2023-06-24 15:27:34,666 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:27:37,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1774872.0, ans=0.035 2023-06-24 15:28:36,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-24 15:28:48,138 INFO [train.py:996] (3/4) Epoch 10, batch 21400, loss[loss=0.2863, simple_loss=0.3647, pruned_loss=0.1039, over 21808.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3, pruned_loss=0.07811, over 4274486.40 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:29:23,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 5.834e+02 7.962e+02 1.308e+03 2.363e+03, threshold=1.592e+03, percent-clipped=6.0 2023-06-24 15:29:42,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1775232.0, ans=0.125 2023-06-24 15:30:24,974 INFO [train.py:996] (3/4) Epoch 10, batch 21450, loss[loss=0.2358, simple_loss=0.3102, pruned_loss=0.0807, over 21940.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3044, pruned_loss=0.07984, over 4273904.32 frames. ], batch size: 333, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:30:50,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1775472.0, ans=0.125 2023-06-24 15:32:06,977 INFO [train.py:996] (3/4) Epoch 10, batch 21500, loss[loss=0.2341, simple_loss=0.2864, pruned_loss=0.09092, over 21431.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.302, pruned_loss=0.08075, over 4273996.29 frames. 
], batch size: 441, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:32:09,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775712.0, ans=0.1 2023-06-24 15:32:36,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.584e+02 1.027e+03 1.446e+03 3.225e+03, threshold=2.054e+03, percent-clipped=19.0 2023-06-24 15:32:45,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775832.0, ans=0.1 2023-06-24 15:32:56,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1775832.0, ans=0.2 2023-06-24 15:33:07,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1775892.0, ans=0.015 2023-06-24 15:33:08,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1775892.0, ans=0.125 2023-06-24 15:33:20,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1775952.0, ans=0.2 2023-06-24 15:33:25,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1775952.0, ans=0.2 2023-06-24 15:33:45,356 INFO [train.py:996] (3/4) Epoch 10, batch 21550, loss[loss=0.219, simple_loss=0.2779, pruned_loss=0.08, over 21329.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2955, pruned_loss=0.07755, over 4263668.30 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:34:50,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1776192.0, ans=0.0 2023-06-24 15:35:18,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1776252.0, ans=0.125 2023-06-24 15:35:19,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1776252.0, ans=0.125 2023-06-24 15:35:24,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-24 15:35:25,096 INFO [train.py:996] (3/4) Epoch 10, batch 21600, loss[loss=0.239, simple_loss=0.3227, pruned_loss=0.07766, over 21542.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2942, pruned_loss=0.07686, over 4256967.56 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:35:33,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1776312.0, ans=0.125 2023-06-24 15:36:01,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 8.052e+02 1.212e+03 1.997e+03 4.912e+03, threshold=2.424e+03, percent-clipped=21.0 2023-06-24 15:36:13,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. 
limit=15.0 2023-06-24 15:36:25,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1776492.0, ans=15.0 2023-06-24 15:36:38,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1776492.0, ans=0.0 2023-06-24 15:36:54,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1776552.0, ans=0.125 2023-06-24 15:37:01,694 INFO [train.py:996] (3/4) Epoch 10, batch 21650, loss[loss=0.2026, simple_loss=0.2648, pruned_loss=0.07026, over 21106.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2952, pruned_loss=0.07446, over 4253973.10 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:37:01,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1776612.0, ans=0.125 2023-06-24 15:37:22,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1776672.0, ans=0.2 2023-06-24 15:37:38,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-06-24 15:37:42,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1776732.0, ans=0.95 2023-06-24 15:38:37,593 INFO [train.py:996] (3/4) Epoch 10, batch 21700, loss[loss=0.1973, simple_loss=0.2717, pruned_loss=0.06145, over 21542.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2963, pruned_loss=0.07308, over 4263204.70 frames. ], batch size: 195, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:39:12,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.660e+02 9.555e+02 1.550e+03 3.491e+03, threshold=1.911e+03, percent-clipped=7.0 2023-06-24 15:39:40,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1777092.0, ans=0.125 2023-06-24 15:39:45,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1777092.0, ans=0.125 2023-06-24 15:39:47,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1777092.0, ans=0.125 2023-06-24 15:39:50,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1777092.0, ans=0.07 2023-06-24 15:39:50,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1777092.0, ans=0.04949747468305833 2023-06-24 15:40:13,238 INFO [train.py:996] (3/4) Epoch 10, batch 21750, loss[loss=0.2238, simple_loss=0.2815, pruned_loss=0.08306, over 21648.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2922, pruned_loss=0.07312, over 4268347.43 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:40:19,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1777212.0, ans=0.2 2023-06-24 15:41:50,253 INFO [train.py:996] (3/4) Epoch 10, batch 21800, loss[loss=0.2407, simple_loss=0.3029, pruned_loss=0.08927, over 21593.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2905, pruned_loss=0.07356, over 4255876.52 frames. 
], batch size: 247, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:42:00,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1777512.0, ans=0.125 2023-06-24 15:42:25,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 6.460e+02 8.682e+02 1.144e+03 2.406e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 15:42:50,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777692.0, ans=0.1 2023-06-24 15:42:53,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777692.0, ans=0.1 2023-06-24 15:43:04,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1777752.0, ans=0.2 2023-06-24 15:43:14,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1777752.0, ans=10.0 2023-06-24 15:43:25,426 INFO [train.py:996] (3/4) Epoch 10, batch 21850, loss[loss=0.2626, simple_loss=0.3361, pruned_loss=0.09452, over 21617.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.297, pruned_loss=0.07408, over 4241194.64 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:43:53,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1777872.0, ans=0.0 2023-06-24 15:44:02,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 15:44:27,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1777992.0, ans=0.125 2023-06-24 15:44:29,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1777992.0, ans=0.04949747468305833 2023-06-24 15:44:32,177 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:44:48,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1778052.0, ans=0.2 2023-06-24 15:45:05,274 INFO [train.py:996] (3/4) Epoch 10, batch 21900, loss[loss=0.2516, simple_loss=0.2885, pruned_loss=0.1074, over 21387.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2964, pruned_loss=0.07519, over 4258532.73 frames. ], batch size: 508, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:45:36,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 8.112e+02 1.126e+03 1.862e+03 4.122e+03, threshold=2.252e+03, percent-clipped=27.0 2023-06-24 15:45:47,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1778232.0, ans=0.125 2023-06-24 15:46:42,328 INFO [train.py:996] (3/4) Epoch 10, batch 21950, loss[loss=0.1785, simple_loss=0.2456, pruned_loss=0.0557, over 21234.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2907, pruned_loss=0.07361, over 4254423.84 frames. 
], batch size: 144, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:46:42,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1778412.0, ans=0.125 2023-06-24 15:47:07,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-24 15:47:49,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778592.0, ans=0.1 2023-06-24 15:48:18,358 INFO [train.py:996] (3/4) Epoch 10, batch 22000, loss[loss=0.2652, simple_loss=0.3851, pruned_loss=0.07268, over 19871.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2864, pruned_loss=0.0713, over 4254141.93 frames. ], batch size: 702, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:48:54,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 5.065e+02 6.961e+02 1.085e+03 3.109e+03, threshold=1.392e+03, percent-clipped=2.0 2023-06-24 15:49:35,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1778952.0, ans=0.0 2023-06-24 15:49:48,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1778952.0, ans=0.0 2023-06-24 15:50:02,058 INFO [train.py:996] (3/4) Epoch 10, batch 22050, loss[loss=0.2351, simple_loss=0.324, pruned_loss=0.07305, over 21757.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2909, pruned_loss=0.07272, over 4229860.39 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:50:04,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1779012.0, ans=0.1 2023-06-24 15:50:29,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1779072.0, ans=0.125 2023-06-24 15:50:45,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1779132.0, ans=0.95 2023-06-24 15:51:01,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1779192.0, ans=10.0 2023-06-24 15:51:38,356 INFO [train.py:996] (3/4) Epoch 10, batch 22100, loss[loss=0.294, simple_loss=0.3596, pruned_loss=0.1142, over 21273.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3014, pruned_loss=0.07744, over 4236186.35 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:51:49,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1779312.0, ans=0.125 2023-06-24 15:52:09,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.723e+02 7.259e+02 1.034e+03 1.568e+03 3.837e+03, threshold=2.069e+03, percent-clipped=34.0 2023-06-24 15:52:13,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1779432.0, ans=0.125 2023-06-24 15:52:17,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1779432.0, ans=0.125 2023-06-24 15:53:08,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=10.0 2023-06-24 15:53:16,263 INFO [train.py:996] (3/4) Epoch 10, batch 22150, loss[loss=0.2362, simple_loss=0.3007, pruned_loss=0.08584, over 21533.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3044, pruned_loss=0.07879, over 4249212.11 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:54:16,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1779792.0, ans=0.0 2023-06-24 15:54:33,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-24 15:54:54,164 INFO [train.py:996] (3/4) Epoch 10, batch 22200, loss[loss=0.2542, simple_loss=0.3322, pruned_loss=0.08812, over 21854.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3071, pruned_loss=0.07999, over 4257217.85 frames. ], batch size: 124, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:54:57,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1779912.0, ans=0.125 2023-06-24 15:55:07,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-24 15:55:11,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.25 vs. limit=22.5 2023-06-24 15:55:24,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.933e+02 1.129e+03 1.517e+03 2.505e+03, threshold=2.259e+03, percent-clipped=10.0 2023-06-24 15:55:26,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1779972.0, ans=0.2 2023-06-24 15:55:44,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1780032.0, ans=0.125 2023-06-24 15:55:49,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-24 15:56:31,899 INFO [train.py:996] (3/4) Epoch 10, batch 22250, loss[loss=0.2679, simple_loss=0.3448, pruned_loss=0.09551, over 21262.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3122, pruned_loss=0.08069, over 4264897.18 frames. ], batch size: 143, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:57:02,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1780272.0, ans=0.125 2023-06-24 15:57:03,819 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:57:24,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-24 15:57:28,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780392.0, ans=0.1 2023-06-24 15:57:45,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-24 15:58:06,587 INFO [train.py:996] (3/4) Epoch 10, batch 22300, loss[loss=0.2108, simple_loss=0.281, pruned_loss=0.07032, over 21929.00 frames. 
], tot_loss[loss=0.2397, simple_loss=0.3136, pruned_loss=0.08294, over 4272182.40 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:58:13,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1780512.0, ans=0.1 2023-06-24 15:58:20,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1780512.0, ans=0.95 2023-06-24 15:58:37,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.918e+02 7.899e+02 1.099e+03 1.483e+03 2.745e+03, threshold=2.199e+03, percent-clipped=4.0 2023-06-24 15:59:42,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1780812.0, ans=0.1 2023-06-24 15:59:43,736 INFO [train.py:996] (3/4) Epoch 10, batch 22350, loss[loss=0.2278, simple_loss=0.2971, pruned_loss=0.07931, over 21839.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3119, pruned_loss=0.08321, over 4274541.89 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:00:00,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1780812.0, ans=0.0 2023-06-24 16:00:05,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1780872.0, ans=0.125 2023-06-24 16:00:07,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-24 16:00:12,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-24 16:00:16,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1780872.0, ans=0.125 2023-06-24 16:00:17,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1780932.0, ans=0.125 2023-06-24 16:00:46,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1780992.0, ans=0.125 2023-06-24 16:00:47,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1780992.0, ans=0.125 2023-06-24 16:01:25,434 INFO [train.py:996] (3/4) Epoch 10, batch 22400, loss[loss=0.198, simple_loss=0.2741, pruned_loss=0.06092, over 21624.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3078, pruned_loss=0.08054, over 4268514.45 frames. ], batch size: 332, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:01:37,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1781112.0, ans=0.0 2023-06-24 16:01:51,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.474e+02 7.856e+02 9.897e+02 1.374e+03 2.984e+03, threshold=1.979e+03, percent-clipped=5.0 2023-06-24 16:02:56,370 INFO [train.py:996] (3/4) Epoch 10, batch 22450, loss[loss=0.2194, simple_loss=0.2791, pruned_loss=0.07989, over 21181.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3018, pruned_loss=0.07967, over 4268702.02 frames. 
], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:03:04,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1781412.0, ans=0.1 2023-06-24 16:03:05,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-24 16:03:28,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-24 16:03:47,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1781532.0, ans=0.125 2023-06-24 16:04:04,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1781592.0, ans=0.2 2023-06-24 16:04:11,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1781652.0, ans=0.125 2023-06-24 16:04:27,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1781652.0, ans=0.125 2023-06-24 16:04:29,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1781652.0, ans=0.2 2023-06-24 16:04:33,444 INFO [train.py:996] (3/4) Epoch 10, batch 22500, loss[loss=0.2485, simple_loss=0.3473, pruned_loss=0.07485, over 21646.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2982, pruned_loss=0.07896, over 4272920.05 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:05:06,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.137e+02 1.060e+03 1.856e+03 3.830e+03, threshold=2.121e+03, percent-clipped=17.0 2023-06-24 16:05:24,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-24 16:05:33,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1781892.0, ans=0.125 2023-06-24 16:06:10,749 INFO [train.py:996] (3/4) Epoch 10, batch 22550, loss[loss=0.2251, simple_loss=0.2959, pruned_loss=0.07709, over 21852.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3014, pruned_loss=0.07898, over 4281649.18 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:06:19,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1782012.0, ans=0.02 2023-06-24 16:06:22,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1782012.0, ans=0.0 2023-06-24 16:06:47,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-24 16:07:05,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1782132.0, ans=0.125 2023-06-24 16:07:49,269 INFO [train.py:996] (3/4) Epoch 10, batch 22600, loss[loss=0.3158, simple_loss=0.3889, pruned_loss=0.1214, over 21477.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3058, pruned_loss=0.07943, over 4282579.63 frames. 
], batch size: 507, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:08:27,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.566e+02 7.815e+02 1.188e+03 1.926e+03 4.524e+03, threshold=2.375e+03, percent-clipped=20.0 2023-06-24 16:09:02,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1782492.0, ans=22.5 2023-06-24 16:09:12,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1782552.0, ans=0.125 2023-06-24 16:09:25,459 INFO [train.py:996] (3/4) Epoch 10, batch 22650, loss[loss=0.2234, simple_loss=0.2877, pruned_loss=0.07955, over 21817.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3025, pruned_loss=0.07971, over 4284609.36 frames. ], batch size: 98, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:09:57,351 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:09:57,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1782672.0, ans=0.0 2023-06-24 16:10:05,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1782732.0, ans=0.2 2023-06-24 16:10:31,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1782792.0, ans=0.125 2023-06-24 16:10:42,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-24 16:11:01,271 INFO [train.py:996] (3/4) Epoch 10, batch 22700, loss[loss=0.2073, simple_loss=0.2773, pruned_loss=0.06865, over 21783.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2972, pruned_loss=0.07934, over 4277725.76 frames. ], batch size: 112, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:11:11,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-24 16:11:18,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1782912.0, ans=0.125 2023-06-24 16:11:29,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1782972.0, ans=0.0 2023-06-24 16:11:38,440 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 7.699e+02 1.029e+03 1.382e+03 2.516e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-24 16:12:11,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1783092.0, ans=0.0 2023-06-24 16:12:20,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1783152.0, ans=0.125 2023-06-24 16:12:37,725 INFO [train.py:996] (3/4) Epoch 10, batch 22750, loss[loss=0.2025, simple_loss=0.2739, pruned_loss=0.06561, over 21972.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2983, pruned_loss=0.08026, over 4270500.77 frames. 
], batch size: 103, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:13:42,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1783392.0, ans=0.1 2023-06-24 16:14:11,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1783452.0, ans=0.2 2023-06-24 16:14:14,062 INFO [train.py:996] (3/4) Epoch 10, batch 22800, loss[loss=0.2345, simple_loss=0.2981, pruned_loss=0.08543, over 21307.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3027, pruned_loss=0.08299, over 4277851.71 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:14:51,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.113e+02 9.771e+02 1.479e+03 3.289e+03, threshold=1.954e+03, percent-clipped=6.0 2023-06-24 16:15:28,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1783692.0, ans=0.125 2023-06-24 16:15:29,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783692.0, ans=0.1 2023-06-24 16:15:49,959 INFO [train.py:996] (3/4) Epoch 10, batch 22850, loss[loss=0.198, simple_loss=0.2643, pruned_loss=0.06588, over 21813.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3014, pruned_loss=0.08233, over 4273953.84 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:16:28,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1783872.0, ans=0.0 2023-06-24 16:16:56,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1783992.0, ans=0.125 2023-06-24 16:17:02,699 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:17:13,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1784052.0, ans=0.0 2023-06-24 16:17:24,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=22.5 2023-06-24 16:17:27,971 INFO [train.py:996] (3/4) Epoch 10, batch 22900, loss[loss=0.2304, simple_loss=0.3303, pruned_loss=0.06531, over 21815.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3016, pruned_loss=0.08142, over 4274180.03 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:18:12,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.158e+02 1.057e+03 1.638e+03 3.126e+03, threshold=2.114e+03, percent-clipped=14.0 2023-06-24 16:18:59,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1784352.0, ans=0.0 2023-06-24 16:19:06,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1784352.0, ans=0.0 2023-06-24 16:19:16,522 INFO [train.py:996] (3/4) Epoch 10, batch 22950, loss[loss=0.2786, simple_loss=0.3682, pruned_loss=0.09444, over 19918.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3148, pruned_loss=0.0805, over 4274129.26 frames. 
2023-06-24 16:19:16,522 INFO [train.py:996] (3/4) Epoch 10, batch 22950, loss[loss=0.2786, simple_loss=0.3682, pruned_loss=0.09444, over 19918.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3148, pruned_loss=0.0805, over 4274129.26 frames. ], batch size: 703, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:19:51,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1784472.0, ans=0.1
2023-06-24 16:19:59,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1784532.0, ans=0.0
2023-06-24 16:20:13,433 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 16:20:26,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1784652.0, ans=0.0
2023-06-24 16:20:49,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1784652.0, ans=0.125
2023-06-24 16:20:52,490 INFO [train.py:996] (3/4) Epoch 10, batch 23000, loss[loss=0.2695, simple_loss=0.338, pruned_loss=0.1005, over 21606.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3135, pruned_loss=0.07789, over 4275872.69 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:21:12,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1784772.0, ans=0.0
2023-06-24 16:21:28,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=12.0
2023-06-24 16:21:30,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.479e+02 7.106e+02 9.780e+02 1.454e+03 3.933e+03, threshold=1.956e+03, percent-clipped=7.0
2023-06-24 16:21:32,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1784832.0, ans=0.04949747468305833
2023-06-24 16:21:32,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1784832.0, ans=0.2
2023-06-24 16:21:46,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1784892.0, ans=0.0
2023-06-24 16:21:51,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1784892.0, ans=0.125
2023-06-24 16:22:04,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0
2023-06-24 16:22:36,120 INFO [train.py:996] (3/4) Epoch 10, batch 23050, loss[loss=0.2536, simple_loss=0.3316, pruned_loss=0.08785, over 21805.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3144, pruned_loss=0.07988, over 4280882.47 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:22:53,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1785012.0, ans=0.0
2023-06-24 16:23:07,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1785072.0, ans=0.0
2023-06-24 16:23:54,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1785252.0, ans=0.2
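
Note how "batch size" above swings between roughly 100 and 700 within a few batches: batches are assembled by total audio duration, not by a fixed count of utterances, so a batch of long cuts holds few of them and a batch of short cuts holds many. A rough sketch of duration-capped batching; the cap value, the .duration attribute and the function name are assumptions for illustration:

    def duration_batches(cuts, max_duration=900.0):
        """Yield lists of cuts whose summed duration stays under max_duration.

        cuts: iterable of objects with a .duration attribute (seconds).
        Long cuts give small batches; short cuts give large ones, which is
        why the logged batch size varies so widely.
        """
        batch, total = [], 0.0
        for cut in cuts:
            if batch and total + cut.duration > max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(cut)
            total += cut.duration
        if batch:
            yield batch
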
2023-06-24 16:24:13,714 INFO [train.py:996] (3/4) Epoch 10, batch 23100, loss[loss=0.2051, simple_loss=0.2659, pruned_loss=0.07217, over 21437.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3094, pruned_loss=0.07992, over 4275049.74 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:24:18,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1785312.0, ans=0.125
2023-06-24 16:24:47,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.180e+02 6.884e+02 9.412e+02 1.257e+03 2.198e+03, threshold=1.882e+03, percent-clipped=3.0
2023-06-24 16:24:48,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1785432.0, ans=0.125
2023-06-24 16:25:11,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0
2023-06-24 16:25:50,174 INFO [train.py:996] (3/4) Epoch 10, batch 23150, loss[loss=0.2117, simple_loss=0.2656, pruned_loss=0.07885, over 20712.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3038, pruned_loss=0.07958, over 4274298.95 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:25:51,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5
2023-06-24 16:26:17,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1785672.0, ans=10.0
2023-06-24 16:26:22,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1785672.0, ans=0.125
2023-06-24 16:27:20,627 INFO [train.py:996] (3/4) Epoch 10, batch 23200, loss[loss=0.2567, simple_loss=0.3163, pruned_loss=0.09859, over 21537.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3038, pruned_loss=0.08015, over 4277530.66 frames. ], batch size: 194, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:27:22,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1785912.0, ans=0.125
2023-06-24 16:27:34,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1785912.0, ans=0.2
2023-06-24 16:27:34,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5
2023-06-24 16:27:41,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1785972.0, ans=15.0
2023-06-24 16:27:59,994 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.989e+02 9.310e+02 1.266e+03 2.936e+03, threshold=1.862e+03, percent-clipped=9.0
2023-06-24 16:28:00,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1786032.0, ans=0.2
2023-06-24 16:28:10,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5
2023-06-24 16:28:10,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0
2023-06-24 16:28:23,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0
2023-06-24 16:28:27,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0
2023-06-24 16:28:27,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0
2023-06-24 16:28:53,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1786152.0, ans=0.1
2023-06-24 16:29:01,237 INFO [train.py:996] (3/4) Epoch 10, batch 23250, loss[loss=0.2568, simple_loss=0.3204, pruned_loss=0.09657, over 21742.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3035, pruned_loss=0.08141, over 4289621.49 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:29:22,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0
2023-06-24 16:29:32,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1786272.0, ans=0.125
2023-06-24 16:29:35,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1786332.0, ans=0.125
2023-06-24 16:29:59,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1786392.0, ans=0.125
2023-06-24 16:30:37,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1786512.0, ans=0.125
2023-06-24 16:30:38,195 INFO [train.py:996] (3/4) Epoch 10, batch 23300, loss[loss=0.2776, simple_loss=0.3858, pruned_loss=0.08464, over 21202.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3108, pruned_loss=0.08273, over 4285903.85 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:30:38,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1786512.0, ans=0.125
2023-06-24 16:31:09,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1786572.0, ans=0.125
2023-06-24 16:31:13,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786632.0, ans=0.1
2023-06-24 16:31:14,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.511e+02 8.047e+02 1.147e+03 1.597e+03 3.212e+03, threshold=2.293e+03, percent-clipped=17.0
2023-06-24 16:32:20,651 INFO [train.py:996] (3/4) Epoch 10, batch 23350, loss[loss=0.1841, simple_loss=0.2727, pruned_loss=0.04772, over 21697.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3155, pruned_loss=0.08277, over 4289344.83 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 16.0
2023-06-24 16:32:32,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.88 vs. limit=5.0
2023-06-24 16:32:39,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1786872.0, ans=0.125
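
The scaling.py:962 Whitening records compare a per-activation diagnostic against a (scheduled) limit; a penalty only kicks in once the metric exceeds the limit. One standard way to express such a metric is the spread of the eigenvalues of the feature covariance, which equals 1.0 for perfectly white features; the sketch below uses that formulation as an assumption about what "metric" measures, not as icefall's exact code:

    import torch

    def whitening_metric(x, num_groups=1):
        """Covariance-spread diagnostic: 1.0 iff features are white.

        x: (num_frames, num_channels). Splits channels into num_groups and
        returns channels_per_group * sum(eig^2) / sum(eig)^2 averaged over
        groups, computed via the covariance trace trick (no eigendecomposition).
        """
        n, c = x.shape
        cg = c // num_groups
        metrics = []
        for g in range(num_groups):
            xg = x[:, g * cg:(g + 1) * cg]
            cov = (xg.T @ xg) / n                  # (cg, cg) covariance
            num = (cov @ cov).diagonal().sum()     # sum of squared eigenvalues
            den = cov.diagonal().sum() ** 2        # (sum of eigenvalues)^2
            metrics.append(cg * num / den)
        return torch.stack(metrics).mean()

Under this reading, metric=5.62 vs. limit=15.0 above means that layer's activations are well inside the allowed covariance spread, so no whitening pressure is applied there.
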
2023-06-24 16:32:40,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0
2023-06-24 16:32:49,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1786872.0, ans=0.125
2023-06-24 16:33:22,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1786992.0, ans=0.05
2023-06-24 16:33:22,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1786992.0, ans=10.0
2023-06-24 16:33:31,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1786992.0, ans=0.125
2023-06-24 16:33:45,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1787052.0, ans=0.0
2023-06-24 16:33:56,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1787112.0, ans=0.125
2023-06-24 16:33:57,546 INFO [train.py:996] (3/4) Epoch 10, batch 23400, loss[loss=0.2129, simple_loss=0.2878, pruned_loss=0.069, over 21835.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3085, pruned_loss=0.07885, over 4280796.02 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:34:04,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1787112.0, ans=0.125
2023-06-24 16:34:15,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1787172.0, ans=0.0
2023-06-24 16:34:28,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.034e+02 7.282e+02 9.879e+02 1.387e+03 3.167e+03, threshold=1.976e+03, percent-clipped=3.0
2023-06-24 16:34:29,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1787232.0, ans=0.125
2023-06-24 16:34:32,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787232.0, ans=0.1
2023-06-24 16:35:34,823 INFO [train.py:996] (3/4) Epoch 10, batch 23450, loss[loss=0.2241, simple_loss=0.3001, pruned_loss=0.07408, over 21744.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3091, pruned_loss=0.08073, over 4288538.64 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:36:32,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1787532.0, ans=0.0
2023-06-24 16:36:41,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1787592.0, ans=0.125
2023-06-24 16:37:09,826 INFO [train.py:996] (3/4) Epoch 10, batch 23500, loss[loss=0.2877, simple_loss=0.3349, pruned_loss=0.1202, over 21817.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3094, pruned_loss=0.08239, over 4291063.77 frames. ], batch size: 508, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:37:22,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1787712.0, ans=0.125
2023-06-24 16:37:34,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0
2023-06-24 16:37:35,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1787772.0, ans=0.125
2023-06-24 16:37:45,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 6.777e+02 9.961e+02 1.518e+03 3.385e+03, threshold=1.992e+03, percent-clipped=9.0
2023-06-24 16:37:49,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1787832.0, ans=0.0
2023-06-24 16:38:09,762 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 16:38:18,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0
2023-06-24 16:38:35,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1787952.0, ans=0.125
2023-06-24 16:38:46,203 INFO [train.py:996] (3/4) Epoch 10, batch 23550, loss[loss=0.2608, simple_loss=0.303, pruned_loss=0.1093, over 21604.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3074, pruned_loss=0.08277, over 4280396.69 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:38:51,663 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 16:39:04,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1788072.0, ans=0.125
2023-06-24 16:39:05,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788072.0, ans=0.1
2023-06-24 16:39:08,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0
2023-06-24 16:39:25,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1788132.0, ans=0.125
2023-06-24 16:39:34,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0
2023-06-24 16:39:50,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1788192.0, ans=0.125
2023-06-24 16:39:55,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1788192.0, ans=0.0
2023-06-24 16:40:18,339 INFO [train.py:996] (3/4) Epoch 10, batch 23600, loss[loss=0.2729, simple_loss=0.3412, pruned_loss=0.1023, over 21723.00 frames. ], tot_loss[loss=0.237, simple_loss=0.308, pruned_loss=0.08304, over 4278923.63 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 32.0
2023-06-24 16:40:32,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0
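
grad_scale in the train.py:996 records moves in powers of two (16.0 at batch 23550, 32.0 at batch 23600, smaller values later): with mixed-precision training the loss is multiplied by a scale before backward, the scale doubles after a long enough run of overflow-free steps and halves whenever gradients overflow. That is standard torch.cuda.amp.GradScaler behaviour; a minimal step using only documented PyTorch APIs (model, optimizer and loss_fn are placeholders):

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0,      # matches the grad_scale logged around here
        growth_factor=2.0,    # 16.0 -> 32.0 after growth_interval clean steps
        backoff_factor=0.5,   # 32.0 -> 16.0 -> 8.0 on overflowing steps
        growth_interval=2000,
    )

    def training_step(model, optimizer, batch, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model, batch)
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optimizer)          # unscales grads, skips step on inf/nan
        scaler.update()                 # grows or backs off the scale
        return loss.detach()
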
2023-06-24 16:40:52,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.88 vs. limit=5.0
2023-06-24 16:41:05,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 8.088e+02 1.163e+03 1.528e+03 3.406e+03, threshold=2.327e+03, percent-clipped=14.0
2023-06-24 16:41:24,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1788492.0, ans=0.125
2023-06-24 16:41:29,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788492.0, ans=0.1
2023-06-24 16:41:49,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1788552.0, ans=0.125
2023-06-24 16:41:55,368 INFO [train.py:996] (3/4) Epoch 10, batch 23650, loss[loss=0.2745, simple_loss=0.3416, pruned_loss=0.1037, over 21248.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3083, pruned_loss=0.08189, over 4275230.32 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:42:15,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1788672.0, ans=0.2
2023-06-24 16:42:16,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1788672.0, ans=0.125
2023-06-24 16:42:16,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788672.0, ans=0.1
2023-06-24 16:42:58,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1788732.0, ans=0.125
2023-06-24 16:43:30,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788852.0, ans=0.1
2023-06-24 16:43:38,446 INFO [train.py:996] (3/4) Epoch 10, batch 23700, loss[loss=0.2355, simple_loss=0.3184, pruned_loss=0.07635, over 21575.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3116, pruned_loss=0.08103, over 4279510.84 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:44:26,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 6.427e+02 8.775e+02 1.177e+03 2.225e+03, threshold=1.755e+03, percent-clipped=0.0
2023-06-24 16:44:32,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1789032.0, ans=0.0
2023-06-24 16:44:50,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1789092.0, ans=0.0
2023-06-24 16:45:00,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5
2023-06-24 16:45:22,433 INFO [train.py:996] (3/4) Epoch 10, batch 23750, loss[loss=0.2396, simple_loss=0.315, pruned_loss=0.08216, over 21764.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3138, pruned_loss=0.08206, over 4281847.50 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:45:35,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1789212.0, ans=0.2
2023-06-24 16:45:54,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1789272.0, ans=0.04949747468305833
2023-06-24 16:46:02,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1789332.0, ans=0.125
2023-06-24 16:46:24,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.61 vs. limit=22.5
2023-06-24 16:47:01,104 INFO [train.py:996] (3/4) Epoch 10, batch 23800, loss[loss=0.22, simple_loss=0.3024, pruned_loss=0.06879, over 21387.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3109, pruned_loss=0.07841, over 4278542.67 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:47:13,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1789512.0, ans=0.125
2023-06-24 16:47:39,066 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 7.251e+02 1.165e+03 1.659e+03 4.396e+03, threshold=2.330e+03, percent-clipped=22.0
2023-06-24 16:47:58,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1789692.0, ans=0.0
2023-06-24 16:48:24,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789752.0, ans=0.1
2023-06-24 16:48:44,368 INFO [train.py:996] (3/4) Epoch 10, batch 23850, loss[loss=0.2712, simple_loss=0.3497, pruned_loss=0.09629, over 21493.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3218, pruned_loss=0.08125, over 4279449.38 frames. ], batch size: 131, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:48:44,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1789812.0, ans=0.125
2023-06-24 16:48:54,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1789812.0, ans=0.0
2023-06-24 16:49:02,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1789872.0, ans=0.0
2023-06-24 16:49:35,333 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 16:49:44,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1789992.0, ans=0.05
2023-06-24 16:50:02,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0
2023-06-24 16:50:20,387 INFO [train.py:996] (3/4) Epoch 10, batch 23900, loss[loss=0.2114, simple_loss=0.2671, pruned_loss=0.0779, over 20189.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3265, pruned_loss=0.08328, over 4279974.70 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:50:20,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1790112.0, ans=0.2
2023-06-24 16:50:59,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 1.066e+03 1.472e+03 2.085e+03 4.372e+03, threshold=2.943e+03, percent-clipped=19.0
2023-06-24 16:51:58,850 INFO [train.py:996] (3/4) Epoch 10, batch 23950, loss[loss=0.2212, simple_loss=0.2764, pruned_loss=0.083, over 21414.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3184, pruned_loss=0.08205, over 4265986.44 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 16:52:00,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1790412.0, ans=0.2
2023-06-24 16:53:15,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1790592.0, ans=0.125
2023-06-24 16:53:23,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1790652.0, ans=0.2
2023-06-24 16:53:36,867 INFO [train.py:996] (3/4) Epoch 10, batch 24000, loss[loss=0.2982, simple_loss=0.3655, pruned_loss=0.1154, over 21277.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3206, pruned_loss=0.0858, over 4260538.86 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 32.0
2023-06-24 16:53:36,867 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-24 16:53:52,745 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2655, simple_loss=0.3589, pruned_loss=0.08609, over 1796401.00 frames.
2023-06-24 16:53:52,746 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-24 16:54:41,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.229e+02 9.460e+02 1.386e+03 2.838e+03, threshold=1.892e+03, percent-clipped=0.0
2023-06-24 16:54:47,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1790832.0, ans=0.125
2023-06-24 16:54:48,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1790832.0, ans=10.0
2023-06-24 16:54:57,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1790832.0, ans=0.09899494936611666
2023-06-24 16:55:19,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1790952.0, ans=0.05
2023-06-24 16:55:31,743 INFO [train.py:996] (3/4) Epoch 10, batch 24050, loss[loss=0.222, simple_loss=0.3141, pruned_loss=0.06497, over 21632.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3217, pruned_loss=0.0862, over 4264961.35 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 32.0
2023-06-24 16:55:33,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1791012.0, ans=0.125
2023-06-24 16:55:46,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1791012.0, ans=0.125
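
At batch 24000 above, training pauses to compute a validation loss over the fixed dev set (the same 1796401.00 frames at every validation) and logs the peak GPU memory, which comes straight from PyTorch's allocator statistics. A sketch of that periodic check; the interval default and the compute_loss helper are hypothetical stand-ins, not icefall's exact code:

    import torch

    @torch.no_grad()
    def maybe_validate(model, valid_dl, batch_idx, valid_interval=3000):
        """Periodically compute validation loss and report peak GPU memory."""
        if batch_idx % valid_interval != 0:
            return
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)  # hypothetical helper
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
        model.train()
        print(f"validation: loss={tot_loss / tot_frames:.4f}, "
              f"over {tot_frames:.2f} frames.")
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")

Because the dev set is identical at each validation, the validation loss values are directly comparable across the run, unlike the noisy per-batch training losses.
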
2023-06-24 16:57:13,052 INFO [train.py:996] (3/4) Epoch 10, batch 24100, loss[loss=0.2433, simple_loss=0.3264, pruned_loss=0.08014, over 21720.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3225, pruned_loss=0.08566, over 4271842.71 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 32.0
2023-06-24 16:57:13,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1791312.0, ans=0.0
2023-06-24 16:57:14,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0
2023-06-24 16:57:14,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1791312.0, ans=0.125
2023-06-24 16:57:17,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0
2023-06-24 16:57:50,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5
2023-06-24 16:57:58,699 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 16:58:01,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.131e+02 9.675e+02 1.402e+03 3.208e+03, threshold=1.935e+03, percent-clipped=13.0
2023-06-24 16:58:41,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1791552.0, ans=0.0
2023-06-24 16:58:52,123 INFO [train.py:996] (3/4) Epoch 10, batch 24150, loss[loss=0.2857, simple_loss=0.3461, pruned_loss=0.1127, over 21752.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3232, pruned_loss=0.08766, over 4281882.66 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:00:30,317 INFO [train.py:996] (3/4) Epoch 10, batch 24200, loss[loss=0.3148, simple_loss=0.3911, pruned_loss=0.1192, over 21594.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3265, pruned_loss=0.08915, over 4282332.70 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:00:32,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1791912.0, ans=0.125
2023-06-24 17:00:36,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1791912.0, ans=0.125
2023-06-24 17:01:03,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0
2023-06-24 17:01:11,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792032.0, ans=0.1
2023-06-24 17:01:15,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.347e+02 1.079e+03 1.481e+03 2.381e+03, threshold=2.159e+03, percent-clipped=5.0
2023-06-24 17:01:18,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.76 vs. limit=6.0
2023-06-24 17:01:37,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1792092.0, ans=10.0
2023-06-24 17:01:56,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1792152.0, ans=0.125
2023-06-24 17:02:14,119 INFO [train.py:996] (3/4) Epoch 10, batch 24250, loss[loss=0.2551, simple_loss=0.3493, pruned_loss=0.08042, over 21649.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3223, pruned_loss=0.08272, over 4285887.06 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:02:32,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1792212.0, ans=0.125
2023-06-24 17:03:26,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792452.0, ans=0.1
2023-06-24 17:03:56,595 INFO [train.py:996] (3/4) Epoch 10, batch 24300, loss[loss=0.1984, simple_loss=0.284, pruned_loss=0.05646, over 21806.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3176, pruned_loss=0.07797, over 4285176.12 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:04:06,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1792512.0, ans=0.125
2023-06-24 17:04:10,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1792572.0, ans=0.125
2023-06-24 17:04:14,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1792572.0, ans=0.125
2023-06-24 17:04:32,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.370e+02 5.749e+02 8.681e+02 1.337e+03 2.668e+03, threshold=1.736e+03, percent-clipped=3.0
2023-06-24 17:04:41,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1792632.0, ans=0.125
2023-06-24 17:04:56,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1792692.0, ans=0.025
2023-06-24 17:05:22,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5
2023-06-24 17:05:25,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1792752.0, ans=0.125
2023-06-24 17:05:29,311 INFO [train.py:996] (3/4) Epoch 10, batch 24350, loss[loss=0.2545, simple_loss=0.327, pruned_loss=0.09095, over 21814.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3152, pruned_loss=0.07689, over 4281558.87 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:05:32,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1792812.0, ans=0.0
2023-06-24 17:05:45,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1792872.0, ans=0.2
2023-06-24 17:06:58,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=15.0
2023-06-24 17:07:03,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 17:07:07,860 INFO [train.py:996] (3/4) Epoch 10, batch 24400, loss[loss=0.2252, simple_loss=0.3102, pruned_loss=0.07007, over 21597.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3162, pruned_loss=0.07902, over 4282716.42 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:07:17,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793112.0, ans=0.1
2023-06-24 17:07:18,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0
2023-06-24 17:07:52,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 8.479e+02 1.173e+03 1.615e+03 2.996e+03, threshold=2.346e+03, percent-clipped=19.0
2023-06-24 17:08:00,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1793232.0, ans=0.125
2023-06-24 17:08:24,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0
2023-06-24 17:08:48,993 INFO [train.py:996] (3/4) Epoch 10, batch 24450, loss[loss=0.2258, simple_loss=0.2983, pruned_loss=0.0767, over 21154.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3156, pruned_loss=0.08004, over 4280004.97 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:10:20,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1793652.0, ans=0.0
2023-06-24 17:10:26,059 INFO [train.py:996] (3/4) Epoch 10, batch 24500, loss[loss=0.2067, simple_loss=0.2673, pruned_loss=0.07304, over 20291.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3169, pruned_loss=0.08048, over 4281712.38 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:10:39,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793712.0, ans=0.1
2023-06-24 17:10:56,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1793772.0, ans=0.125
2023-06-24 17:11:01,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1793832.0, ans=0.0
2023-06-24 17:11:07,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 6.645e+02 1.000e+03 1.722e+03 3.391e+03, threshold=2.001e+03, percent-clipped=7.0
2023-06-24 17:11:33,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1793892.0, ans=0.1
2023-06-24 17:11:33,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1793892.0, ans=0.1
2023-06-24 17:11:44,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793952.0, ans=0.125
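
Each train.py:996 record shows two views of the same metrics: loss[...] for the single batch just processed, and tot_loss[...] as a frame-weighted aggregate over recent batches, which is why tot_loss drifts smoothly while the per-batch loss jumps around. A sketch of one plausible bookkeeping scheme, a decayed frame-weighted average; the decay constant is an assumption and this is not icefall's exact MetricsTracker:

    class RunningLoss:
        """Frame-weighted running average of per-batch losses.

        Each batch contributes loss * frames; older batches are decayed so
        the aggregate tracks recent training. The decay value is illustrative.
        """
        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss, num_frames):
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    tot.update(0.2252, 21597.0)  # batch 24400's loss[...] entry above
    print(f"tot_loss[loss={tot.value:.4f}, over {tot.frames:.2f} frames.]")
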
2023-06-24 17:11:59,898 INFO [train.py:996] (3/4) Epoch 10, batch 24550, loss[loss=0.2832, simple_loss=0.3541, pruned_loss=0.1062, over 21120.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3184, pruned_loss=0.08235, over 4283046.23 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:12:29,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1794072.0, ans=0.0
2023-06-24 17:13:15,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1794192.0, ans=0.125
2023-06-24 17:13:37,074 INFO [train.py:996] (3/4) Epoch 10, batch 24600, loss[loss=0.2599, simple_loss=0.332, pruned_loss=0.09394, over 20714.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3144, pruned_loss=0.08227, over 4279071.86 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:14:28,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.632e+02 7.490e+02 1.116e+03 1.562e+03 6.451e+03, threshold=2.232e+03, percent-clipped=18.0
2023-06-24 17:14:54,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1794492.0, ans=0.125
2023-06-24 17:14:59,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1794552.0, ans=0.125
2023-06-24 17:15:15,834 INFO [train.py:996] (3/4) Epoch 10, batch 24650, loss[loss=0.2079, simple_loss=0.2867, pruned_loss=0.06455, over 20802.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3073, pruned_loss=0.08144, over 4273605.18 frames. ], batch size: 609, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:15:16,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1794612.0, ans=0.07
2023-06-24 17:15:50,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0
2023-06-24 17:16:19,047 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 17:16:53,142 INFO [train.py:996] (3/4) Epoch 10, batch 24700, loss[loss=0.2645, simple_loss=0.3234, pruned_loss=0.1029, over 21486.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3052, pruned_loss=0.0804, over 4263896.97 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:17:18,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1794972.0, ans=0.125
2023-06-24 17:17:19,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.65 vs. limit=6.0
2023-06-24 17:17:49,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.249e+02 8.587e+02 1.281e+03 3.151e+03, threshold=1.717e+03, percent-clipped=6.0
2023-06-24 17:17:52,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1795032.0, ans=0.0
2023-06-24 17:18:14,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795152.0, ans=0.1
2023-06-24 17:18:19,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1795152.0, ans=0.09899494936611666
2023-06-24 17:18:24,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1795152.0, ans=0.5
2023-06-24 17:18:31,295 INFO [train.py:996] (3/4) Epoch 10, batch 24750, loss[loss=0.1908, simple_loss=0.2773, pruned_loss=0.05219, over 20734.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2988, pruned_loss=0.07759, over 4269008.25 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:18:32,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0
2023-06-24 17:19:20,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1795332.0, ans=0.0
2023-06-24 17:19:29,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=15.0
2023-06-24 17:19:47,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1795392.0, ans=0.125
2023-06-24 17:20:07,375 INFO [train.py:996] (3/4) Epoch 10, batch 24800, loss[loss=0.245, simple_loss=0.3049, pruned_loss=0.09257, over 21355.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2934, pruned_loss=0.07724, over 4274326.75 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 32.0
2023-06-24 17:20:23,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1795572.0, ans=0.125
2023-06-24 17:20:59,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.922e+02 8.431e+02 1.285e+03 2.453e+03, threshold=1.686e+03, percent-clipped=12.0
2023-06-24 17:21:45,136 INFO [train.py:996] (3/4) Epoch 10, batch 24850, loss[loss=0.1872, simple_loss=0.2521, pruned_loss=0.0612, over 21310.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2939, pruned_loss=0.07868, over 4275177.14 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:21:53,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795812.0, ans=0.1
2023-06-24 17:21:55,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1795812.0, ans=0.125
2023-06-24 17:23:06,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796052.0, ans=0.1
2023-06-24 17:23:22,411 INFO [train.py:996] (3/4) Epoch 10, batch 24900, loss[loss=0.2531, simple_loss=0.326, pruned_loss=0.09009, over 21844.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2963, pruned_loss=0.07936, over 4277521.04 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:24:13,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1796232.0, ans=0.0
2023-06-24 17:24:17,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1796232.0, ans=0.125
2023-06-24 17:24:20,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 8.107e+02 1.257e+03 1.989e+03 3.453e+03, threshold=2.514e+03, percent-clipped=33.0
2023-06-24 17:24:49,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1796352.0, ans=0.1
2023-06-24 17:25:00,284 INFO [train.py:996] (3/4) Epoch 10, batch 24950, loss[loss=0.2524, simple_loss=0.3179, pruned_loss=0.09347, over 21586.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3047, pruned_loss=0.08363, over 4276641.57 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:25:35,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0
2023-06-24 17:26:14,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0
2023-06-24 17:26:21,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1796652.0, ans=0.125
2023-06-24 17:26:43,361 INFO [train.py:996] (3/4) Epoch 10, batch 25000, loss[loss=0.2177, simple_loss=0.2893, pruned_loss=0.07307, over 21908.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3116, pruned_loss=0.08483, over 4278850.10 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:27:37,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.141e+02 9.275e+02 1.381e+03 2.945e+03, threshold=1.855e+03, percent-clipped=4.0
2023-06-24 17:27:42,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1796832.0, ans=0.1
2023-06-24 17:28:23,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1796952.0, ans=0.125
2023-06-24 17:28:31,888 INFO [train.py:996] (3/4) Epoch 10, batch 25050, loss[loss=0.2535, simple_loss=0.3632, pruned_loss=0.07193, over 19965.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3054, pruned_loss=0.08257, over 4280280.36 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:29:19,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0
2023-06-24 17:30:03,435 INFO [train.py:996] (3/4) Epoch 10, batch 25100, loss[loss=0.2185, simple_loss=0.3164, pruned_loss=0.06032, over 21852.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3009, pruned_loss=0.08212, over 4285193.69 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:30:51,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.255e+02 9.404e+02 1.549e+03 2.850e+03, threshold=1.881e+03, percent-clipped=12.0
2023-06-24 17:31:12,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1797492.0, ans=0.125
2023-06-24 17:31:19,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1797552.0, ans=0.95
2023-06-24 17:31:35,832 INFO [train.py:996] (3/4) Epoch 10, batch 25150, loss[loss=0.1996, simple_loss=0.292, pruned_loss=0.0536, over 21817.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3044, pruned_loss=0.08018, over 4281096.75 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 8.0
2023-06-24 17:32:06,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5
2023-06-24 17:32:36,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1797792.0, ans=0.2
2023-06-24 17:32:49,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1797792.0, ans=0.125
2023-06-24 17:33:00,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1797852.0, ans=0.125
2023-06-24 17:33:12,420 INFO [train.py:996] (3/4) Epoch 10, batch 25200, loss[loss=0.2449, simple_loss=0.2993, pruned_loss=0.09522, over 20077.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3037, pruned_loss=0.07886, over 4282319.00 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:34:00,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.948e+02 1.057e+03 1.508e+03 2.758e+03, threshold=2.115e+03, percent-clipped=16.0
2023-06-24 17:34:33,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1798152.0, ans=0.0
2023-06-24 17:34:37,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1798152.0, ans=0.125
2023-06-24 17:34:37,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1798152.0, ans=0.1
2023-06-24 17:34:49,118 INFO [train.py:996] (3/4) Epoch 10, batch 25250, loss[loss=0.1882, simple_loss=0.2662, pruned_loss=0.05508, over 21191.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3011, pruned_loss=0.07767, over 4271840.14 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:36:26,952 INFO [train.py:996] (3/4) Epoch 10, batch 25300, loss[loss=0.2173, simple_loss=0.2971, pruned_loss=0.06875, over 21729.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3001, pruned_loss=0.07741, over 4262250.73 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:37:03,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1798572.0, ans=0.125
2023-06-24 17:37:15,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.891e+02 9.341e+02 1.406e+03 3.031e+03, threshold=1.868e+03, percent-clipped=2.0
2023-06-24 17:38:02,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1798752.0, ans=0.125
2023-06-24 17:38:09,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0
2023-06-24 17:38:10,229 INFO [train.py:996] (3/4) Epoch 10, batch 25350, loss[loss=0.1785, simple_loss=0.269, pruned_loss=0.04397, over 21761.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3023, pruned_loss=0.07686, over 4252253.49 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:38:20,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798812.0, ans=0.1
2023-06-24 17:39:14,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1798992.0, ans=0.125
2023-06-24 17:39:42,334 INFO [train.py:996] (3/4) Epoch 10, batch 25400, loss[loss=0.2143, simple_loss=0.2734, pruned_loss=0.07754, over 21344.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2987, pruned_loss=0.07615, over 4255346.05 frames. ], batch size: 144, lr: 2.89e-03, grad_scale: 16.0
2023-06-24 17:39:54,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1799112.0, ans=0.0
2023-06-24 17:40:13,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5
2023-06-24 17:40:31,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.306e+02 6.316e+02 8.896e+02 1.149e+03 2.761e+03, threshold=1.779e+03, percent-clipped=5.0
2023-06-24 17:41:09,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5
2023-06-24 17:41:24,897 INFO [train.py:996] (3/4) Epoch 10, batch 25450, loss[loss=0.2398, simple_loss=0.3002, pruned_loss=0.08969, over 21579.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2995, pruned_loss=0.07722, over 4258473.75 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0
2023-06-24 17:41:47,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1799472.0, ans=0.0
2023-06-24 17:41:49,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0
2023-06-24 17:42:28,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1799592.0, ans=0.125
2023-06-24 17:43:00,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1799652.0, ans=0.95
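
The learning rate decays only in the fourth digit across this whole section (lr: 2.89e-03 giving way to 2.88e-03 at batch 25450) because by ~1.8M batches the schedule is deep in its power-law tail. The Zipformer recipes use an "Eden" style schedule; the formula below is reconstructed from memory with illustrative constants and should be checked against icefall's optim.py before being relied on:

    def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
        """Eden-style learning rate: smooth power-law decay in both the
        batch count and the (fractional) epoch. Constants are illustrative."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

For batch counts far beyond lr_batches, batch_factor behaves like (lr_batches / batch) ** 0.5, which is why successive 50-batch reporting intervals barely move the logged lr.
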
2023-06-24 17:43:04,172 INFO [train.py:996] (3/4) Epoch 10, batch 25500, loss[loss=0.2342, simple_loss=0.3286, pruned_loss=0.06993, over 21620.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3007, pruned_loss=0.07456, over 4261409.80 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0
2023-06-24 17:43:23,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1799772.0, ans=0.125
2023-06-24 17:43:23,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0
2023-06-24 17:43:28,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1799772.0, ans=0.0
2023-06-24 17:43:43,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.327e+02 6.563e+02 1.058e+03 1.442e+03 3.790e+03, threshold=2.117e+03, percent-clipped=15.0
2023-06-24 17:43:45,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1799832.0, ans=0.0
2023-06-24 17:44:39,387 INFO [train.py:996] (3/4) Epoch 10, batch 25550, loss[loss=0.2673, simple_loss=0.3657, pruned_loss=0.08446, over 21573.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3061, pruned_loss=0.07395, over 4259476.80 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 8.0
2023-06-24 17:44:50,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1800012.0, ans=0.125
2023-06-24 17:45:09,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1800132.0, ans=0.2
2023-06-24 17:45:18,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1800132.0, ans=0.125
2023-06-24 17:46:17,012 INFO [train.py:996] (3/4) Epoch 10, batch 25600, loss[loss=0.291, simple_loss=0.364, pruned_loss=0.109, over 21485.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3102, pruned_loss=0.0748, over 4266366.02 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0
2023-06-24 17:46:25,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0
2023-06-24 17:46:27,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1800312.0, ans=0.2
2023-06-24 17:46:29,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0
2023-06-24 17:46:33,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0
2023-06-24 17:46:37,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1800372.0, ans=0.125
2023-06-24 17:46:44,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1800372.0, ans=0.0
2023-06-24 17:46:58,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 8.019e+02 1.314e+03 1.703e+03 3.186e+03, threshold=2.628e+03, percent-clipped=13.0
2023-06-24 17:47:17,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1800492.0, ans=0.0
2023-06-24 17:47:54,743 INFO [train.py:996] (3/4) Epoch 10, batch 25650, loss[loss=0.2535, simple_loss=0.3082, pruned_loss=0.09939, over 21057.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3108, pruned_loss=0.07771, over 4261624.68 frames. ], batch size: 143, lr: 2.88e-03, grad_scale: 8.0
2023-06-24 17:48:50,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1800792.0, ans=0.0
2023-06-24 17:49:28,887 INFO [train.py:996] (3/4) Epoch 10, batch 25700, loss[loss=0.2212, simple_loss=0.2775, pruned_loss=0.08246, over 21376.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3091, pruned_loss=0.07935, over 4258787.63 frames. ], batch size: 473, lr: 2.88e-03, grad_scale: 8.0
2023-06-24 17:49:45,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0
2023-06-24 17:50:16,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.817e+02 1.358e+03 2.096e+03 4.463e+03, threshold=2.717e+03, percent-clipped=14.0
2023-06-24 17:51:03,503 INFO [train.py:996] (3/4) Epoch 10, batch 25750, loss[loss=0.2714, simple_loss=0.336, pruned_loss=0.1034, over 21460.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3114, pruned_loss=0.08085, over 4263065.01 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 8.0
2023-06-24 17:51:04,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1801212.0, ans=0.125
2023-06-24 17:51:11,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1801212.0, ans=0.035
2023-06-24 17:51:21,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1801272.0, ans=0.0
], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:52:53,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1801512.0, ans=0.1 2023-06-24 17:53:01,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1801512.0, ans=0.0 2023-06-24 17:53:40,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 7.699e+02 1.038e+03 1.788e+03 4.629e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-24 17:53:59,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1801752.0, ans=0.0 2023-06-24 17:54:01,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1801752.0, ans=0.95 2023-06-24 17:54:03,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801752.0, ans=0.1 2023-06-24 17:54:17,136 INFO [train.py:996] (3/4) Epoch 10, batch 25850, loss[loss=0.2662, simple_loss=0.3361, pruned_loss=0.09814, over 21872.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.324, pruned_loss=0.08452, over 4264879.55 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:54:39,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.03 vs. limit=15.0 2023-06-24 17:54:41,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-24 17:54:50,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1801872.0, ans=0.125 2023-06-24 17:55:15,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1801932.0, ans=0.125 2023-06-24 17:55:45,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1802052.0, ans=0.0 2023-06-24 17:55:53,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1802052.0, ans=0.125 2023-06-24 17:56:01,198 INFO [train.py:996] (3/4) Epoch 10, batch 25900, loss[loss=0.2364, simple_loss=0.3251, pruned_loss=0.07383, over 21514.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3257, pruned_loss=0.08501, over 4273160.70 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:56:35,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-24 17:56:39,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1802172.0, ans=0.07 2023-06-24 17:56:53,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 7.173e+02 1.132e+03 1.463e+03 2.574e+03, threshold=2.264e+03, percent-clipped=5.0 2023-06-24 17:57:02,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=22.5 2023-06-24 17:57:22,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1802352.0, ans=0.125 2023-06-24 17:57:27,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1802352.0, ans=0.07 2023-06-24 17:57:45,042 INFO [train.py:996] (3/4) Epoch 10, batch 25950, loss[loss=0.2485, simple_loss=0.3301, pruned_loss=0.08344, over 21745.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3316, pruned_loss=0.08764, over 4275796.43 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:58:33,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1802592.0, ans=0.125 2023-06-24 17:58:35,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1802592.0, ans=0.0 2023-06-24 17:59:23,435 INFO [train.py:996] (3/4) Epoch 10, batch 26000, loss[loss=0.2944, simple_loss=0.3768, pruned_loss=0.1061, over 21439.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3316, pruned_loss=0.08646, over 4272618.52 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:00:06,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.493e+02 7.586e+02 1.074e+03 1.522e+03 3.008e+03, threshold=2.148e+03, percent-clipped=6.0 2023-06-24 18:00:16,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1802892.0, ans=0.125 2023-06-24 18:00:41,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1802952.0, ans=0.02 2023-06-24 18:00:56,212 INFO [train.py:996] (3/4) Epoch 10, batch 26050, loss[loss=0.3173, simple_loss=0.4234, pruned_loss=0.1056, over 19752.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3315, pruned_loss=0.08772, over 4273431.03 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:01:04,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1803012.0, ans=0.125 2023-06-24 18:01:16,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1803072.0, ans=0.125 2023-06-24 18:02:18,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-24 18:02:32,709 INFO [train.py:996] (3/4) Epoch 10, batch 26100, loss[loss=0.2421, simple_loss=0.304, pruned_loss=0.09005, over 21787.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3269, pruned_loss=0.08796, over 4275890.36 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:03:05,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1803432.0, ans=0.125 2023-06-24 18:03:15,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.148e+02 7.025e+02 9.795e+02 1.501e+03 3.322e+03, threshold=1.959e+03, percent-clipped=9.0 2023-06-24 18:03:51,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1803552.0, ans=0.0 2023-06-24 18:04:01,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1803552.0, ans=0.125 2023-06-24 18:04:05,708 INFO [train.py:996] (3/4) Epoch 10, batch 26150, loss[loss=0.2415, simple_loss=0.3141, pruned_loss=0.08449, over 21879.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3223, pruned_loss=0.08738, over 4282539.17 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:04:11,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-24 18:04:19,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1803612.0, ans=0.125 2023-06-24 18:05:24,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.84 vs. limit=6.0 2023-06-24 18:05:44,639 INFO [train.py:996] (3/4) Epoch 10, batch 26200, loss[loss=0.2765, simple_loss=0.3804, pruned_loss=0.08624, over 21622.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3245, pruned_loss=0.08596, over 4288734.76 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:06:36,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.225e+02 8.349e+02 1.175e+03 2.397e+03, threshold=1.670e+03, percent-clipped=3.0 2023-06-24 18:06:55,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1804092.0, ans=0.0 2023-06-24 18:07:17,717 INFO [train.py:996] (3/4) Epoch 10, batch 26250, loss[loss=0.2971, simple_loss=0.3624, pruned_loss=0.1159, over 21758.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3269, pruned_loss=0.08412, over 4288412.73 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:07:22,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804212.0, ans=0.0 2023-06-24 18:07:35,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1804272.0, ans=0.2 2023-06-24 18:08:10,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804332.0, ans=0.125 2023-06-24 18:08:46,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1804452.0, ans=0.0 2023-06-24 18:08:53,828 INFO [train.py:996] (3/4) Epoch 10, batch 26300, loss[loss=0.2307, simple_loss=0.3042, pruned_loss=0.07862, over 21919.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3241, pruned_loss=0.08511, over 4296615.58 frames. 
], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:09:08,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1804572.0, ans=0.125 2023-06-24 18:09:35,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-24 18:09:46,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1804632.0, ans=0.0 2023-06-24 18:09:48,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804632.0, ans=0.0 2023-06-24 18:09:48,130 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:09:48,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=15.0 2023-06-24 18:09:51,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 7.232e+02 9.217e+02 1.300e+03 2.808e+03, threshold=1.843e+03, percent-clipped=15.0 2023-06-24 18:10:07,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1804692.0, ans=0.0 2023-06-24 18:10:27,966 INFO [train.py:996] (3/4) Epoch 10, batch 26350, loss[loss=0.2216, simple_loss=0.3038, pruned_loss=0.06974, over 20740.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3224, pruned_loss=0.08553, over 4297277.68 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:10:44,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1804812.0, ans=0.125 2023-06-24 18:10:58,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1804872.0, ans=0.2 2023-06-24 18:11:23,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-24 18:11:41,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804992.0, ans=0.125 2023-06-24 18:11:46,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1805052.0, ans=0.125 2023-06-24 18:11:55,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1805052.0, ans=0.125 2023-06-24 18:12:00,213 INFO [train.py:996] (3/4) Epoch 10, batch 26400, loss[loss=0.2182, simple_loss=0.2759, pruned_loss=0.08027, over 21478.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3172, pruned_loss=0.08581, over 4292773.26 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:12:57,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805232.0, ans=0.125 2023-06-24 18:13:02,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1805232.0, ans=0.2 2023-06-24 18:13:04,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.130e+02 9.166e+02 1.346e+03 2.893e+03, threshold=1.833e+03, percent-clipped=10.0 2023-06-24 18:13:08,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805292.0, ans=0.1 2023-06-24 18:13:10,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1805292.0, ans=0.0 2023-06-24 18:13:21,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-24 18:13:22,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805352.0, ans=0.125 2023-06-24 18:13:44,720 INFO [train.py:996] (3/4) Epoch 10, batch 26450, loss[loss=0.266, simple_loss=0.3946, pruned_loss=0.06873, over 21136.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3159, pruned_loss=0.08453, over 4281653.19 frames. ], batch size: 549, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:14:16,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1805472.0, ans=0.0 2023-06-24 18:14:24,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-24 18:14:42,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1805532.0, ans=0.125 2023-06-24 18:14:57,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1805592.0, ans=0.125 2023-06-24 18:15:33,616 INFO [train.py:996] (3/4) Epoch 10, batch 26500, loss[loss=0.2551, simple_loss=0.3353, pruned_loss=0.08742, over 21679.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3205, pruned_loss=0.08424, over 4283084.01 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:15:53,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1805772.0, ans=0.125 2023-06-24 18:16:05,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-24 18:16:19,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.291e+02 8.492e+02 1.713e+03 2.381e+03 4.815e+03, threshold=3.427e+03, percent-clipped=46.0 2023-06-24 18:16:42,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1805892.0, ans=0.125 2023-06-24 18:17:13,481 INFO [train.py:996] (3/4) Epoch 10, batch 26550, loss[loss=0.203, simple_loss=0.2844, pruned_loss=0.06075, over 21660.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3168, pruned_loss=0.08142, over 4269014.65 frames. 
], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:18:32,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1806192.0, ans=0.1 2023-06-24 18:18:51,160 INFO [train.py:996] (3/4) Epoch 10, batch 26600, loss[loss=0.2259, simple_loss=0.3018, pruned_loss=0.075, over 21645.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3154, pruned_loss=0.07869, over 4262993.91 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:19:38,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1806432.0, ans=0.04949747468305833 2023-06-24 18:19:50,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.735e+02 9.367e+02 1.418e+03 2.947e+03, threshold=1.873e+03, percent-clipped=0.0 2023-06-24 18:20:27,316 INFO [train.py:996] (3/4) Epoch 10, batch 26650, loss[loss=0.2455, simple_loss=0.332, pruned_loss=0.07952, over 19953.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3085, pruned_loss=0.07744, over 4245555.22 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:20:38,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1806612.0, ans=0.95 2023-06-24 18:21:08,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1806732.0, ans=0.2 2023-06-24 18:21:09,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1806732.0, ans=0.125 2023-06-24 18:22:04,405 INFO [train.py:996] (3/4) Epoch 10, batch 26700, loss[loss=0.2229, simple_loss=0.2938, pruned_loss=0.07604, over 21858.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3009, pruned_loss=0.07408, over 4258029.37 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:22:07,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-24 18:22:53,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807032.0, ans=0.1 2023-06-24 18:23:03,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.904e+02 8.895e+02 1.266e+03 2.611e+03, threshold=1.779e+03, percent-clipped=6.0 2023-06-24 18:23:37,433 INFO [train.py:996] (3/4) Epoch 10, batch 26750, loss[loss=0.2265, simple_loss=0.3123, pruned_loss=0.07036, over 21618.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3007, pruned_loss=0.07327, over 4270901.34 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:23:44,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807212.0, ans=0.1 2023-06-24 18:23:52,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1807212.0, ans=0.2 2023-06-24 18:24:13,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.82 vs. 
limit=22.5 2023-06-24 18:24:44,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1807392.0, ans=0.2 2023-06-24 18:25:03,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1807452.0, ans=0.0 2023-06-24 18:25:17,205 INFO [train.py:996] (3/4) Epoch 10, batch 26800, loss[loss=0.2312, simple_loss=0.315, pruned_loss=0.07371, over 21508.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3081, pruned_loss=0.07706, over 4271856.91 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:25:49,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1807572.0, ans=0.05 2023-06-24 18:26:06,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807632.0, ans=0.1 2023-06-24 18:26:16,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1807632.0, ans=6.0 2023-06-24 18:26:16,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.518e+02 9.832e+02 1.417e+03 2.844e+03, threshold=1.966e+03, percent-clipped=8.0 2023-06-24 18:26:26,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1807692.0, ans=0.0 2023-06-24 18:26:39,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1807752.0, ans=0.125 2023-06-24 18:26:53,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1807812.0, ans=0.125 2023-06-24 18:26:59,134 INFO [train.py:996] (3/4) Epoch 10, batch 26850, loss[loss=0.2312, simple_loss=0.3009, pruned_loss=0.08076, over 20792.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3096, pruned_loss=0.07969, over 4273674.94 frames. ], batch size: 609, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:27:30,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-24 18:27:34,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1807872.0, ans=0.0 2023-06-24 18:27:45,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807932.0, ans=0.0 2023-06-24 18:28:27,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-24 18:28:30,937 INFO [train.py:996] (3/4) Epoch 10, batch 26900, loss[loss=0.2229, simple_loss=0.2717, pruned_loss=0.08702, over 21337.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.302, pruned_loss=0.07852, over 4275915.29 frames. ], batch size: 177, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:29:19,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.83 vs. 
limit=10.0 2023-06-24 18:29:26,060 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:29:26,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-24 18:29:30,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.494e+02 9.472e+02 1.508e+03 3.136e+03, threshold=1.894e+03, percent-clipped=8.0 2023-06-24 18:29:39,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1808292.0, ans=0.125 2023-06-24 18:29:47,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1808352.0, ans=0.125 2023-06-24 18:30:07,712 INFO [train.py:996] (3/4) Epoch 10, batch 26950, loss[loss=0.2029, simple_loss=0.2645, pruned_loss=0.07067, over 21650.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3017, pruned_loss=0.07885, over 4259204.52 frames. ], batch size: 333, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:30:31,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-24 18:30:52,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1808472.0, ans=0.125 2023-06-24 18:31:54,840 INFO [train.py:996] (3/4) Epoch 10, batch 27000, loss[loss=0.1919, simple_loss=0.2802, pruned_loss=0.05181, over 21700.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3024, pruned_loss=0.07699, over 4267364.56 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:31:54,840 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 18:32:05,034 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.7873, 1.7928, 2.6289, 2.3578, 1.3617, 2.6949, 2.7131, 1.3514], device='cuda:3') 2023-06-24 18:32:16,368 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2412, simple_loss=0.3374, pruned_loss=0.07247, over 1796401.00 frames. 2023-06-24 18:32:16,369 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 18:32:18,748 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:33:08,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 6.448e+02 9.390e+02 1.351e+03 2.937e+03, threshold=1.878e+03, percent-clipped=11.0 2023-06-24 18:33:19,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-24 18:33:55,933 INFO [train.py:996] (3/4) Epoch 10, batch 27050, loss[loss=0.2223, simple_loss=0.3004, pruned_loss=0.07211, over 21871.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3063, pruned_loss=0.07523, over 4269724.96 frames. 
], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:34:02,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1809012.0, ans=0.125 2023-06-24 18:34:30,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1809132.0, ans=0.0 2023-06-24 18:35:28,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1809252.0, ans=0.09899494936611666 2023-06-24 18:35:31,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1809312.0, ans=0.5 2023-06-24 18:35:32,512 INFO [train.py:996] (3/4) Epoch 10, batch 27100, loss[loss=0.2359, simple_loss=0.3341, pruned_loss=0.06882, over 21750.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3063, pruned_loss=0.07512, over 4276659.32 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:36:24,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.272e+02 8.340e+02 1.180e+03 2.454e+03, threshold=1.668e+03, percent-clipped=3.0 2023-06-24 18:36:39,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1809492.0, ans=0.0 2023-06-24 18:36:46,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1809492.0, ans=0.0 2023-06-24 18:36:52,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1809552.0, ans=0.125 2023-06-24 18:37:10,881 INFO [train.py:996] (3/4) Epoch 10, batch 27150, loss[loss=0.2909, simple_loss=0.3807, pruned_loss=0.1005, over 21827.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.318, pruned_loss=0.07881, over 4273861.59 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:37:22,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1809612.0, ans=0.2 2023-06-24 18:37:31,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-24 18:37:36,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-24 18:37:39,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1809672.0, ans=0.07 2023-06-24 18:37:50,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1809732.0, ans=0.07 2023-06-24 18:38:30,802 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:38:49,495 INFO [train.py:996] (3/4) Epoch 10, batch 27200, loss[loss=0.2486, simple_loss=0.3318, pruned_loss=0.08275, over 20684.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3239, pruned_loss=0.08037, over 4264000.95 frames. 
], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:39:48,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1810032.0, ans=0.125 2023-06-24 18:39:56,029 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.616e+02 1.186e+03 1.800e+03 4.357e+03, threshold=2.372e+03, percent-clipped=30.0 2023-06-24 18:40:27,550 INFO [train.py:996] (3/4) Epoch 10, batch 27250, loss[loss=0.2722, simple_loss=0.3434, pruned_loss=0.1005, over 21735.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3253, pruned_loss=0.08386, over 4266096.59 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:40:29,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1810212.0, ans=0.0 2023-06-24 18:40:59,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.83 vs. limit=15.0 2023-06-24 18:41:13,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1810332.0, ans=0.125 2023-06-24 18:41:21,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1810332.0, ans=0.0 2023-06-24 18:41:43,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1810392.0, ans=0.125 2023-06-24 18:42:12,489 INFO [train.py:996] (3/4) Epoch 10, batch 27300, loss[loss=0.2874, simple_loss=0.351, pruned_loss=0.1119, over 21334.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3285, pruned_loss=0.08622, over 4265856.44 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:42:51,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 18:43:13,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1810632.0, ans=0.0 2023-06-24 18:43:18,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.724e+02 6.828e+02 8.478e+02 1.184e+03 2.294e+03, threshold=1.696e+03, percent-clipped=0.0 2023-06-24 18:43:55,760 INFO [train.py:996] (3/4) Epoch 10, batch 27350, loss[loss=0.2035, simple_loss=0.29, pruned_loss=0.05853, over 21449.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3319, pruned_loss=0.08687, over 4267265.91 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:44:06,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1810812.0, ans=0.125 2023-06-24 18:44:06,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-24 18:44:34,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1810872.0, ans=0.125 2023-06-24 18:44:39,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. 
limit=6.0 2023-06-24 18:44:42,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1810872.0, ans=0.125 2023-06-24 18:44:48,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1810932.0, ans=0.2 2023-06-24 18:44:52,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-24 18:45:08,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1810992.0, ans=0.125 2023-06-24 18:45:10,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-24 18:45:48,112 INFO [train.py:996] (3/4) Epoch 10, batch 27400, loss[loss=0.2463, simple_loss=0.3114, pruned_loss=0.09055, over 21541.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.326, pruned_loss=0.0856, over 4272586.66 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:46:49,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.744e+02 9.251e+02 1.244e+03 3.904e+03, threshold=1.850e+03, percent-clipped=13.0 2023-06-24 18:46:56,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1811292.0, ans=0.125 2023-06-24 18:47:23,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1811352.0, ans=0.125 2023-06-24 18:47:39,087 INFO [train.py:996] (3/4) Epoch 10, batch 27450, loss[loss=0.2302, simple_loss=0.3061, pruned_loss=0.0772, over 21421.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3205, pruned_loss=0.08487, over 4281781.67 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:48:02,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1811472.0, ans=0.125 2023-06-24 18:49:10,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811652.0, ans=0.1 2023-06-24 18:49:19,528 INFO [train.py:996] (3/4) Epoch 10, batch 27500, loss[loss=0.2747, simple_loss=0.3501, pruned_loss=0.09963, over 21848.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3209, pruned_loss=0.08587, over 4279358.53 frames. ], batch size: 107, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:50:02,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811832.0, ans=0.1 2023-06-24 18:50:08,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. limit=5.0 2023-06-24 18:50:26,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.516e+02 9.018e+02 1.642e+03 4.000e+03, threshold=1.804e+03, percent-clipped=22.0 2023-06-24 18:50:56,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1811952.0, ans=0.025 2023-06-24 18:51:04,410 INFO [train.py:996] (3/4) Epoch 10, batch 27550, loss[loss=0.2406, simple_loss=0.3044, pruned_loss=0.08835, over 21482.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3153, pruned_loss=0.08333, over 4285092.78 frames. 
], batch size: 389, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 18:52:37,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-24 18:52:45,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1812312.0, ans=0.0 2023-06-24 18:52:52,098 INFO [train.py:996] (3/4) Epoch 10, batch 27600, loss[loss=0.2605, simple_loss=0.3042, pruned_loss=0.1084, over 21341.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3075, pruned_loss=0.08176, over 4287754.49 frames. ], batch size: 508, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:52:54,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1812312.0, ans=0.0 2023-06-24 18:53:51,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.022e+02 8.920e+02 1.196e+03 2.930e+03, threshold=1.784e+03, percent-clipped=9.0 2023-06-24 18:53:59,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1812492.0, ans=0.125 2023-06-24 18:54:25,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1812552.0, ans=0.2 2023-06-24 18:54:33,757 INFO [train.py:996] (3/4) Epoch 10, batch 27650, loss[loss=0.216, simple_loss=0.3119, pruned_loss=0.06009, over 21714.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.302, pruned_loss=0.08072, over 4280977.92 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:55:07,453 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:56:23,898 INFO [train.py:996] (3/4) Epoch 10, batch 27700, loss[loss=0.2477, simple_loss=0.3438, pruned_loss=0.07582, over 21269.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3011, pruned_loss=0.07878, over 4282562.39 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:56:39,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1812912.0, ans=0.0 2023-06-24 18:56:48,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1812972.0, ans=0.125 2023-06-24 18:57:20,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.274e+02 1.072e+03 1.414e+03 3.667e+03, threshold=2.145e+03, percent-clipped=20.0 2023-06-24 18:57:44,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-24 18:57:48,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1813152.0, ans=0.2 2023-06-24 18:58:08,650 INFO [train.py:996] (3/4) Epoch 10, batch 27750, loss[loss=0.2352, simple_loss=0.3094, pruned_loss=0.08055, over 21245.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3029, pruned_loss=0.07793, over 4279031.86 frames. 
], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:58:24,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1813212.0, ans=0.2 2023-06-24 18:59:31,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1813452.0, ans=0.125 2023-06-24 18:59:38,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813452.0, ans=0.1 2023-06-24 18:59:45,892 INFO [train.py:996] (3/4) Epoch 10, batch 27800, loss[loss=0.2858, simple_loss=0.3387, pruned_loss=0.1165, over 21650.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3032, pruned_loss=0.07901, over 4286643.43 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:00:25,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1813572.0, ans=0.0 2023-06-24 19:00:49,652 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.266e+02 8.786e+02 1.229e+03 3.001e+03, threshold=1.757e+03, percent-clipped=8.0 2023-06-24 19:01:10,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1813692.0, ans=0.125 2023-06-24 19:01:18,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1813752.0, ans=0.125 2023-06-24 19:01:19,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-24 19:01:23,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1813752.0, ans=0.125 2023-06-24 19:01:40,618 INFO [train.py:996] (3/4) Epoch 10, batch 27850, loss[loss=0.2342, simple_loss=0.2967, pruned_loss=0.08587, over 21617.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3027, pruned_loss=0.08009, over 4290419.51 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:02:01,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=1813872.0, ans=12.0 2023-06-24 19:02:03,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1813872.0, ans=0.125 2023-06-24 19:02:14,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.84 vs. limit=22.5 2023-06-24 19:02:25,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1813932.0, ans=0.125 2023-06-24 19:03:14,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 19:03:15,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1814052.0, ans=0.125 2023-06-24 19:03:19,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-24 19:03:33,929 INFO [train.py:996] (3/4) Epoch 10, batch 27900, loss[loss=0.2271, simple_loss=0.3234, pruned_loss=0.06543, over 21630.00 frames. 
], tot_loss[loss=0.2375, simple_loss=0.3135, pruned_loss=0.08076, over 4293048.77 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:04:13,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814232.0, ans=0.1 2023-06-24 19:04:26,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1814232.0, ans=0.05 2023-06-24 19:04:37,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.993e+02 8.012e+02 1.137e+03 1.760e+03 3.581e+03, threshold=2.273e+03, percent-clipped=25.0 2023-06-24 19:05:14,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814352.0, ans=0.1 2023-06-24 19:05:21,031 INFO [train.py:996] (3/4) Epoch 10, batch 27950, loss[loss=0.192, simple_loss=0.2822, pruned_loss=0.05087, over 21543.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3134, pruned_loss=0.07643, over 4289482.76 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:05:23,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814412.0, ans=0.1 2023-06-24 19:05:23,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-24 19:05:43,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-24 19:06:06,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1814532.0, ans=0.2 2023-06-24 19:07:07,508 INFO [train.py:996] (3/4) Epoch 10, batch 28000, loss[loss=0.2662, simple_loss=0.3336, pruned_loss=0.09937, over 21658.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3116, pruned_loss=0.07468, over 4290216.51 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:07:15,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1814712.0, ans=0.0 2023-06-24 19:07:29,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-24 19:08:13,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.030e+02 6.703e+02 9.141e+02 1.311e+03 2.491e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 19:08:56,431 INFO [train.py:996] (3/4) Epoch 10, batch 28050, loss[loss=0.2043, simple_loss=0.279, pruned_loss=0.06481, over 21668.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3085, pruned_loss=0.07582, over 4290077.91 frames. 
], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:08:56,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1815012.0, ans=0.125 2023-06-24 19:09:29,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1815072.0, ans=0.0 2023-06-24 19:09:47,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1815132.0, ans=0.125 2023-06-24 19:10:16,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1815192.0, ans=0.0 2023-06-24 19:10:17,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1815192.0, ans=0.125 2023-06-24 19:10:37,042 INFO [train.py:996] (3/4) Epoch 10, batch 28100, loss[loss=0.2333, simple_loss=0.2887, pruned_loss=0.08897, over 21576.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3061, pruned_loss=0.07641, over 4284992.23 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:11:20,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-24 19:11:45,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1815432.0, ans=0.125 2023-06-24 19:11:53,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.563e+02 7.316e+02 9.478e+02 1.492e+03 2.732e+03, threshold=1.896e+03, percent-clipped=11.0 2023-06-24 19:12:05,380 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:12:25,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1815552.0, ans=0.0 2023-06-24 19:12:28,316 INFO [train.py:996] (3/4) Epoch 10, batch 28150, loss[loss=0.2241, simple_loss=0.2871, pruned_loss=0.08052, over 21761.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2985, pruned_loss=0.07626, over 4277011.42 frames. ], batch size: 317, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:13:44,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1815792.0, ans=0.0 2023-06-24 19:13:48,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1815792.0, ans=0.125 2023-06-24 19:13:51,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1815792.0, ans=0.0 2023-06-24 19:14:22,720 INFO [train.py:996] (3/4) Epoch 10, batch 28200, loss[loss=0.2726, simple_loss=0.33, pruned_loss=0.1076, over 21366.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2985, pruned_loss=0.07778, over 4278239.97 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:14:34,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1815912.0, ans=0.125 2023-06-24 19:14:36,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1815912.0, ans=0.0 2023-06-24 19:14:43,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-24 19:14:47,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1815972.0, ans=0.125 2023-06-24 19:15:26,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1816092.0, ans=0.0 2023-06-24 19:15:29,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.74 vs. limit=6.0 2023-06-24 19:15:29,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 7.583e+02 1.086e+03 1.711e+03 4.107e+03, threshold=2.171e+03, percent-clipped=18.0 2023-06-24 19:15:31,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1816092.0, ans=0.125 2023-06-24 19:15:33,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1816092.0, ans=0.125 2023-06-24 19:16:01,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1816152.0, ans=0.125 2023-06-24 19:16:09,709 INFO [train.py:996] (3/4) Epoch 10, batch 28250, loss[loss=0.1822, simple_loss=0.2486, pruned_loss=0.05785, over 20799.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3031, pruned_loss=0.08043, over 4273997.49 frames. ], batch size: 609, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:16:14,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1816212.0, ans=0.0 2023-06-24 19:16:25,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1816272.0, ans=0.0 2023-06-24 19:16:40,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-24 19:17:01,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1816332.0, ans=0.0 2023-06-24 19:17:57,259 INFO [train.py:996] (3/4) Epoch 10, batch 28300, loss[loss=0.2022, simple_loss=0.2927, pruned_loss=0.05588, over 21627.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3021, pruned_loss=0.07815, over 4258083.88 frames. 
], batch size: 414, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:18:07,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1816512.0, ans=0.0 2023-06-24 19:18:42,889 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:18:48,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1816632.0, ans=0.125 2023-06-24 19:19:02,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.412e+02 7.987e+02 1.217e+03 2.083e+03 3.970e+03, threshold=2.435e+03, percent-clipped=20.0 2023-06-24 19:19:11,635 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:19:32,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1816752.0, ans=0.125 2023-06-24 19:19:35,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-24 19:19:43,769 INFO [train.py:996] (3/4) Epoch 10, batch 28350, loss[loss=0.1938, simple_loss=0.266, pruned_loss=0.06075, over 21485.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2982, pruned_loss=0.07307, over 4239828.00 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:20:18,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1816872.0, ans=0.125 2023-06-24 19:20:25,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1816872.0, ans=0.0 2023-06-24 19:20:39,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-24 19:21:15,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1817052.0, ans=0.09899494936611666 2023-06-24 19:21:30,124 INFO [train.py:996] (3/4) Epoch 10, batch 28400, loss[loss=0.238, simple_loss=0.2997, pruned_loss=0.08813, over 21395.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.295, pruned_loss=0.07358, over 4245026.05 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:21:59,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.96 vs. limit=15.0 2023-06-24 19:22:45,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1817292.0, ans=0.125 2023-06-24 19:22:48,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.250e+02 7.623e+02 1.000e+03 1.515e+03 3.438e+03, threshold=2.000e+03, percent-clipped=5.0 2023-06-24 19:22:49,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1817292.0, ans=0.0 2023-06-24 19:23:09,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-24 19:23:22,172 INFO [train.py:996] (3/4) Epoch 10, batch 28450, loss[loss=0.2375, simple_loss=0.311, pruned_loss=0.08196, over 21862.00 frames. 
], tot_loss[loss=0.2295, simple_loss=0.3019, pruned_loss=0.07858, over 4259385.43 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:23:39,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1817412.0, ans=0.09899494936611666 2023-06-24 19:23:48,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1817472.0, ans=0.2 2023-06-24 19:23:54,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1817472.0, ans=0.0 2023-06-24 19:24:07,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-24 19:24:30,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1817532.0, ans=0.0 2023-06-24 19:24:33,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1817592.0, ans=0.125 2023-06-24 19:24:39,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-06-24 19:25:19,428 INFO [train.py:996] (3/4) Epoch 10, batch 28500, loss[loss=0.2197, simple_loss=0.2933, pruned_loss=0.07304, over 20694.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3038, pruned_loss=0.08056, over 4266036.90 frames. ], batch size: 607, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:25:28,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1817712.0, ans=0.0 2023-06-24 19:26:27,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.659e+02 7.035e+02 9.248e+02 1.262e+03 2.146e+03, threshold=1.850e+03, percent-clipped=2.0 2023-06-24 19:27:07,855 INFO [train.py:996] (3/4) Epoch 10, batch 28550, loss[loss=0.269, simple_loss=0.3678, pruned_loss=0.08507, over 21786.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3131, pruned_loss=0.08347, over 4275039.69 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:27:10,244 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.568e-03 2023-06-24 19:27:52,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1818132.0, ans=0.125 2023-06-24 19:28:00,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1818132.0, ans=0.125 2023-06-24 19:28:20,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-24 19:28:55,065 INFO [train.py:996] (3/4) Epoch 10, batch 28600, loss[loss=0.261, simple_loss=0.3329, pruned_loss=0.09451, over 21768.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3196, pruned_loss=0.08506, over 4279282.49 frames. 
2023-06-24 19:29:02,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1818312.0, ans=0.5
2023-06-24 19:29:11,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5
2023-06-24 19:29:36,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0
2023-06-24 19:30:08,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 7.121e+02 1.044e+03 1.505e+03 3.342e+03, threshold=2.089e+03, percent-clipped=18.0
2023-06-24 19:30:25,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1818552.0, ans=0.2
2023-06-24 19:30:33,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1818552.0, ans=0.0
2023-06-24 19:30:42,099 INFO [train.py:996] (3/4) Epoch 10, batch 28650, loss[loss=0.2089, simple_loss=0.2756, pruned_loss=0.07105, over 21542.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3124, pruned_loss=0.0837, over 4284039.61 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 8.0
2023-06-24 19:30:59,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1818612.0, ans=0.07
2023-06-24 19:31:10,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1818672.0, ans=0.0
2023-06-24 19:32:14,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1818852.0, ans=0.125
2023-06-24 19:32:19,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1818852.0, ans=0.125
2023-06-24 19:32:33,997 INFO [train.py:996] (3/4) Epoch 10, batch 28700, loss[loss=0.2296, simple_loss=0.305, pruned_loss=0.07713, over 21871.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3123, pruned_loss=0.08515, over 4277354.16 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 8.0
2023-06-24 19:32:53,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1818912.0, ans=0.0
2023-06-24 19:33:16,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1819032.0, ans=0.125
2023-06-24 19:33:46,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1819092.0, ans=0.125
2023-06-24 19:33:52,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.396e+02 6.260e+02 7.915e+02 1.085e+03 2.283e+03, threshold=1.583e+03, percent-clipped=3.0
2023-06-24 19:34:23,848 INFO [train.py:996] (3/4) Epoch 10, batch 28750, loss[loss=0.2117, simple_loss=0.293, pruned_loss=0.0652, over 21478.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3119, pruned_loss=0.08503, over 4275453.23 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 8.0
2023-06-24 19:34:42,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1819212.0, ans=0.2
2023-06-24 19:35:39,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1819392.0, ans=0.0
2023-06-24 19:35:56,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1819452.0, ans=0.2
2023-06-24 19:36:21,341 INFO [train.py:996] (3/4) Epoch 10, batch 28800, loss[loss=0.2433, simple_loss=0.3189, pruned_loss=0.08386, over 21773.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3145, pruned_loss=0.08468, over 4285568.71 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:37:15,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1819632.0, ans=0.2
2023-06-24 19:37:29,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.492e+02 9.041e+02 1.378e+03 2.887e+03, threshold=1.808e+03, percent-clipped=17.0
2023-06-24 19:37:35,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1819692.0, ans=0.04949747468305833
2023-06-24 19:38:01,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1819752.0, ans=0.125
2023-06-24 19:38:07,886 INFO [train.py:996] (3/4) Epoch 10, batch 28850, loss[loss=0.2235, simple_loss=0.2969, pruned_loss=0.07506, over 21860.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3152, pruned_loss=0.08564, over 4286082.81 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:39:02,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1819932.0, ans=0.0
2023-06-24 19:39:47,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1820052.0, ans=0.2
2023-06-24 19:39:49,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5
2023-06-24 19:39:54,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1820052.0, ans=6.0
2023-06-24 19:39:58,234 INFO [train.py:996] (3/4) Epoch 10, batch 28900, loss[loss=0.2464, simple_loss=0.3105, pruned_loss=0.0911, over 21353.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3175, pruned_loss=0.08704, over 4287072.49 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:40:17,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0
2023-06-24 19:40:23,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1820172.0, ans=0.125
2023-06-24 19:40:39,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1820172.0, ans=0.125
2023-06-24 19:40:46,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1820232.0, ans=0.0
2023-06-24 19:41:22,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.789e+02 7.717e+02 1.074e+03 1.480e+03 3.570e+03, threshold=2.148e+03, percent-clipped=12.0
2023-06-24 19:41:56,134 INFO [train.py:996] (3/4) Epoch 10, batch 28950, loss[loss=0.2781, simple_loss=0.4013, pruned_loss=0.07747, over 19776.00 frames. ], tot_loss[loss=0.244, simple_loss=0.317, pruned_loss=0.08553, over 4282222.36 frames. ], batch size: 703, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:41:59,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-06-24 19:42:21,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1820472.0, ans=0.2
2023-06-24 19:43:18,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1820592.0, ans=0.05
2023-06-24 19:43:34,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820652.0, ans=0.125
2023-06-24 19:43:53,412 INFO [train.py:996] (3/4) Epoch 10, batch 29000, loss[loss=0.285, simple_loss=0.3472, pruned_loss=0.1114, over 21288.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3204, pruned_loss=0.08454, over 4277285.10 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:44:13,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1820772.0, ans=0.1
2023-06-24 19:45:02,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.875e+02 6.963e+02 8.880e+02 1.390e+03 4.828e+03, threshold=1.776e+03, percent-clipped=11.0
2023-06-24 19:45:04,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1820892.0, ans=0.125
2023-06-24 19:45:12,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1820892.0, ans=0.125
2023-06-24 19:45:41,323 INFO [train.py:996] (3/4) Epoch 10, batch 29050, loss[loss=0.2321, simple_loss=0.303, pruned_loss=0.08058, over 21855.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3187, pruned_loss=0.085, over 4284042.76 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:45:42,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5
2023-06-24 19:46:06,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1821072.0, ans=0.015
2023-06-24 19:46:14,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1821072.0, ans=0.125
2023-06-24 19:46:38,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0
2023-06-24 19:47:19,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1821252.0, ans=0.0
2023-06-24 19:47:27,535 INFO [train.py:996] (3/4) Epoch 10, batch 29100, loss[loss=0.1782, simple_loss=0.2394, pruned_loss=0.05852, over 21205.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3102, pruned_loss=0.08278, over 4275806.05 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:47:42,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0
2023-06-24 19:47:46,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821372.0, ans=0.1
2023-06-24 19:48:15,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1821432.0, ans=0.0
2023-06-24 19:48:40,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 7.571e+02 1.020e+03 1.542e+03 3.510e+03, threshold=2.040e+03, percent-clipped=14.0
2023-06-24 19:49:15,508 INFO [train.py:996] (3/4) Epoch 10, batch 29150, loss[loss=0.2738, simple_loss=0.3737, pruned_loss=0.08697, over 21214.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3109, pruned_loss=0.08246, over 4279537.12 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 19:49:32,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0
2023-06-24 19:50:25,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1821792.0, ans=0.125
2023-06-24 19:50:27,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1821792.0, ans=0.0
2023-06-24 19:50:42,197 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 19:51:04,094 INFO [train.py:996] (3/4) Epoch 10, batch 29200, loss[loss=0.216, simple_loss=0.2822, pruned_loss=0.07487, over 21718.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.308, pruned_loss=0.0819, over 4269717.24 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 19:51:43,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1821972.0, ans=0.125
2023-06-24 19:51:46,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0
2023-06-24 19:52:17,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 7.371e+02 1.140e+03 1.562e+03 2.800e+03, threshold=2.281e+03, percent-clipped=10.0
2023-06-24 19:52:30,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1822092.0, ans=0.04949747468305833
2023-06-24 19:52:35,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1822152.0, ans=0.125
2023-06-24 19:52:53,159 INFO [train.py:996] (3/4) Epoch 10, batch 29250, loss[loss=0.2182, simple_loss=0.3063, pruned_loss=0.06509, over 21637.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3043, pruned_loss=0.07852, over 4259088.41 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 19:53:00,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1822212.0, ans=0.0
2023-06-24 19:53:02,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1822212.0, ans=0.0
2023-06-24 19:53:02,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0
2023-06-24 19:53:04,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5
2023-06-24 19:53:36,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1822272.0, ans=0.015
2023-06-24 19:53:56,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1822332.0, ans=0.1
2023-06-24 19:54:41,029 INFO [train.py:996] (3/4) Epoch 10, batch 29300, loss[loss=0.2211, simple_loss=0.2821, pruned_loss=0.08004, over 21778.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3061, pruned_loss=0.07755, over 4263928.79 frames. ], batch size: 124, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 19:54:46,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1822512.0, ans=0.1
2023-06-24 19:55:17,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1822572.0, ans=0.05
2023-06-24 19:55:58,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0
2023-06-24 19:56:02,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.304e+02 6.696e+02 9.053e+02 1.473e+03 3.162e+03, threshold=1.811e+03, percent-clipped=5.0
2023-06-24 19:56:30,732 INFO [train.py:996] (3/4) Epoch 10, batch 29350, loss[loss=0.1988, simple_loss=0.2592, pruned_loss=0.06918, over 21583.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.302, pruned_loss=0.0765, over 4270875.08 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 19:57:04,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1822872.0, ans=0.0
2023-06-24 19:57:15,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1822872.0, ans=0.125
2023-06-24 19:57:21,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2023-06-24 19:57:30,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1822932.0, ans=0.125
2023-06-24 19:57:57,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1822992.0, ans=0.0
2023-06-24 19:58:31,662 INFO [train.py:996] (3/4) Epoch 10, batch 29400, loss[loss=0.2838, simple_loss=0.3565, pruned_loss=0.1055, over 21473.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3002, pruned_loss=0.07443, over 4269178.39 frames. ], batch size: 508, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 19:58:50,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1823112.0, ans=0.5
2023-06-24 19:59:06,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823172.0, ans=0.1
2023-06-24 19:59:39,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.217e+02 1.208e+03 1.784e+03 4.520e+03, threshold=2.416e+03, percent-clipped=24.0
2023-06-24 20:00:18,168 INFO [train.py:996] (3/4) Epoch 10, batch 29450, loss[loss=0.2503, simple_loss=0.3263, pruned_loss=0.08717, over 21369.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2999, pruned_loss=0.07392, over 4270592.87 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 32.0
2023-06-24 20:00:18,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1823412.0, ans=0.0
2023-06-24 20:00:38,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823472.0, ans=0.1
2023-06-24 20:00:38,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1823472.0, ans=0.0
2023-06-24 20:01:23,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0
2023-06-24 20:01:58,215 INFO [train.py:996] (3/4) Epoch 10, batch 29500, loss[loss=0.2476, simple_loss=0.3167, pruned_loss=0.08927, over 21314.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3042, pruned_loss=0.07669, over 4271944.78 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 20:02:08,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1823712.0, ans=0.125
2023-06-24 20:02:10,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823712.0, ans=0.1
2023-06-24 20:02:29,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5
2023-06-24 20:03:06,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.744e+02 7.201e+02 9.951e+02 1.281e+03 3.079e+03, threshold=1.990e+03, percent-clipped=2.0
2023-06-24 20:03:22,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.94 vs. limit=10.0
2023-06-24 20:03:35,776 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 20:03:45,380 INFO [train.py:996] (3/4) Epoch 10, batch 29550, loss[loss=0.2352, simple_loss=0.305, pruned_loss=0.08276, over 21847.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.304, pruned_loss=0.07857, over 4275002.07 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 20:03:49,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5
2023-06-24 20:04:47,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1824192.0, ans=0.125
2023-06-24 20:04:53,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1824192.0, ans=0.1
2023-06-24 20:05:14,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1824252.0, ans=0.0
2023-06-24 20:05:38,091 INFO [train.py:996] (3/4) Epoch 10, batch 29600, loss[loss=0.3522, simple_loss=0.4661, pruned_loss=0.1191, over 19839.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3123, pruned_loss=0.0815, over 4279099.93 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0
2023-06-24 20:06:12,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1824372.0, ans=0.0
2023-06-24 20:06:14,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1824372.0, ans=0.09899494936611666
2023-06-24 20:06:54,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 8.080e+02 1.124e+03 1.418e+03 4.047e+03, threshold=2.247e+03, percent-clipped=9.0
2023-06-24 20:06:57,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1824492.0, ans=0.2
2023-06-24 20:07:22,971 INFO [train.py:996] (3/4) Epoch 10, batch 29650, loss[loss=0.1738, simple_loss=0.2591, pruned_loss=0.04422, over 21772.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.312, pruned_loss=0.07941, over 4284015.63 frames. ], batch size: 316, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:07:32,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1824612.0, ans=0.0
2023-06-24 20:07:51,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1824672.0, ans=0.125
2023-06-24 20:08:45,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1824792.0, ans=0.125
2023-06-24 20:09:07,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1824852.0, ans=0.2
2023-06-24 20:09:10,401 INFO [train.py:996] (3/4) Epoch 10, batch 29700, loss[loss=0.195, simple_loss=0.2674, pruned_loss=0.06133, over 21148.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3111, pruned_loss=0.07906, over 4287977.14 frames. ], batch size: 608, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:09:21,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0
2023-06-24 20:09:32,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1824972.0, ans=0.025
2023-06-24 20:10:29,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 7.202e+02 1.096e+03 1.682e+03 4.155e+03, threshold=2.193e+03, percent-clipped=13.0
2023-06-24 20:10:32,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0
2023-06-24 20:10:55,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=12.0
2023-06-24 20:10:58,048 INFO [train.py:996] (3/4) Epoch 10, batch 29750, loss[loss=0.2364, simple_loss=0.3552, pruned_loss=0.05881, over 20870.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3167, pruned_loss=0.07917, over 4285024.55 frames. ], batch size: 607, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:11:01,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825212.0, ans=0.1
2023-06-24 20:12:25,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1825452.0, ans=0.0
2023-06-24 20:12:44,347 INFO [train.py:996] (3/4) Epoch 10, batch 29800, loss[loss=0.2663, simple_loss=0.331, pruned_loss=0.1009, over 21840.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3189, pruned_loss=0.08074, over 4294158.72 frames. ], batch size: 107, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:13:59,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1825692.0, ans=0.0
2023-06-24 20:14:03,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.156e+02 9.428e+02 1.298e+03 2.212e+03, threshold=1.886e+03, percent-clipped=2.0
2023-06-24 20:14:09,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1825752.0, ans=0.0
2023-06-24 20:14:14,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1825752.0, ans=0.125
2023-06-24 20:14:32,482 INFO [train.py:996] (3/4) Epoch 10, batch 29850, loss[loss=0.1996, simple_loss=0.2857, pruned_loss=0.0568, over 21775.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3141, pruned_loss=0.07829, over 4295183.30 frames. ], batch size: 414, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:14:34,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1825812.0, ans=0.125
2023-06-24 20:15:34,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1825992.0, ans=0.05
2023-06-24 20:15:52,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1826052.0, ans=0.125
2023-06-24 20:16:10,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1826052.0, ans=0.5
2023-06-24 20:16:13,613 INFO [train.py:996] (3/4) Epoch 10, batch 29900, loss[loss=0.3035, simple_loss=0.3641, pruned_loss=0.1215, over 21817.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3122, pruned_loss=0.07983, over 4304047.21 frames. ], batch size: 441, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:17:37,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.094e+02 6.347e+02 7.852e+02 1.102e+03 2.302e+03, threshold=1.570e+03, percent-clipped=5.0
2023-06-24 20:18:07,360 INFO [train.py:996] (3/4) Epoch 10, batch 29950, loss[loss=0.2651, simple_loss=0.3367, pruned_loss=0.09673, over 21694.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3142, pruned_loss=0.08294, over 4298634.34 frames. ], batch size: 351, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:18:34,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1826472.0, ans=0.1
2023-06-24 20:19:46,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1826652.0, ans=0.2
2023-06-24 20:19:55,931 INFO [train.py:996] (3/4) Epoch 10, batch 30000, loss[loss=0.2102, simple_loss=0.3066, pruned_loss=0.05687, over 21829.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3167, pruned_loss=0.0833, over 4297171.52 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 32.0
2023-06-24 20:19:55,931 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-24 20:20:14,322 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2483, simple_loss=0.3443, pruned_loss=0.07614, over 1796401.00 frames.
2023-06-24 20:20:14,323 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-24 20:20:16,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1826712.0, ans=0.125
2023-06-24 20:20:41,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1826712.0, ans=0.125
2023-06-24 20:20:44,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1826772.0, ans=0.0
2023-06-24 20:21:37,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1826892.0, ans=0.125
2023-06-24 20:21:40,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 6.464e+02 9.828e+02 1.489e+03 3.469e+03, threshold=1.966e+03, percent-clipped=22.0
2023-06-24 20:22:11,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826952.0, ans=0.1
2023-06-24 20:22:16,882 INFO [train.py:996] (3/4) Epoch 10, batch 30050, loss[loss=0.2469, simple_loss=0.341, pruned_loss=0.07644, over 21646.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3222, pruned_loss=0.08147, over 4300251.63 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 32.0
2023-06-24 20:23:09,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1827132.0, ans=0.125
2023-06-24 20:23:13,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1827132.0, ans=0.0
2023-06-24 20:23:22,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1827192.0, ans=0.125
2023-06-24 20:24:02,292 INFO [train.py:996] (3/4) Epoch 10, batch 30100, loss[loss=0.2256, simple_loss=0.2812, pruned_loss=0.085, over 21958.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3204, pruned_loss=0.0806, over 4295935.49 frames. ], batch size: 119, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:24:35,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0
2023-06-24 20:25:02,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1827492.0, ans=0.125
2023-06-24 20:25:20,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.739e+02 1.138e+03 1.850e+03 3.841e+03, threshold=2.275e+03, percent-clipped=20.0
2023-06-24 20:25:39,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1827552.0, ans=0.125
2023-06-24 20:25:54,240 INFO [train.py:996] (3/4) Epoch 10, batch 30150, loss[loss=0.2527, simple_loss=0.319, pruned_loss=0.09316, over 21909.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3154, pruned_loss=0.08148, over 4293332.92 frames. ], batch size: 372, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:25:56,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1827612.0, ans=22.5
2023-06-24 20:26:05,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1827612.0, ans=0.125
2023-06-24 20:26:51,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=12.0
2023-06-24 20:26:52,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1827732.0, ans=0.5
2023-06-24 20:27:13,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1827792.0, ans=0.0
2023-06-24 20:27:43,928 INFO [train.py:996] (3/4) Epoch 10, batch 30200, loss[loss=0.2828, simple_loss=0.3581, pruned_loss=0.1037, over 21423.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3169, pruned_loss=0.0809, over 4284067.13 frames. ], batch size: 131, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:27:58,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1827912.0, ans=0.125
2023-06-24 20:28:03,549 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-24 20:28:13,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1827972.0, ans=0.125
2023-06-24 20:28:45,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1828032.0, ans=0.125
2023-06-24 20:29:01,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.86 vs. limit=15.0
2023-06-24 20:29:09,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.062e+02 7.107e+02 1.003e+03 1.591e+03 3.654e+03, threshold=2.006e+03, percent-clipped=8.0
2023-06-24 20:29:37,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1828212.0, ans=0.125
2023-06-24 20:29:38,177 INFO [train.py:996] (3/4) Epoch 10, batch 30250, loss[loss=0.2501, simple_loss=0.3594, pruned_loss=0.07039, over 21783.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3239, pruned_loss=0.08301, over 4278237.59 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:30:03,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1828272.0, ans=0.0
2023-06-24 20:30:03,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1828272.0, ans=0.125
2023-06-24 20:30:36,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1828332.0, ans=0.125
2023-06-24 20:30:46,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1828392.0, ans=0.0
2023-06-24 20:31:03,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1828452.0, ans=0.125
2023-06-24 20:31:03,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1828452.0, ans=0.125
2023-06-24 20:31:24,950 INFO [train.py:996] (3/4) Epoch 10, batch 30300, loss[loss=0.2317, simple_loss=0.2916, pruned_loss=0.08594, over 21614.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3228, pruned_loss=0.0834, over 4280824.84 frames. ], batch size: 298, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:32:05,529 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-24 20:32:42,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1828692.0, ans=0.125
2023-06-24 20:32:47,060 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.345e+02 9.964e+02 1.414e+03 3.448e+03, threshold=1.993e+03, percent-clipped=9.0
2023-06-24 20:33:04,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1828752.0, ans=0.2
2023-06-24 20:33:07,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5
2023-06-24 20:33:21,522 INFO [train.py:996] (3/4) Epoch 10, batch 30350, loss[loss=0.2287, simple_loss=0.3077, pruned_loss=0.07486, over 21573.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3233, pruned_loss=0.08441, over 4276142.40 frames. ], batch size: 263, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:33:55,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1828872.0, ans=0.0
2023-06-24 20:33:57,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1828872.0, ans=0.0
2023-06-24 20:34:40,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1829052.0, ans=0.125
2023-06-24 20:34:50,309 INFO [train.py:996] (3/4) Epoch 10, batch 30400, loss[loss=0.2325, simple_loss=0.2829, pruned_loss=0.09101, over 20187.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3168, pruned_loss=0.08297, over 4257823.15 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 32.0
2023-06-24 20:34:50,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1829112.0, ans=0.125
2023-06-24 20:35:57,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.475e+02 9.226e+02 1.502e+03 2.385e+03 9.827e+03, threshold=3.004e+03, percent-clipped=34.0
2023-06-24 20:36:18,151 INFO [train.py:996] (3/4) Epoch 10, batch 30450, loss[loss=0.3163, simple_loss=0.4307, pruned_loss=0.101, over 19872.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.319, pruned_loss=0.08233, over 4199333.44 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0
2023-06-24 20:36:40,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1829472.0, ans=0.0
2023-06-24 20:37:15,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1829592.0, ans=0.125
2023-06-24 20:39:22,093 INFO [train.py:996] (3/4) Epoch 11, batch 0, loss[loss=0.2386, simple_loss=0.2996, pruned_loss=0.08878, over 21592.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.2996, pruned_loss=0.08878, over 21592.00 frames. ], batch size: 247, lr: 2.72e-03, grad_scale: 32.0
2023-06-24 20:39:22,093 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-24 20:39:38,858 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2455, simple_loss=0.3504, pruned_loss=0.07029, over 1796401.00 frames.
2023-06-24 20:39:38,859 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-24 20:39:53,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1829676.0, ans=0.125
2023-06-24 20:40:36,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1829856.0, ans=0.2
2023-06-24 20:41:04,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 1.420e+03 2.190e+03 4.363e+03 1.061e+04, threshold=4.380e+03, percent-clipped=34.0
2023-06-24 20:41:20,405 INFO [train.py:996] (3/4) Epoch 11, batch 50, loss[loss=0.2846, simple_loss=0.3917, pruned_loss=0.0888, over 21794.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3169, pruned_loss=0.07887, over 954113.43 frames. ], batch size: 282, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:41:26,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.04 vs. limit=15.0
2023-06-24 20:41:49,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1830036.0, ans=0.125
2023-06-24 20:42:06,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1830096.0, ans=0.0
2023-06-24 20:42:19,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1830156.0, ans=0.025
2023-06-24 20:42:24,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1830156.0, ans=0.0
2023-06-24 20:42:33,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830216.0, ans=0.1
2023-06-24 20:42:56,736 INFO [train.py:996] (3/4) Epoch 11, batch 100, loss[loss=0.2438, simple_loss=0.351, pruned_loss=0.06831, over 21734.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3301, pruned_loss=0.08067, over 1696964.34 frames. ], batch size: 332, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:43:10,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1830276.0, ans=0.0
2023-06-24 20:43:10,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5
2023-06-24 20:43:41,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1830396.0, ans=0.125
2023-06-24 20:44:04,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5
2023-06-24 20:44:30,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.668e+02 7.799e+02 1.011e+03 1.345e+03 2.704e+03, threshold=2.023e+03, percent-clipped=0.0
2023-06-24 20:44:50,609 INFO [train.py:996] (3/4) Epoch 11, batch 150, loss[loss=0.2254, simple_loss=0.315, pruned_loss=0.0679, over 21212.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3333, pruned_loss=0.08169, over 2266763.31 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:44:57,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1830576.0, ans=0.05
2023-06-24 20:45:16,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0
2023-06-24 20:46:01,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5
2023-06-24 20:46:30,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1830876.0, ans=0.2
2023-06-24 20:46:31,225 INFO [train.py:996] (3/4) Epoch 11, batch 200, loss[loss=0.2567, simple_loss=0.317, pruned_loss=0.09816, over 21824.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3296, pruned_loss=0.08131, over 2711565.48 frames. ], batch size: 124, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:47:11,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=8.0
2023-06-24 20:47:20,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0
2023-06-24 20:47:49,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1831116.0, ans=0.0
2023-06-24 20:47:55,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 7.230e+02 1.005e+03 1.517e+03 6.245e+03, threshold=2.009e+03, percent-clipped=15.0
2023-06-24 20:47:55,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1831116.0, ans=0.2
2023-06-24 20:48:05,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5
2023-06-24 20:48:06,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1831116.0, ans=0.0
2023-06-24 20:48:15,604 INFO [train.py:996] (3/4) Epoch 11, batch 250, loss[loss=0.2363, simple_loss=0.3177, pruned_loss=0.07742, over 20770.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3275, pruned_loss=0.08046, over 3044595.66 frames. ], batch size: 608, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:48:48,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831236.0, ans=0.1
2023-06-24 20:49:37,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0
2023-06-24 20:50:01,277 INFO [train.py:996] (3/4) Epoch 11, batch 300, loss[loss=0.2249, simple_loss=0.2913, pruned_loss=0.0793, over 21663.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3227, pruned_loss=0.08125, over 3321632.77 frames. ], batch size: 230, lr: 2.72e-03, grad_scale: 8.0
2023-06-24 20:50:21,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831536.0, ans=0.1
2023-06-24 20:51:09,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1831656.0, ans=0.125
2023-06-24 20:51:12,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831656.0, ans=0.1
2023-06-24 20:51:30,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 7.499e+02 1.164e+03 1.692e+03 3.059e+03, threshold=2.329e+03, percent-clipped=16.0
2023-06-24 20:51:38,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1831716.0, ans=0.0
2023-06-24 20:51:50,456 INFO [train.py:996] (3/4) Epoch 11, batch 350, loss[loss=0.2031, simple_loss=0.2698, pruned_loss=0.06818, over 21483.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3159, pruned_loss=0.08055, over 3540775.46 frames. ], batch size: 132, lr: 2.72e-03, grad_scale: 8.0
2023-06-24 20:51:57,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1831776.0, ans=0.125
2023-06-24 20:52:02,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831776.0, ans=0.1
2023-06-24 20:52:03,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0
2023-06-24 20:52:15,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125
2023-06-24 20:52:19,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1831836.0, ans=0.0
2023-06-24 20:52:54,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0
2023-06-24 20:53:11,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1832016.0, ans=0.125
2023-06-24 20:53:22,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5
2023-06-24 20:53:25,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1832016.0, ans=0.1
2023-06-24 20:53:28,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1832016.0, ans=15.0
2023-06-24 20:53:31,091 INFO [train.py:996] (3/4) Epoch 11, batch 400, loss[loss=0.288, simple_loss=0.3936, pruned_loss=0.09123, over 21620.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3102, pruned_loss=0.07705, over 3702370.37 frames. ], batch size: 441, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:53:31,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1832076.0, ans=0.1
2023-06-24 20:53:39,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1832076.0, ans=0.2
2023-06-24 20:53:42,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1832076.0, ans=0.125
2023-06-24 20:54:16,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1832196.0, ans=0.2
2023-06-24 20:54:46,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1832256.0, ans=0.1
2023-06-24 20:55:11,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.125e+02 8.397e+02 1.523e+03 1.983e+03 4.862e+03, threshold=3.046e+03, percent-clipped=16.0
2023-06-24 20:55:18,132 INFO [train.py:996] (3/4) Epoch 11, batch 450, loss[loss=0.1737, simple_loss=0.2391, pruned_loss=0.05418, over 21359.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3043, pruned_loss=0.0755, over 3826645.94 frames. ], batch size: 131, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:55:41,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0
2023-06-24 20:55:45,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1832436.0, ans=0.125
2023-06-24 20:55:55,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1832436.0, ans=0.125
2023-06-24 20:56:49,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1832616.0, ans=0.125
2023-06-24 20:57:04,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1832616.0, ans=0.0
2023-06-24 20:57:08,311 INFO [train.py:996] (3/4) Epoch 11, batch 500, loss[loss=0.226, simple_loss=0.2959, pruned_loss=0.07809, over 21774.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3035, pruned_loss=0.07489, over 3927981.46 frames. ], batch size: 124, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:57:40,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1832736.0, ans=0.125
2023-06-24 20:58:18,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1832856.0, ans=0.1
2023-06-24 20:58:32,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1832916.0, ans=0.125
2023-06-24 20:58:40,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 9.876e+02 1.724e+03 2.578e+03 4.436e+03, threshold=3.448e+03, percent-clipped=13.0
2023-06-24 20:58:53,037 INFO [train.py:996] (3/4) Epoch 11, batch 550, loss[loss=0.2435, simple_loss=0.335, pruned_loss=0.07596, over 21313.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.306, pruned_loss=0.07409, over 4006999.99 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0
2023-06-24 20:59:07,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0
2023-06-24 20:59:20,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0
2023-06-24 21:00:16,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1833216.0, ans=0.125
2023-06-24 21:00:21,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0
2023-06-24 21:00:31,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1833216.0, ans=0.2
2023-06-24 21:00:38,826 INFO [train.py:996] (3/4) Epoch 11, batch 600, loss[loss=0.2403, simple_loss=0.3284, pruned_loss=0.07608, over 21446.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3133, pruned_loss=0.0754, over 4075036.68 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:00:50,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1833276.0, ans=0.125
2023-06-24 21:01:41,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1833456.0, ans=0.125
2023-06-24 21:02:13,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.057e+02 1.048e+03 1.641e+03 3.624e+03, threshold=2.096e+03, percent-clipped=2.0
2023-06-24 21:02:26,655 INFO [train.py:996] (3/4) Epoch 11, batch 650, loss[loss=0.2237, simple_loss=0.2926, pruned_loss=0.07742, over 21272.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3175, pruned_loss=0.07665, over 4106833.13 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:03:22,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1833696.0, ans=0.1
2023-06-24 21:03:48,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0
2023-06-24 21:04:04,926 INFO [train.py:996] (3/4) Epoch 11, batch 700, loss[loss=0.2245, simple_loss=0.3002, pruned_loss=0.07439, over 21796.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3179, pruned_loss=0.07845, over 4151433.95 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:05:05,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5
2023-06-24 21:05:18,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1834056.0, ans=0.0
2023-06-24 21:05:44,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.745e+02 9.389e+02 1.418e+03 2.158e+03 4.228e+03, threshold=2.836e+03, percent-clipped=28.0
2023-06-24 21:05:51,296 INFO [train.py:996] (3/4) Epoch 11, batch 750, loss[loss=0.2208, simple_loss=0.2972, pruned_loss=0.07217, over 21476.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3155, pruned_loss=0.07978, over 4180478.13 frames. ], batch size: 212, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:06:07,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1834176.0, ans=0.0
2023-06-24 21:06:25,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1834236.0, ans=0.2
2023-06-24 21:06:34,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0
2023-06-24 21:07:06,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1834356.0, ans=0.125
2023-06-24 21:07:40,970 INFO [train.py:996] (3/4) Epoch 11, batch 800, loss[loss=0.3038, simple_loss=0.3934, pruned_loss=0.1071, over 21705.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3124, pruned_loss=0.08009, over 4209350.75 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:07:42,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.70 vs. limit=22.5
2023-06-24 21:07:50,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1834476.0, ans=0.125
2023-06-24 21:08:32,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1834596.0, ans=0.125
2023-06-24 21:09:09,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1834716.0, ans=0.0
2023-06-24 21:09:21,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 7.606e+02 1.363e+03 2.031e+03 4.976e+03, threshold=2.727e+03, percent-clipped=7.0
2023-06-24 21:09:32,171 INFO [train.py:996] (3/4) Epoch 11, batch 850, loss[loss=0.2377, simple_loss=0.3094, pruned_loss=0.08296, over 21856.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3111, pruned_loss=0.08126, over 4227629.04 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0
2023-06-24 21:09:44,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1834776.0, ans=0.125
2023-06-24 21:10:32,001 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0
limit=15.0 2023-06-24 21:10:52,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1834956.0, ans=10.0 2023-06-24 21:11:20,168 INFO [train.py:996] (3/4) Epoch 11, batch 900, loss[loss=0.1618, simple_loss=0.2411, pruned_loss=0.04121, over 21783.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3077, pruned_loss=0.07957, over 4242535.27 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:11:23,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1835076.0, ans=0.0 2023-06-24 21:11:36,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1835076.0, ans=0.125 2023-06-24 21:12:38,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1835256.0, ans=0.125 2023-06-24 21:12:55,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1835316.0, ans=0.0 2023-06-24 21:13:04,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.581e+02 9.805e+02 1.489e+03 3.191e+03, threshold=1.961e+03, percent-clipped=4.0 2023-06-24 21:13:08,468 INFO [train.py:996] (3/4) Epoch 11, batch 950, loss[loss=0.2823, simple_loss=0.333, pruned_loss=0.1158, over 21507.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3046, pruned_loss=0.07848, over 4253319.48 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:13:35,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-24 21:14:09,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-24 21:14:16,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1835556.0, ans=0.125 2023-06-24 21:14:27,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1835556.0, ans=0.0 2023-06-24 21:14:29,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1835556.0, ans=0.2 2023-06-24 21:14:54,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1835616.0, ans=0.125 2023-06-24 21:14:54,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1835616.0, ans=0.0 2023-06-24 21:14:57,543 INFO [train.py:996] (3/4) Epoch 11, batch 1000, loss[loss=0.2208, simple_loss=0.2899, pruned_loss=0.07585, over 21343.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3048, pruned_loss=0.07877, over 4261462.30 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:15:16,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.14 vs. 
limit=15.0 2023-06-24 21:16:13,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1835856.0, ans=0.125 2023-06-24 21:16:34,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835916.0, ans=0.1 2023-06-24 21:16:49,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 6.726e+02 9.371e+02 1.402e+03 3.411e+03, threshold=1.874e+03, percent-clipped=8.0 2023-06-24 21:16:53,242 INFO [train.py:996] (3/4) Epoch 11, batch 1050, loss[loss=0.2426, simple_loss=0.3113, pruned_loss=0.08698, over 21486.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3041, pruned_loss=0.07859, over 4273817.22 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:17:26,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1836036.0, ans=0.0 2023-06-24 21:17:57,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1836156.0, ans=0.125 2023-06-24 21:18:09,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1836156.0, ans=0.0 2023-06-24 21:18:27,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1836216.0, ans=0.125 2023-06-24 21:18:43,427 INFO [train.py:996] (3/4) Epoch 11, batch 1100, loss[loss=0.2146, simple_loss=0.2898, pruned_loss=0.06969, over 20213.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3042, pruned_loss=0.07746, over 4274759.44 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:19:17,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-24 21:20:26,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.856e+02 8.203e+02 1.251e+03 2.125e+03 4.416e+03, threshold=2.502e+03, percent-clipped=31.0 2023-06-24 21:20:36,468 INFO [train.py:996] (3/4) Epoch 11, batch 1150, loss[loss=0.2568, simple_loss=0.3193, pruned_loss=0.09712, over 21259.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3057, pruned_loss=0.07699, over 4270736.16 frames. ], batch size: 176, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:20:59,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836636.0, ans=0.1 2023-06-24 21:22:25,707 INFO [train.py:996] (3/4) Epoch 11, batch 1200, loss[loss=0.245, simple_loss=0.3207, pruned_loss=0.08464, over 21807.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3087, pruned_loss=0.07726, over 4271117.61 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:22:55,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1836936.0, ans=0.125 2023-06-24 21:23:25,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-24 21:23:59,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=22.5 2023-06-24 21:24:05,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 7.723e+02 1.059e+03 1.468e+03 2.676e+03, threshold=2.118e+03, percent-clipped=4.0 2023-06-24 21:24:14,371 INFO [train.py:996] (3/4) Epoch 11, batch 1250, loss[loss=0.2179, simple_loss=0.2642, pruned_loss=0.08577, over 20256.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3108, pruned_loss=0.07764, over 4268909.47 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:24:24,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0 2023-06-24 21:24:31,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1837236.0, ans=0.125 2023-06-24 21:24:35,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1837236.0, ans=0.2 2023-06-24 21:25:46,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1837416.0, ans=0.2 2023-06-24 21:26:04,412 INFO [train.py:996] (3/4) Epoch 11, batch 1300, loss[loss=0.2177, simple_loss=0.2976, pruned_loss=0.06886, over 21727.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3125, pruned_loss=0.07866, over 4278698.99 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:26:34,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1837536.0, ans=0.125 2023-06-24 21:27:06,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1837656.0, ans=0.2 2023-06-24 21:27:30,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1837716.0, ans=0.0 2023-06-24 21:27:52,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.439e+02 7.719e+02 9.846e+02 1.503e+03 2.792e+03, threshold=1.969e+03, percent-clipped=4.0 2023-06-24 21:27:53,885 INFO [train.py:996] (3/4) Epoch 11, batch 1350, loss[loss=0.2308, simple_loss=0.3068, pruned_loss=0.07737, over 21834.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3135, pruned_loss=0.07955, over 4281971.09 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:28:57,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. limit=6.0 2023-06-24 21:28:58,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837956.0, ans=0.1 2023-06-24 21:29:29,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1838016.0, ans=0.5 2023-06-24 21:29:32,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838016.0, ans=0.125 2023-06-24 21:29:43,453 INFO [train.py:996] (3/4) Epoch 11, batch 1400, loss[loss=0.2255, simple_loss=0.3025, pruned_loss=0.07422, over 21775.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3123, pruned_loss=0.08035, over 4281974.18 frames. 
], batch size: 98, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:30:16,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1838136.0, ans=0.0 2023-06-24 21:30:57,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1838256.0, ans=0.0 2023-06-24 21:31:12,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1838256.0, ans=0.125 2023-06-24 21:31:24,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1838316.0, ans=0.0 2023-06-24 21:31:31,853 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.111e+02 8.921e+02 1.291e+03 1.879e+03 3.355e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-24 21:31:33,592 INFO [train.py:996] (3/4) Epoch 11, batch 1450, loss[loss=0.2553, simple_loss=0.3552, pruned_loss=0.07766, over 21743.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3125, pruned_loss=0.08162, over 4282496.43 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:32:26,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-24 21:33:01,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838556.0, ans=0.1 2023-06-24 21:33:21,336 INFO [train.py:996] (3/4) Epoch 11, batch 1500, loss[loss=0.2547, simple_loss=0.3271, pruned_loss=0.0912, over 21795.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3142, pruned_loss=0.08264, over 4284891.19 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:34:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1838796.0, ans=0.125 2023-06-24 21:34:18,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-24 21:35:08,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.229e+02 8.037e+02 1.041e+03 1.486e+03 3.371e+03, threshold=2.081e+03, percent-clipped=9.0 2023-06-24 21:35:10,365 INFO [train.py:996] (3/4) Epoch 11, batch 1550, loss[loss=0.1931, simple_loss=0.2658, pruned_loss=0.0602, over 21376.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3101, pruned_loss=0.08114, over 4288530.68 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:35:58,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-24 21:36:40,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1839156.0, ans=0.125 2023-06-24 21:36:40,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 21:37:01,931 INFO [train.py:996] (3/4) Epoch 11, batch 1600, loss[loss=0.1793, simple_loss=0.2364, pruned_loss=0.06113, over 21812.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3082, pruned_loss=0.0806, over 4272807.53 frames. 
], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:37:14,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1839276.0, ans=0.125 2023-06-24 21:37:21,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839276.0, ans=0.1 2023-06-24 21:38:18,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-24 21:38:37,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.71 vs. limit=10.0 2023-06-24 21:38:59,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 8.023e+02 1.190e+03 1.791e+03 3.601e+03, threshold=2.379e+03, percent-clipped=18.0 2023-06-24 21:39:01,023 INFO [train.py:996] (3/4) Epoch 11, batch 1650, loss[loss=0.2577, simple_loss=0.3086, pruned_loss=0.1034, over 21427.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3077, pruned_loss=0.08021, over 4272084.59 frames. ], batch size: 473, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:40:18,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1839756.0, ans=0.025 2023-06-24 21:40:50,795 INFO [train.py:996] (3/4) Epoch 11, batch 1700, loss[loss=0.2509, simple_loss=0.3233, pruned_loss=0.08922, over 21579.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.31, pruned_loss=0.08007, over 4283630.94 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:41:47,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1839996.0, ans=0.07 2023-06-24 21:42:36,911 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.423e-03 2023-06-24 21:42:47,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.610e+02 1.287e+03 1.983e+03 3.488e+03, threshold=2.574e+03, percent-clipped=18.0 2023-06-24 21:42:48,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1840176.0, ans=0.125 2023-06-24 21:42:49,511 INFO [train.py:996] (3/4) Epoch 11, batch 1750, loss[loss=0.2846, simple_loss=0.3706, pruned_loss=0.09928, over 21682.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3103, pruned_loss=0.07984, over 4282709.03 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:43:05,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.19 vs. 
limit=15.0 2023-06-24 21:44:09,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1840356.0, ans=0.0 2023-06-24 21:44:11,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1840356.0, ans=0.125 2023-06-24 21:44:28,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840416.0, ans=0.1 2023-06-24 21:44:41,063 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:44:50,778 INFO [train.py:996] (3/4) Epoch 11, batch 1800, loss[loss=0.2177, simple_loss=0.3124, pruned_loss=0.06146, over 21688.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3093, pruned_loss=0.07785, over 4281113.10 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:44:56,848 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:45:13,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1840536.0, ans=0.125 2023-06-24 21:45:50,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1840596.0, ans=0.125 2023-06-24 21:45:51,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1840596.0, ans=0.125 2023-06-24 21:45:53,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-24 21:45:54,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-24 21:46:27,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1840716.0, ans=0.05 2023-06-24 21:46:40,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.892e+02 1.019e+03 1.831e+03 4.064e+03, threshold=2.037e+03, percent-clipped=9.0 2023-06-24 21:46:48,436 INFO [train.py:996] (3/4) Epoch 11, batch 1850, loss[loss=0.2319, simple_loss=0.3178, pruned_loss=0.07297, over 21879.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3072, pruned_loss=0.07523, over 4273858.72 frames. 
], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:47:02,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1840776.0, ans=0.125 2023-06-24 21:47:43,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840956.0, ans=0.1 2023-06-24 21:47:50,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1840956.0, ans=0.0 2023-06-24 21:48:08,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1841016.0, ans=0.125 2023-06-24 21:48:22,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1841016.0, ans=0.125 2023-06-24 21:48:32,720 INFO [train.py:996] (3/4) Epoch 11, batch 1900, loss[loss=0.1883, simple_loss=0.2754, pruned_loss=0.05064, over 21428.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3081, pruned_loss=0.07584, over 4281345.99 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:48:43,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-24 21:48:56,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1841136.0, ans=0.0 2023-06-24 21:49:22,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-24 21:49:27,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1841256.0, ans=0.125 2023-06-24 21:49:55,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1841256.0, ans=0.125 2023-06-24 21:50:20,910 INFO [train.py:996] (3/4) Epoch 11, batch 1950, loss[loss=0.2351, simple_loss=0.3025, pruned_loss=0.08388, over 21438.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3057, pruned_loss=0.0749, over 4279567.75 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 4.0 2023-06-24 21:50:22,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 9.605e+02 1.769e+03 2.616e+03 5.034e+03, threshold=3.539e+03, percent-clipped=42.0 2023-06-24 21:51:00,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1841496.0, ans=0.125 2023-06-24 21:51:32,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1841556.0, ans=0.125 2023-06-24 21:51:38,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1841556.0, ans=0.125 2023-06-24 21:52:02,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1841616.0, ans=0.2 2023-06-24 21:52:05,551 INFO [train.py:996] (3/4) Epoch 11, batch 2000, loss[loss=0.1647, simple_loss=0.2342, pruned_loss=0.04762, over 21760.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3003, pruned_loss=0.07328, over 4279884.70 frames. 
], batch size: 118, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:52:11,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1841676.0, ans=0.125 2023-06-24 21:52:12,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1841676.0, ans=0.125 2023-06-24 21:52:19,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1841676.0, ans=0.125 2023-06-24 21:52:21,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1841736.0, ans=0.125 2023-06-24 21:52:53,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1841796.0, ans=0.05 2023-06-24 21:53:55,969 INFO [train.py:996] (3/4) Epoch 11, batch 2050, loss[loss=0.2121, simple_loss=0.2813, pruned_loss=0.07145, over 21638.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3008, pruned_loss=0.07359, over 4281844.86 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:53:57,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.961e+02 9.295e+02 1.430e+03 2.343e+03 5.111e+03, threshold=2.860e+03, percent-clipped=7.0 2023-06-24 21:54:03,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1841976.0, ans=0.125 2023-06-24 21:54:08,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1841976.0, ans=0.09899494936611666 2023-06-24 21:55:47,678 INFO [train.py:996] (3/4) Epoch 11, batch 2100, loss[loss=0.292, simple_loss=0.3659, pruned_loss=0.109, over 21902.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3059, pruned_loss=0.0756, over 4283436.56 frames. ], batch size: 372, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:56:04,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1842336.0, ans=0.5 2023-06-24 21:56:19,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1842336.0, ans=0.125 2023-06-24 21:56:24,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1842336.0, ans=0.2 2023-06-24 21:57:38,441 INFO [train.py:996] (3/4) Epoch 11, batch 2150, loss[loss=0.2107, simple_loss=0.2765, pruned_loss=0.07251, over 21866.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3074, pruned_loss=0.07768, over 4286834.12 frames. ], batch size: 373, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:57:39,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.865e+02 8.663e+02 1.127e+03 1.659e+03 3.855e+03, threshold=2.253e+03, percent-clipped=2.0 2023-06-24 21:57:41,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1842576.0, ans=0.125 2023-06-24 21:57:44,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=22.5 2023-06-24 21:58:14,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1842636.0, ans=0.0 2023-06-24 21:58:48,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1842756.0, ans=10.0 2023-06-24 21:59:26,324 INFO [train.py:996] (3/4) Epoch 11, batch 2200, loss[loss=0.219, simple_loss=0.3439, pruned_loss=0.0471, over 19876.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3099, pruned_loss=0.07865, over 4287469.31 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:59:27,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-24 21:59:54,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1842936.0, ans=0.0 2023-06-24 22:00:04,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1842996.0, ans=0.2 2023-06-24 22:00:05,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-24 22:00:08,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1842996.0, ans=0.0 2023-06-24 22:00:39,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1843056.0, ans=0.2 2023-06-24 22:00:43,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1843056.0, ans=0.0 2023-06-24 22:01:16,567 INFO [train.py:996] (3/4) Epoch 11, batch 2250, loss[loss=0.2173, simple_loss=0.3137, pruned_loss=0.06049, over 21459.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3087, pruned_loss=0.07685, over 4282053.45 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:01:18,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.027e+02 1.396e+03 1.956e+03 3.592e+03, threshold=2.793e+03, percent-clipped=17.0 2023-06-24 22:01:36,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1843176.0, ans=0.1 2023-06-24 22:03:03,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 22:03:05,556 INFO [train.py:996] (3/4) Epoch 11, batch 2300, loss[loss=0.1986, simple_loss=0.2606, pruned_loss=0.06825, over 21830.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3054, pruned_loss=0.07594, over 4280233.99 frames. 
], batch size: 98, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:03:31,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1843536.0, ans=0.02 2023-06-24 22:03:37,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1843536.0, ans=0.0 2023-06-24 22:04:31,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1843656.0, ans=0.1 2023-06-24 22:04:47,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1843716.0, ans=0.125 2023-06-24 22:04:57,277 INFO [train.py:996] (3/4) Epoch 11, batch 2350, loss[loss=0.249, simple_loss=0.3045, pruned_loss=0.09669, over 21744.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2997, pruned_loss=0.07526, over 4281203.74 frames. ], batch size: 102, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:58,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 8.353e+02 1.301e+03 1.765e+03 5.491e+03, threshold=2.603e+03, percent-clipped=6.0 2023-06-24 22:05:15,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1843776.0, ans=0.125 2023-06-24 22:05:36,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-24 22:05:43,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-24 22:06:47,706 INFO [train.py:996] (3/4) Epoch 11, batch 2400, loss[loss=0.2423, simple_loss=0.2955, pruned_loss=0.09455, over 21527.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3008, pruned_loss=0.07749, over 4283310.95 frames. ], batch size: 391, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:06:54,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1844076.0, ans=0.2 2023-06-24 22:07:22,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1844136.0, ans=0.0 2023-06-24 22:07:54,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1844196.0, ans=0.0 2023-06-24 22:08:07,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1844256.0, ans=0.125 2023-06-24 22:08:20,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844316.0, ans=0.0 2023-06-24 22:08:29,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-24 22:08:31,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844316.0, ans=0.1 2023-06-24 22:08:44,042 INFO [train.py:996] (3/4) Epoch 11, batch 2450, loss[loss=0.2968, simple_loss=0.3652, pruned_loss=0.1141, over 21471.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.308, pruned_loss=0.08177, over 4288973.20 frames. 
], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:08:45,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.999e+02 9.036e+02 1.390e+03 1.907e+03 3.347e+03, threshold=2.779e+03, percent-clipped=7.0 2023-06-24 22:09:07,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1844436.0, ans=0.05 2023-06-24 22:09:08,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-24 22:09:09,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-06-24 22:10:24,537 INFO [train.py:996] (3/4) Epoch 11, batch 2500, loss[loss=0.2201, simple_loss=0.3199, pruned_loss=0.06019, over 21697.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3059, pruned_loss=0.08196, over 4261396.76 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:11:04,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1844736.0, ans=0.0 2023-06-24 22:11:09,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844736.0, ans=0.1 2023-06-24 22:11:26,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-24 22:11:40,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844856.0, ans=0.1 2023-06-24 22:11:40,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1844856.0, ans=0.125 2023-06-24 22:12:21,377 INFO [train.py:996] (3/4) Epoch 11, batch 2550, loss[loss=0.2088, simple_loss=0.287, pruned_loss=0.06536, over 21502.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3048, pruned_loss=0.07982, over 4259488.46 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:12:22,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 8.761e+02 1.237e+03 1.691e+03 3.223e+03, threshold=2.475e+03, percent-clipped=6.0 2023-06-24 22:12:34,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1844976.0, ans=0.125 2023-06-24 22:13:30,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1845156.0, ans=0.125 2023-06-24 22:13:33,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-24 22:13:54,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 22:14:11,172 INFO [train.py:996] (3/4) Epoch 11, batch 2600, loss[loss=0.2315, simple_loss=0.3292, pruned_loss=0.06684, over 21400.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3067, pruned_loss=0.08029, over 4261725.96 frames. 
], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:14:40,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1845336.0, ans=0.02 2023-06-24 22:14:43,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-24 22:15:15,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-24 22:15:21,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1845456.0, ans=0.125 2023-06-24 22:16:00,037 INFO [train.py:996] (3/4) Epoch 11, batch 2650, loss[loss=0.1883, simple_loss=0.2494, pruned_loss=0.06359, over 21173.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3097, pruned_loss=0.08191, over 4259085.98 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:16:00,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845576.0, ans=0.1 2023-06-24 22:16:01,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.066e+03 1.667e+03 2.223e+03 5.089e+03, threshold=3.334e+03, percent-clipped=18.0 2023-06-24 22:16:22,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1845636.0, ans=0.125 2023-06-24 22:16:44,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1845636.0, ans=0.125 2023-06-24 22:17:13,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1845756.0, ans=0.125 2023-06-24 22:17:39,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845816.0, ans=0.1 2023-06-24 22:17:46,125 INFO [train.py:996] (3/4) Epoch 11, batch 2700, loss[loss=0.2247, simple_loss=0.3118, pruned_loss=0.06876, over 21787.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3088, pruned_loss=0.08094, over 4257623.91 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:17:48,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845876.0, ans=0.1 2023-06-24 22:17:54,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-24 22:19:05,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-24 22:19:33,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1846116.0, ans=0.125 2023-06-24 22:19:36,535 INFO [train.py:996] (3/4) Epoch 11, batch 2750, loss[loss=0.224, simple_loss=0.2998, pruned_loss=0.07408, over 21503.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3076, pruned_loss=0.08086, over 4269009.29 frames. 
], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:19:38,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.314e+02 7.487e+02 1.146e+03 1.660e+03 3.901e+03, threshold=2.292e+03, percent-clipped=2.0 2023-06-24 22:20:12,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1846236.0, ans=0.125 2023-06-24 22:20:21,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1846296.0, ans=0.2 2023-06-24 22:20:24,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1846296.0, ans=0.09899494936611666 2023-06-24 22:21:17,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-24 22:21:19,844 INFO [train.py:996] (3/4) Epoch 11, batch 2800, loss[loss=0.2596, simple_loss=0.35, pruned_loss=0.08457, over 21818.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3129, pruned_loss=0.08206, over 4274285.57 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:21:33,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1846476.0, ans=0.0 2023-06-24 22:21:53,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1846536.0, ans=0.0 2023-06-24 22:22:37,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1846656.0, ans=0.0 2023-06-24 22:23:10,949 INFO [train.py:996] (3/4) Epoch 11, batch 2850, loss[loss=0.168, simple_loss=0.222, pruned_loss=0.05698, over 21146.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3141, pruned_loss=0.08328, over 4274494.56 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:23:19,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.356e+02 9.385e+02 1.588e+03 2.448e+03 5.122e+03, threshold=3.175e+03, percent-clipped=28.0 2023-06-24 22:23:49,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1846836.0, ans=0.2 2023-06-24 22:24:31,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1846956.0, ans=0.1 2023-06-24 22:24:42,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1847016.0, ans=0.125 2023-06-24 22:24:42,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1847016.0, ans=0.125 2023-06-24 22:24:44,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1847016.0, ans=0.0 2023-06-24 22:24:55,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1847016.0, ans=0.125 2023-06-24 22:24:59,817 INFO [train.py:996] (3/4) Epoch 11, batch 2900, loss[loss=0.216, simple_loss=0.2844, pruned_loss=0.07374, over 21680.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3124, pruned_loss=0.08337, over 4272402.67 frames. 
], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:25:17,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1847076.0, ans=0.125 2023-06-24 22:26:48,156 INFO [train.py:996] (3/4) Epoch 11, batch 2950, loss[loss=0.2311, simple_loss=0.3181, pruned_loss=0.07208, over 21457.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3115, pruned_loss=0.0826, over 4278960.93 frames. ], batch size: 194, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:26:51,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.849e+02 1.003e+03 1.596e+03 3.041e+03, threshold=2.006e+03, percent-clipped=1.0 2023-06-24 22:27:26,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1847436.0, ans=0.0 2023-06-24 22:27:53,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847496.0, ans=0.1 2023-06-24 22:28:11,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847556.0, ans=0.1 2023-06-24 22:28:15,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1847616.0, ans=0.025 2023-06-24 22:28:34,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847616.0, ans=0.1 2023-06-24 22:28:39,686 INFO [train.py:996] (3/4) Epoch 11, batch 3000, loss[loss=0.2897, simple_loss=0.3609, pruned_loss=0.1093, over 21424.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3155, pruned_loss=0.08187, over 4273069.88 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:28:39,687 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 22:29:02,932 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2533, simple_loss=0.3467, pruned_loss=0.07995, over 1796401.00 frames. 2023-06-24 22:29:02,933 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-24 22:29:03,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1847676.0, ans=0.035 2023-06-24 22:29:41,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1847796.0, ans=10.0 2023-06-24 22:30:03,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1847856.0, ans=0.0 2023-06-24 22:30:50,700 INFO [train.py:996] (3/4) Epoch 11, batch 3050, loss[loss=0.2011, simple_loss=0.3021, pruned_loss=0.05006, over 21723.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3165, pruned_loss=0.08117, over 4275806.74 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:30:56,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 9.249e+02 1.451e+03 2.091e+03 4.098e+03, threshold=2.902e+03, percent-clipped=32.0 2023-06-24 22:31:45,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-24 22:32:00,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.68 vs. 
limit=5.0 2023-06-24 22:32:39,979 INFO [train.py:996] (3/4) Epoch 11, batch 3100, loss[loss=0.2187, simple_loss=0.2967, pruned_loss=0.07038, over 21518.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3156, pruned_loss=0.08035, over 4279922.39 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:33:19,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848396.0, ans=0.1 2023-06-24 22:33:21,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1848396.0, ans=0.2 2023-06-24 22:33:44,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848396.0, ans=0.1 2023-06-24 22:33:45,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.06 vs. limit=10.0 2023-06-24 22:34:30,823 INFO [train.py:996] (3/4) Epoch 11, batch 3150, loss[loss=0.2887, simple_loss=0.3648, pruned_loss=0.1063, over 21630.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3158, pruned_loss=0.07973, over 4281395.34 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:34:38,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1848576.0, ans=0.125 2023-06-24 22:34:41,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.651e+02 8.004e+02 1.417e+03 1.894e+03 2.816e+03, threshold=2.834e+03, percent-clipped=0.0 2023-06-24 22:34:46,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1848576.0, ans=0.07 2023-06-24 22:35:34,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1848696.0, ans=10.0 2023-06-24 22:36:20,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1848816.0, ans=0.125 2023-06-24 22:36:27,210 INFO [train.py:996] (3/4) Epoch 11, batch 3200, loss[loss=0.2139, simple_loss=0.3007, pruned_loss=0.06353, over 21804.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3177, pruned_loss=0.08031, over 4280590.71 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:36:43,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1848936.0, ans=0.0 2023-06-24 22:36:54,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1848936.0, ans=0.125 2023-06-24 22:37:11,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1848996.0, ans=10.0 2023-06-24 22:37:40,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1849056.0, ans=0.125 2023-06-24 22:38:15,090 INFO [train.py:996] (3/4) Epoch 11, batch 3250, loss[loss=0.2393, simple_loss=0.3152, pruned_loss=0.08173, over 16117.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3203, pruned_loss=0.0828, over 4272674.91 frames. 
], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:38:20,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 9.282e+02 1.304e+03 1.953e+03 5.530e+03, threshold=2.608e+03, percent-clipped=11.0 2023-06-24 22:38:25,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849176.0, ans=0.1 2023-06-24 22:39:00,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1849296.0, ans=0.2 2023-06-24 22:39:41,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1849356.0, ans=0.125 2023-06-24 22:40:04,402 INFO [train.py:996] (3/4) Epoch 11, batch 3300, loss[loss=0.2616, simple_loss=0.35, pruned_loss=0.08664, over 21415.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3186, pruned_loss=0.08269, over 4266539.31 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:40:19,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1849476.0, ans=0.125 2023-06-24 22:41:25,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1849656.0, ans=0.0 2023-06-24 22:41:37,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1849716.0, ans=0.0 2023-06-24 22:41:41,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849716.0, ans=0.1 2023-06-24 22:41:49,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849716.0, ans=0.1 2023-06-24 22:41:54,785 INFO [train.py:996] (3/4) Epoch 11, batch 3350, loss[loss=0.2478, simple_loss=0.3134, pruned_loss=0.09111, over 21641.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3195, pruned_loss=0.08201, over 4271463.35 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:42:01,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.009e+02 8.020e+02 1.184e+03 1.979e+03 5.260e+03, threshold=2.368e+03, percent-clipped=15.0 2023-06-24 22:43:50,813 INFO [train.py:996] (3/4) Epoch 11, batch 3400, loss[loss=0.2084, simple_loss=0.2927, pruned_loss=0.06209, over 21534.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3201, pruned_loss=0.08339, over 4279975.35 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:43:58,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-24 22:44:28,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1850136.0, ans=0.125 2023-06-24 22:44:40,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1850196.0, ans=0.125 2023-06-24 22:45:40,227 INFO [train.py:996] (3/4) Epoch 11, batch 3450, loss[loss=0.2376, simple_loss=0.2984, pruned_loss=0.08837, over 21598.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3141, pruned_loss=0.08183, over 4275032.05 frames. 
], batch size: 393, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:45:52,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.862e+02 7.621e+02 1.155e+03 1.643e+03 3.444e+03, threshold=2.310e+03, percent-clipped=7.0 2023-06-24 22:45:55,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1850376.0, ans=0.1 2023-06-24 22:46:20,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-24 22:47:35,902 INFO [train.py:996] (3/4) Epoch 11, batch 3500, loss[loss=0.2303, simple_loss=0.2915, pruned_loss=0.08451, over 21497.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3203, pruned_loss=0.08477, over 4269917.51 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:47:37,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1850676.0, ans=0.2 2023-06-24 22:48:10,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-24 22:48:22,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1850796.0, ans=0.125 2023-06-24 22:48:53,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1850856.0, ans=0.0 2023-06-24 22:49:32,537 INFO [train.py:996] (3/4) Epoch 11, batch 3550, loss[loss=0.2654, simple_loss=0.3277, pruned_loss=0.1015, over 21327.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3227, pruned_loss=0.08639, over 4273567.53 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:49:34,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1850976.0, ans=0.0 2023-06-24 22:49:39,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.955e+02 9.872e+02 1.548e+03 2.414e+03 6.693e+03, threshold=3.097e+03, percent-clipped=26.0 2023-06-24 22:49:41,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1850976.0, ans=0.125 2023-06-24 22:49:58,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1851036.0, ans=0.125 2023-06-24 22:50:20,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1851096.0, ans=0.125 2023-06-24 22:50:20,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1851096.0, ans=0.125 2023-06-24 22:50:21,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:51:21,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1851276.0, ans=0.2 2023-06-24 22:51:22,529 INFO [train.py:996] (3/4) Epoch 11, batch 3600, loss[loss=0.2386, simple_loss=0.2999, pruned_loss=0.08865, over 21788.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3181, pruned_loss=0.08643, over 4272481.71 frames. 
], batch size: 98, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:51:33,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1851276.0, ans=0.125 2023-06-24 22:52:31,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 22:53:11,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1851516.0, ans=0.125 2023-06-24 22:53:14,647 INFO [train.py:996] (3/4) Epoch 11, batch 3650, loss[loss=0.2227, simple_loss=0.2968, pruned_loss=0.07429, over 21609.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3177, pruned_loss=0.08593, over 4277904.24 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:53:21,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.978e+02 1.076e+03 1.568e+03 3.181e+03, threshold=2.152e+03, percent-clipped=1.0 2023-06-24 22:53:47,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-24 22:54:01,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-24 22:54:07,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1851696.0, ans=0.125 2023-06-24 22:54:25,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1851756.0, ans=0.125 2023-06-24 22:54:38,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-24 22:54:55,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1851816.0, ans=0.125 2023-06-24 22:55:01,468 INFO [train.py:996] (3/4) Epoch 11, batch 3700, loss[loss=0.2391, simple_loss=0.3182, pruned_loss=0.08, over 21359.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3151, pruned_loss=0.08422, over 4278346.32 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:55:36,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-24 22:55:39,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1851996.0, ans=0.0 2023-06-24 22:56:55,505 INFO [train.py:996] (3/4) Epoch 11, batch 3750, loss[loss=0.1998, simple_loss=0.2659, pruned_loss=0.06683, over 21249.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3146, pruned_loss=0.08451, over 4282531.08 frames. 
], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:57:02,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 7.373e+02 1.096e+03 1.771e+03 3.259e+03, threshold=2.192e+03, percent-clipped=16.0 2023-06-24 22:57:12,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1852236.0, ans=0.125 2023-06-24 22:57:40,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-24 22:58:44,866 INFO [train.py:996] (3/4) Epoch 11, batch 3800, loss[loss=0.2175, simple_loss=0.2899, pruned_loss=0.07253, over 21693.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3118, pruned_loss=0.08264, over 4283801.69 frames. ], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:59:33,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1852596.0, ans=0.2 2023-06-24 22:59:40,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1852596.0, ans=0.0 2023-06-24 22:59:48,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-24 23:00:32,132 INFO [train.py:996] (3/4) Epoch 11, batch 3850, loss[loss=0.2082, simple_loss=0.2704, pruned_loss=0.07302, over 21657.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.309, pruned_loss=0.08267, over 4273856.51 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:00:39,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.734e+02 8.310e+02 1.331e+03 1.906e+03 3.711e+03, threshold=2.662e+03, percent-clipped=19.0 2023-06-24 23:00:41,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1852776.0, ans=0.125 2023-06-24 23:00:49,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1852836.0, ans=0.125 2023-06-24 23:01:19,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1852896.0, ans=0.125 2023-06-24 23:02:19,600 INFO [train.py:996] (3/4) Epoch 11, batch 3900, loss[loss=0.2217, simple_loss=0.2865, pruned_loss=0.07848, over 21842.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3051, pruned_loss=0.08285, over 4271560.27 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:02:39,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1853136.0, ans=0.2 2023-06-24 23:03:32,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1853256.0, ans=0.0 2023-06-24 23:03:35,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1853256.0, ans=10.0 2023-06-24 23:04:02,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1853316.0, ans=0.0 2023-06-24 23:04:11,744 INFO [train.py:996] (3/4) Epoch 11, batch 3950, loss[loss=0.2037, simple_loss=0.2892, pruned_loss=0.05907, over 21406.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3059, pruned_loss=0.08145, over 4277391.21 frames. 
], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:04:18,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 6.470e+02 9.111e+02 1.353e+03 4.725e+03, threshold=1.822e+03, percent-clipped=4.0 2023-06-24 23:05:11,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1853496.0, ans=0.125 2023-06-24 23:05:50,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1853616.0, ans=10.0 2023-06-24 23:05:58,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1853616.0, ans=0.125 2023-06-24 23:06:01,582 INFO [train.py:996] (3/4) Epoch 11, batch 4000, loss[loss=0.2382, simple_loss=0.3048, pruned_loss=0.08582, over 21948.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3057, pruned_loss=0.07975, over 4274781.60 frames. ], batch size: 103, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:06:10,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853676.0, ans=0.1 2023-06-24 23:06:47,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1853796.0, ans=0.0 2023-06-24 23:07:49,285 INFO [train.py:996] (3/4) Epoch 11, batch 4050, loss[loss=0.2669, simple_loss=0.3442, pruned_loss=0.09479, over 21409.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3042, pruned_loss=0.07774, over 4267020.47 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:07:57,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.864e+02 8.238e+02 1.474e+03 2.566e+03 6.233e+03, threshold=2.948e+03, percent-clipped=38.0 2023-06-24 23:09:11,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. limit=10.0 2023-06-24 23:09:13,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1854156.0, ans=0.125 2023-06-24 23:09:13,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1854156.0, ans=0.0 2023-06-24 23:09:37,366 INFO [train.py:996] (3/4) Epoch 11, batch 4100, loss[loss=0.2504, simple_loss=0.3126, pruned_loss=0.09408, over 21905.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3045, pruned_loss=0.07847, over 4270371.71 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:09:41,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-24 23:10:02,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-24 23:10:56,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1854456.0, ans=0.125 2023-06-24 23:11:16,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1854516.0, ans=0.0 2023-06-24 23:11:27,833 INFO [train.py:996] (3/4) Epoch 11, batch 4150, loss[loss=0.1899, simple_loss=0.2824, pruned_loss=0.04869, over 21677.00 frames. 
], tot_loss[loss=0.2274, simple_loss=0.3042, pruned_loss=0.07533, over 4273544.64 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:11:44,220 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.391e+02 9.658e+02 1.367e+03 3.515e+03, threshold=1.932e+03, percent-clipped=2.0 2023-06-24 23:11:52,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-24 23:12:35,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1854696.0, ans=0.125 2023-06-24 23:12:49,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-24 23:13:08,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1854816.0, ans=0.2 2023-06-24 23:13:27,335 INFO [train.py:996] (3/4) Epoch 11, batch 4200, loss[loss=0.1974, simple_loss=0.2932, pruned_loss=0.0508, over 19914.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3044, pruned_loss=0.07423, over 4275143.22 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:14:10,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-24 23:14:12,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1854936.0, ans=0.2 2023-06-24 23:14:32,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1854996.0, ans=15.0 2023-06-24 23:15:24,073 INFO [train.py:996] (3/4) Epoch 11, batch 4250, loss[loss=0.2673, simple_loss=0.3421, pruned_loss=0.09622, over 21781.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3087, pruned_loss=0.07572, over 4276479.28 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:15:32,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 8.743e+02 1.334e+03 2.102e+03 4.812e+03, threshold=2.669e+03, percent-clipped=26.0 2023-06-24 23:15:53,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-24 23:16:06,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855236.0, ans=0.1 2023-06-24 23:17:15,269 INFO [train.py:996] (3/4) Epoch 11, batch 4300, loss[loss=0.2417, simple_loss=0.3092, pruned_loss=0.0871, over 21535.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.315, pruned_loss=0.07822, over 4276470.63 frames. 
], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:17:15,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1855476.0, ans=0.0 2023-06-24 23:17:41,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1855536.0, ans=0.125 2023-06-24 23:18:07,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855596.0, ans=0.1 2023-06-24 23:18:32,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855656.0, ans=0.125 2023-06-24 23:19:09,584 INFO [train.py:996] (3/4) Epoch 11, batch 4350, loss[loss=0.2198, simple_loss=0.2911, pruned_loss=0.07423, over 21768.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.315, pruned_loss=0.07782, over 4276363.27 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 23:19:25,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 8.181e+02 1.022e+03 1.627e+03 5.028e+03, threshold=2.045e+03, percent-clipped=6.0 2023-06-24 23:19:51,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1855896.0, ans=0.125 2023-06-24 23:20:54,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1856016.0, ans=0.0 2023-06-24 23:21:06,785 INFO [train.py:996] (3/4) Epoch 11, batch 4400, loss[loss=0.2221, simple_loss=0.3025, pruned_loss=0.07081, over 21356.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3111, pruned_loss=0.07692, over 4272238.98 frames. ], batch size: 160, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:21:08,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1856076.0, ans=0.0 2023-06-24 23:21:12,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856076.0, ans=0.1 2023-06-24 23:22:57,254 INFO [train.py:996] (3/4) Epoch 11, batch 4450, loss[loss=0.2524, simple_loss=0.3503, pruned_loss=0.07721, over 21832.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3184, pruned_loss=0.07846, over 4277023.58 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:23:07,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 9.671e+02 1.476e+03 2.549e+03 6.148e+03, threshold=2.952e+03, percent-clipped=35.0 2023-06-24 23:24:14,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1856556.0, ans=0.125 2023-06-24 23:24:47,344 INFO [train.py:996] (3/4) Epoch 11, batch 4500, loss[loss=0.2451, simple_loss=0.3321, pruned_loss=0.07907, over 21609.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3203, pruned_loss=0.0802, over 4285018.20 frames. 
], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:25:36,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1856796.0, ans=0.0 2023-06-24 23:25:38,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1856796.0, ans=0.125 2023-06-24 23:26:15,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-24 23:26:33,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1856976.0, ans=0.2 2023-06-24 23:26:34,490 INFO [train.py:996] (3/4) Epoch 11, batch 4550, loss[loss=0.2692, simple_loss=0.3863, pruned_loss=0.076, over 21234.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3227, pruned_loss=0.08051, over 4287534.75 frames. ], batch size: 549, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:26:44,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 1.037e+03 1.526e+03 2.248e+03 5.276e+03, threshold=3.053e+03, percent-clipped=11.0 2023-06-24 23:26:48,516 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:27:57,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1857156.0, ans=0.125 2023-06-24 23:28:17,289 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:28:23,405 INFO [train.py:996] (3/4) Epoch 11, batch 4600, loss[loss=0.2726, simple_loss=0.3681, pruned_loss=0.08857, over 17094.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3244, pruned_loss=0.08226, over 4286711.58 frames. ], batch size: 61, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:28:26,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-24 23:28:56,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857336.0, ans=0.1 2023-06-24 23:29:44,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1857456.0, ans=0.125 2023-06-24 23:30:12,147 INFO [train.py:996] (3/4) Epoch 11, batch 4650, loss[loss=0.206, simple_loss=0.2758, pruned_loss=0.06807, over 21744.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3188, pruned_loss=0.08041, over 4285763.09 frames. 
], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:30:29,042 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.698e+02 1.029e+03 1.673e+03 3.855e+03, threshold=2.058e+03, percent-clipped=3.0 2023-06-24 23:30:29,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1857576.0, ans=0.125 2023-06-24 23:30:31,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1857576.0, ans=0.125 2023-06-24 23:31:34,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1857756.0, ans=0.125 2023-06-24 23:31:53,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1857816.0, ans=0.07 2023-06-24 23:32:07,099 INFO [train.py:996] (3/4) Epoch 11, batch 4700, loss[loss=0.1853, simple_loss=0.2518, pruned_loss=0.05939, over 21591.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3115, pruned_loss=0.0783, over 4277030.29 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:33:12,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1857996.0, ans=0.0 2023-06-24 23:33:14,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1858056.0, ans=0.02 2023-06-24 23:33:32,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1858116.0, ans=0.125 2023-06-24 23:33:38,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1858116.0, ans=10.0 2023-06-24 23:33:48,518 INFO [train.py:996] (3/4) Epoch 11, batch 4750, loss[loss=0.2423, simple_loss=0.3015, pruned_loss=0.09159, over 21579.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3069, pruned_loss=0.0786, over 4283329.59 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:34:05,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.320e+02 1.239e+03 2.079e+03 4.364e+03, threshold=2.479e+03, percent-clipped=25.0 2023-06-24 23:34:24,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1858236.0, ans=0.125 2023-06-24 23:34:54,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1858296.0, ans=10.0 2023-06-24 23:35:20,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-06-24 23:35:42,499 INFO [train.py:996] (3/4) Epoch 11, batch 4800, loss[loss=0.2289, simple_loss=0.331, pruned_loss=0.06344, over 21786.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3067, pruned_loss=0.07952, over 4283844.33 frames. 
], batch size: 332, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:35:50,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1858476.0, ans=0.125 2023-06-24 23:36:50,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1858656.0, ans=0.0 2023-06-24 23:36:53,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1858656.0, ans=0.125 2023-06-24 23:36:53,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1858656.0, ans=0.1 2023-06-24 23:37:23,178 INFO [train.py:996] (3/4) Epoch 11, batch 4850, loss[loss=0.2095, simple_loss=0.2829, pruned_loss=0.06799, over 21640.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3068, pruned_loss=0.07911, over 4282447.78 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:37:40,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858776.0, ans=0.1 2023-06-24 23:37:41,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 1.130e+03 1.666e+03 2.337e+03 4.462e+03, threshold=3.333e+03, percent-clipped=23.0 2023-06-24 23:39:15,277 INFO [train.py:996] (3/4) Epoch 11, batch 4900, loss[loss=0.2273, simple_loss=0.299, pruned_loss=0.07777, over 21872.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3093, pruned_loss=0.07985, over 4288588.92 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:39:42,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-24 23:39:45,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859136.0, ans=0.1 2023-06-24 23:39:55,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-24 23:40:09,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-24 23:40:17,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1859196.0, ans=0.0 2023-06-24 23:41:05,220 INFO [train.py:996] (3/4) Epoch 11, batch 4950, loss[loss=0.2333, simple_loss=0.3312, pruned_loss=0.06769, over 21420.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.312, pruned_loss=0.07808, over 4275913.43 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:41:17,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-24 23:41:23,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.980e+02 1.108e+03 1.676e+03 3.345e+03, threshold=2.216e+03, percent-clipped=1.0 2023-06-24 23:42:14,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.68 vs. limit=22.5 2023-06-24 23:42:53,674 INFO [train.py:996] (3/4) Epoch 11, batch 5000, loss[loss=0.2221, simple_loss=0.3058, pruned_loss=0.06922, over 21854.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3108, pruned_loss=0.07464, over 4281901.31 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:43:17,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-24 23:43:29,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859736.0, ans=0.1 2023-06-24 23:43:31,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1859736.0, ans=0.07 2023-06-24 23:43:47,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1859796.0, ans=0.125 2023-06-24 23:44:22,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1859916.0, ans=0.125 2023-06-24 23:44:39,751 INFO [train.py:996] (3/4) Epoch 11, batch 5050, loss[loss=0.2686, simple_loss=0.3231, pruned_loss=0.107, over 21797.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3116, pruned_loss=0.07605, over 4288667.21 frames. ], batch size: 508, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:44:57,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.469e+02 1.066e+03 1.616e+03 3.471e+03, threshold=2.133e+03, percent-clipped=8.0 2023-06-24 23:46:26,227 INFO [train.py:996] (3/4) Epoch 11, batch 5100, loss[loss=0.1973, simple_loss=0.2801, pruned_loss=0.05726, over 21693.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3096, pruned_loss=0.07694, over 4294806.70 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:46:28,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1860276.0, ans=0.0 2023-06-24 23:47:07,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1860336.0, ans=0.125 2023-06-24 23:47:33,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 23:47:49,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860516.0, ans=0.1 2023-06-24 23:47:55,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1860516.0, ans=0.125 2023-06-24 23:48:13,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-24 23:48:21,808 INFO [train.py:996] (3/4) Epoch 11, batch 5150, loss[loss=0.2534, simple_loss=0.3153, pruned_loss=0.09576, over 21352.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3074, pruned_loss=0.07841, over 4296095.77 frames. 
], batch size: 144, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:48:34,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 7.764e+02 1.031e+03 1.609e+03 3.475e+03, threshold=2.061e+03, percent-clipped=12.0 2023-06-24 23:48:36,493 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:48:56,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1860636.0, ans=0.125 2023-06-24 23:49:06,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1860696.0, ans=0.2 2023-06-24 23:49:17,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1860696.0, ans=0.0 2023-06-24 23:49:23,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1860756.0, ans=0.125 2023-06-24 23:50:05,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1860816.0, ans=0.2 2023-06-24 23:50:08,741 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:50:11,762 INFO [train.py:996] (3/4) Epoch 11, batch 5200, loss[loss=0.1964, simple_loss=0.2755, pruned_loss=0.0587, over 19981.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3088, pruned_loss=0.07817, over 4288179.89 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-24 23:50:29,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1860876.0, ans=0.125 2023-06-24 23:51:04,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1860996.0, ans=0.0 2023-06-24 23:51:08,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1860996.0, ans=0.0 2023-06-24 23:51:51,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1861116.0, ans=0.125 2023-06-24 23:51:58,027 INFO [train.py:996] (3/4) Epoch 11, batch 5250, loss[loss=0.1924, simple_loss=0.2964, pruned_loss=0.04419, over 19879.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3109, pruned_loss=0.07665, over 4277988.80 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:52:18,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.518e+02 1.553e+03 2.129e+03 4.596e+03, threshold=3.106e+03, percent-clipped=26.0 2023-06-24 23:53:11,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1861356.0, ans=0.0 2023-06-24 23:53:38,134 INFO [train.py:996] (3/4) Epoch 11, batch 5300, loss[loss=0.2146, simple_loss=0.2705, pruned_loss=0.07938, over 20206.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3098, pruned_loss=0.07726, over 4280294.60 frames. 
], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:54:34,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1861596.0, ans=0.0 2023-06-24 23:54:34,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 23:55:22,250 INFO [train.py:996] (3/4) Epoch 11, batch 5350, loss[loss=0.2098, simple_loss=0.3094, pruned_loss=0.05513, over 19908.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3087, pruned_loss=0.07859, over 4284123.43 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:55:35,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 7.558e+02 1.125e+03 1.569e+03 2.899e+03, threshold=2.250e+03, percent-clipped=0.0 2023-06-24 23:56:29,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1861956.0, ans=0.09899494936611666 2023-06-24 23:56:49,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-24 23:57:01,680 INFO [train.py:996] (3/4) Epoch 11, batch 5400, loss[loss=0.2266, simple_loss=0.2925, pruned_loss=0.08034, over 21388.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3084, pruned_loss=0.07929, over 4282703.87 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:57:59,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1862196.0, ans=0.125 2023-06-24 23:58:31,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862316.0, ans=0.1 2023-06-24 23:58:48,743 INFO [train.py:996] (3/4) Epoch 11, batch 5450, loss[loss=0.2636, simple_loss=0.3705, pruned_loss=0.0783, over 21410.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3103, pruned_loss=0.07697, over 4276633.97 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:58:49,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-24 23:59:10,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.635e+02 1.460e+03 2.379e+03 5.903e+03, threshold=2.920e+03, percent-clipped=27.0 2023-06-24 23:59:26,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1862436.0, ans=0.125 2023-06-24 23:59:26,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1862436.0, ans=0.125 2023-06-25 00:00:42,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1862616.0, ans=0.125 2023-06-25 00:00:45,472 INFO [train.py:996] (3/4) Epoch 11, batch 5500, loss[loss=0.2765, simple_loss=0.3667, pruned_loss=0.09316, over 21690.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3147, pruned_loss=0.07446, over 4275674.82 frames. 
], batch size: 389, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:01:04,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1862736.0, ans=0.125 2023-06-25 00:02:33,321 INFO [train.py:996] (3/4) Epoch 11, batch 5550, loss[loss=0.1894, simple_loss=0.2779, pruned_loss=0.05044, over 21585.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3136, pruned_loss=0.07195, over 4272010.14 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:02:39,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1862976.0, ans=0.0 2023-06-25 00:02:48,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.684e+02 8.321e+02 1.311e+03 1.956e+03 3.720e+03, threshold=2.623e+03, percent-clipped=7.0 2023-06-25 00:04:04,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1863216.0, ans=0.0 2023-06-25 00:04:10,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-25 00:04:21,198 INFO [train.py:996] (3/4) Epoch 11, batch 5600, loss[loss=0.2633, simple_loss=0.3866, pruned_loss=0.06997, over 19696.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3124, pruned_loss=0.06909, over 4276097.08 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:05:00,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-25 00:05:40,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1863456.0, ans=0.0 2023-06-25 00:06:06,170 INFO [train.py:996] (3/4) Epoch 11, batch 5650, loss[loss=0.2431, simple_loss=0.3078, pruned_loss=0.08915, over 21509.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3173, pruned_loss=0.07258, over 4280866.26 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:06:32,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 8.541e+02 1.292e+03 2.009e+03 3.827e+03, threshold=2.583e+03, percent-clipped=13.0 2023-06-25 00:07:39,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1863816.0, ans=0.125 2023-06-25 00:07:57,645 INFO [train.py:996] (3/4) Epoch 11, batch 5700, loss[loss=0.2031, simple_loss=0.2776, pruned_loss=0.06428, over 21362.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3154, pruned_loss=0.07389, over 4289256.42 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:07:59,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1863876.0, ans=0.125 2023-06-25 00:09:00,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=15.0 2023-06-25 00:09:08,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1864056.0, ans=0.125 2023-06-25 00:09:53,480 INFO [train.py:996] (3/4) Epoch 11, batch 5750, loss[loss=0.1839, simple_loss=0.2647, pruned_loss=0.05158, over 21338.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3126, pruned_loss=0.07151, over 4287383.60 frames. 
], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:09:57,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1864176.0, ans=0.125 2023-06-25 00:10:08,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.527e+02 8.365e+02 1.283e+03 1.865e+03 4.523e+03, threshold=2.566e+03, percent-clipped=10.0 2023-06-25 00:10:41,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 00:10:42,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-06-25 00:10:43,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864296.0, ans=0.1 2023-06-25 00:11:14,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1864356.0, ans=0.125 2023-06-25 00:11:39,480 INFO [train.py:996] (3/4) Epoch 11, batch 5800, loss[loss=0.2082, simple_loss=0.3043, pruned_loss=0.05611, over 21589.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3134, pruned_loss=0.07035, over 4287430.47 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:11:57,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1864476.0, ans=0.125 2023-06-25 00:12:00,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-25 00:12:13,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1864536.0, ans=0.95 2023-06-25 00:13:05,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1864656.0, ans=0.125 2023-06-25 00:13:09,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0 2023-06-25 00:13:19,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864716.0, ans=0.1 2023-06-25 00:13:32,882 INFO [train.py:996] (3/4) Epoch 11, batch 5850, loss[loss=0.2062, simple_loss=0.2868, pruned_loss=0.06275, over 21205.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3097, pruned_loss=0.0667, over 4286911.32 frames. ], batch size: 608, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:13:53,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 6.927e+02 1.116e+03 1.995e+03 4.965e+03, threshold=2.231e+03, percent-clipped=19.0 2023-06-25 00:14:13,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1864896.0, ans=0.125 2023-06-25 00:14:14,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.35 vs. 
limit=22.5 2023-06-25 00:14:18,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1864896.0, ans=0.0 2023-06-25 00:14:42,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-25 00:14:55,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1865016.0, ans=0.07 2023-06-25 00:14:58,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1865016.0, ans=0.125 2023-06-25 00:15:17,032 INFO [train.py:996] (3/4) Epoch 11, batch 5900, loss[loss=0.1725, simple_loss=0.2585, pruned_loss=0.0432, over 21620.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.302, pruned_loss=0.06182, over 4285486.36 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:16:28,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1865256.0, ans=0.125 2023-06-25 00:16:37,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1865256.0, ans=0.125 2023-06-25 00:17:06,585 INFO [train.py:996] (3/4) Epoch 11, batch 5950, loss[loss=0.2188, simple_loss=0.2801, pruned_loss=0.07877, over 21637.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.3002, pruned_loss=0.06525, over 4292670.41 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:17:07,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1865376.0, ans=0.125 2023-06-25 00:17:20,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-25 00:17:21,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 6.600e+02 8.461e+02 1.275e+03 2.602e+03, threshold=1.692e+03, percent-clipped=3.0 2023-06-25 00:18:39,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1865616.0, ans=0.04949747468305833 2023-06-25 00:18:51,531 INFO [train.py:996] (3/4) Epoch 11, batch 6000, loss[loss=0.1826, simple_loss=0.2317, pruned_loss=0.06676, over 20039.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2955, pruned_loss=0.06831, over 4285353.72 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 00:18:51,531 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 00:19:08,580 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2642, simple_loss=0.3568, pruned_loss=0.08578, over 1796401.00 frames. 
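The ScheduledFloat records that dominate this log (name=..., batch_count=..., ans=...) report module hyper-parameters such as dropout probabilities, skip rates and balancer limits that are annealed as a function of the global batch count. Below is a minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; it is a hypothetical reimplementation for illustration, not the actual class from icefall's scaling.py.

import bisect

class ScheduledFloat:
    # Sketch of a float-valued hyper-parameter scheduled on batch_count.
    # Assumption: the value is piecewise-linear between (batch_count, value)
    # breakpoints and constant outside them, consistent with the "ans=..."
    # values reported in the log.

    def __init__(self, *points):
        self.points = sorted(points)   # (batch_count, value) pairs
        self.batch_count = 0.0         # updated by the training loop

    def __float__(self):
        xs = [b for b, _ in self.points]
        ys = [v for _, v in self.points]
        b = self.batch_count
        if b <= xs[0]:
            return float(ys[0])
        if b >= xs[-1]:
            return float(ys[-1])
        i = bisect.bisect_right(xs, b) - 1
        # Linear interpolation between the two surrounding breakpoints.
        frac = (b - xs[i]) / (xs[i + 1] - xs[i])
        return float(ys[i] + frac * (ys[i + 1] - ys[i]))

# Example (hypothetical breakpoints): a dropout schedule that decayed from
# 0.3 to 0.1 by batch 20000 and has long since reached its final value
# would be logged as "...dropout_p, batch_count=1865676.0, ans=0.1".
dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
dropout_p.batch_count = 1865676.0
print(float(dropout_p))  # -> 0.1

With these example breakpoints, any batch_count past 20000.0 reports the final value, matching the constant ans=0.1 logged for the feed-forward dropout entries in this stretch of training.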
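The Whitening records (metric=X vs. limit=Y) track how far a module's activations are from having a white, i.e. identity-proportional, covariance; a penalty applies only when the metric exceeds the (itself scheduled) limit. A plausible form of the metric, stated as an assumption rather than a transcription of icefall's scaling.py: for each channel group with feature covariance C over n channels, report n * trace(C @ C) / trace(C)**2, which is at least 1.0 with equality exactly when C is proportional to the identity (Cauchy-Schwarz over the eigenvalues of C).

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    # Assumed whitening metric: 1.0 for perfectly white features, larger
    # as the eigenvalue spectrum of the covariance becomes more uneven.
    # x: (num_frames, num_channels); channels are split into num_groups
    # equal groups and the metric is averaged over groups.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    cpg = num_channels // num_groups                             # channels per group
    xg = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)  # (G, T, cpg)
    covar = torch.matmul(xg.transpose(1, 2), xg) / num_frames    # (G, cpg, cpg)
    trace_c = covar.diagonal(dim1=1, dim2=2).sum(dim=-1)         # trace(C)
    trace_c2 = (covar * covar).sum(dim=(1, 2))                   # trace(C @ C), C symmetric
    return (cpg * trace_c2 / trace_c.clamp(min=1e-20) ** 2).mean()

# Random (already white) features score close to 1.0, far below the logged
# limits of 6.0-22.5; strongly correlated features push the metric up.
x = torch.randn(2000, 256)
print(whitening_metric(x, num_groups=1).item())               # ~1.1
print(whitening_metric(x @ x.new_ones(256, 256), 1).item())   # rank-1 features: ~256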
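The [optim.py:471] records summarize recent per-batch gradient norms as five order statistics (min, 25%, median, 75%, max) together with the clipping threshold and the share of recently clipped batches. Throughout this log the threshold equals Clipping_scale times the logged median, e.g. 2.310e+03 = 2.0 * 1.155e+03 in the first such record of this stretch, so a reasonable sketch of the bookkeeping (a guess at the mechanism, not icefall's ScaledAdam internals) is:

from collections import deque

import numpy as np
import torch

class QuartileGradClipper:
    # Sketch: clip gradients at clipping_scale * median of recent norms,
    # and reproduce the shape of the "[optim.py:471] Clipping_scale=...,
    # grad-norm quartiles ..." records. Hypothetical bookkeeping only.

    def __init__(self, clipping_scale: float = 2.0, window: int = 400):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)    # recent per-batch gradient norms
        self.clipped = deque(maxlen=window)  # 1.0 if that batch was clipped

    def __call__(self, parameters) -> None:
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([g.detach().norm() for g in grads])).item()
        self.norms.append(norm)
        q = np.quantile(list(self.norms), [0.0, 0.25, 0.5, 0.75, 1.0])
        threshold = self.clipping_scale * q[2]   # 2x running median, as logged
        was_clipped = norm > threshold
        self.clipped.append(1.0 if was_clipped else 0.0)
        if was_clipped:
            for g in grads:
                g.mul_(threshold / norm)
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{v:.3e}" for v in q)
            + f", threshold={threshold:.3e}, "
            f"percent-clipped={100.0 * np.mean(self.clipped):.1f}"
        )

Such a clipper would be invoked once per optimizer step, e.g. clipper(model.parameters()) after loss.backward() and before the parameter update.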
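Each loss[...] record decomposes the pruned-transducer objective into a smoothed "simple" term and a "pruned" term, and the logged totals are consistent with loss = 0.5 * simple_loss + pruned_loss (for batch 3500 above: 0.5 * 0.2915 + 0.08451 = 0.2303). tot_loss[...] is then the frame-weighted running average over the epoch so far, which is why its "over N frames" count keeps growing. A small sketch of that bookkeeping, with hypothetical names modeled on the log format:

class MetricsTracker(dict):
    # Hypothetical frame-weighted accumulator behind the tot_loss records.

    def __add__(self, other: "MetricsTracker") -> "MetricsTracker":
        out = MetricsTracker(self)
        for k, v in other.items():
            out[k] = out.get(k, 0.0) + v
        return out

    def __str__(self) -> str:
        frames = self["frames"]
        body = ", ".join(
            f"{k}={v / frames:.4g}" for k, v in self.items() if k != "frames"
        )
        return f"[{body}, over {frames:.2f} frames. ]"

SIMPLE_LOSS_SCALE = 0.5  # implied by the logged relation above

def batch_metrics(simple_loss: float, pruned_loss: float, frames: float) -> MetricsTracker:
    loss = SIMPLE_LOSS_SCALE * simple_loss + pruned_loss
    # Store frame-weighted sums so that summing trackers across batches and
    # then printing yields the frame-weighted average, like tot_loss[...].
    return MetricsTracker(
        frames=frames,
        loss=loss * frames,
        simple_loss=simple_loss * frames,
        pruned_loss=pruned_loss * frames,
    )

tot = batch_metrics(0.2915, 0.08451, 21497.0)  # batch 3500's loss[...] record
print(tot)  # [loss=0.2303, simple_loss=0.2915, pruned_loss=0.08451, over 21497.00 frames. ]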
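Finally, the grad_scale field (8.0, 16.0 and 32.0 in the records above) is the fp16 loss-scaling factor: it is halved when a step overflows and doubled after a long enough run of clean steps, which is why it drifts between powers of two. A generic torch.cuda.amp sketch of that dynamic follows; icefall wraps its own scaler, so this is the stock PyTorch API rather than the exact code behind these lines.

import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.045)
# scaler.get_scale() is the quantity logged as grad_scale: halved on
# overflow (backoff_factor), doubled after growth_interval clean steps.
scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000
)

batches = [torch.randn(8, 80, device="cuda") for _ in range(3)]  # stand-in data
for batch in batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch).square().mean()  # stand-in for the transducer loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(scaler.get_scale())  # the value that appears as grad_scale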
2023-06-25 00:19:08,580 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 00:19:11,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1865676.0, ans=0.2 2023-06-25 00:19:15,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1865676.0, ans=0.05 2023-06-25 00:19:15,562 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:20:53,371 INFO [train.py:996] (3/4) Epoch 11, batch 6050, loss[loss=0.1652, simple_loss=0.2579, pruned_loss=0.03622, over 21698.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2915, pruned_loss=0.06971, over 4276651.28 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:21:18,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.947e+02 8.062e+02 1.043e+03 1.359e+03 2.248e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-25 00:21:18,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1866036.0, ans=0.125 2023-06-25 00:21:38,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1866096.0, ans=0.0 2023-06-25 00:21:41,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-25 00:22:20,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 00:22:39,208 INFO [train.py:996] (3/4) Epoch 11, batch 6100, loss[loss=0.2176, simple_loss=0.2949, pruned_loss=0.07013, over 21733.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2902, pruned_loss=0.06782, over 4277475.33 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:22:57,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1866276.0, ans=0.125 2023-06-25 00:23:19,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-25 00:23:30,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1866396.0, ans=0.05 2023-06-25 00:24:08,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-25 00:24:11,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1866516.0, ans=0.1 2023-06-25 00:24:22,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1866516.0, ans=0.2 2023-06-25 00:24:27,318 INFO [train.py:996] (3/4) Epoch 11, batch 6150, loss[loss=0.2173, simple_loss=0.2874, pruned_loss=0.07362, over 21982.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2936, pruned_loss=0.07056, over 4286456.71 frames. 
], batch size: 119, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:24:40,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1866576.0, ans=0.0 2023-06-25 00:24:47,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-25 00:24:58,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.679e+02 1.290e+03 1.928e+03 3.741e+03, threshold=2.581e+03, percent-clipped=18.0 2023-06-25 00:25:02,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866636.0, ans=0.0 2023-06-25 00:25:18,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1866696.0, ans=0.0 2023-06-25 00:25:26,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1866696.0, ans=0.125 2023-06-25 00:25:48,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1866756.0, ans=0.125 2023-06-25 00:26:13,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-25 00:26:19,936 INFO [train.py:996] (3/4) Epoch 11, batch 6200, loss[loss=0.2708, simple_loss=0.3537, pruned_loss=0.09396, over 21824.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2984, pruned_loss=0.07112, over 4280847.66 frames. ], batch size: 415, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:26:41,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1866936.0, ans=0.125 2023-06-25 00:26:44,576 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:28:06,347 INFO [train.py:996] (3/4) Epoch 11, batch 6250, loss[loss=0.2441, simple_loss=0.3571, pruned_loss=0.06557, over 21227.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3048, pruned_loss=0.07157, over 4279179.35 frames. 
], batch size: 548, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:28:22,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1867176.0, ans=0.125 2023-06-25 00:28:31,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.849e+02 8.847e+02 1.490e+03 2.226e+03 5.467e+03, threshold=2.981e+03, percent-clipped=18.0 2023-06-25 00:28:50,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1867296.0, ans=0.125 2023-06-25 00:28:53,759 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:29:03,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1867296.0, ans=0.0 2023-06-25 00:29:30,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1867416.0, ans=0.125 2023-06-25 00:29:33,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1867416.0, ans=0.125 2023-06-25 00:29:49,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867416.0, ans=0.1 2023-06-25 00:29:51,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1867476.0, ans=0.125 2023-06-25 00:29:52,768 INFO [train.py:996] (3/4) Epoch 11, batch 6300, loss[loss=0.2314, simple_loss=0.2986, pruned_loss=0.0821, over 21212.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.308, pruned_loss=0.07058, over 4277559.27 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:30:26,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1867536.0, ans=0.125 2023-06-25 00:30:26,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1867536.0, ans=0.125 2023-06-25 00:30:47,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1867596.0, ans=0.0 2023-06-25 00:30:50,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1867596.0, ans=0.125 2023-06-25 00:31:06,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1867656.0, ans=0.125 2023-06-25 00:31:44,199 INFO [train.py:996] (3/4) Epoch 11, batch 6350, loss[loss=0.2773, simple_loss=0.3506, pruned_loss=0.102, over 21837.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3103, pruned_loss=0.07506, over 4279102.56 frames. 
], batch size: 118, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:31:48,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1867776.0, ans=0.5 2023-06-25 00:32:08,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.095e+02 6.705e+02 8.360e+02 1.250e+03 2.332e+03, threshold=1.672e+03, percent-clipped=0.0 2023-06-25 00:32:09,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1867836.0, ans=0.0 2023-06-25 00:32:11,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1867836.0, ans=0.0 2023-06-25 00:32:18,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=1867836.0, ans=8.0 2023-06-25 00:32:31,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-25 00:33:05,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1867956.0, ans=0.125 2023-06-25 00:33:09,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-25 00:33:37,562 INFO [train.py:996] (3/4) Epoch 11, batch 6400, loss[loss=0.2829, simple_loss=0.3542, pruned_loss=0.1058, over 21425.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3154, pruned_loss=0.07903, over 4278892.65 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:33:47,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1868076.0, ans=0.125 2023-06-25 00:33:50,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.82 vs. limit=15.0 2023-06-25 00:34:15,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1868136.0, ans=0.125 2023-06-25 00:34:44,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1868256.0, ans=0.2 2023-06-25 00:35:00,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1868256.0, ans=0.125 2023-06-25 00:35:26,294 INFO [train.py:996] (3/4) Epoch 11, batch 6450, loss[loss=0.1792, simple_loss=0.2624, pruned_loss=0.04796, over 21222.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3175, pruned_loss=0.07799, over 4277343.15 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:35:51,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 9.176e+02 1.134e+03 1.706e+03 4.418e+03, threshold=2.268e+03, percent-clipped=27.0 2023-06-25 00:35:58,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-25 00:37:13,873 INFO [train.py:996] (3/4) Epoch 11, batch 6500, loss[loss=0.2077, simple_loss=0.2691, pruned_loss=0.07318, over 21550.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3113, pruned_loss=0.07711, over 4276470.17 frames. 
], batch size: 213, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:37:26,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1868676.0, ans=0.2 2023-06-25 00:37:59,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1868796.0, ans=0.125 2023-06-25 00:38:30,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-25 00:38:59,852 INFO [train.py:996] (3/4) Epoch 11, batch 6550, loss[loss=0.2218, simple_loss=0.2919, pruned_loss=0.07588, over 21428.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3115, pruned_loss=0.07628, over 4280005.70 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:39:24,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.229e+02 1.425e+03 2.181e+03 3.625e+03, threshold=2.850e+03, percent-clipped=21.0 2023-06-25 00:39:41,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1869036.0, ans=0.1 2023-06-25 00:40:25,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1869156.0, ans=0.2 2023-06-25 00:40:37,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1869216.0, ans=0.125 2023-06-25 00:40:47,109 INFO [train.py:996] (3/4) Epoch 11, batch 6600, loss[loss=0.1813, simple_loss=0.2445, pruned_loss=0.0591, over 21209.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3059, pruned_loss=0.07548, over 4278977.56 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:42:21,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1869516.0, ans=0.125 2023-06-25 00:42:33,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1869516.0, ans=0.125 2023-06-25 00:42:36,117 INFO [train.py:996] (3/4) Epoch 11, batch 6650, loss[loss=0.1877, simple_loss=0.2476, pruned_loss=0.06389, over 21834.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2971, pruned_loss=0.07274, over 4274422.74 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:43:02,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. 
limit=15.0 2023-06-25 00:43:06,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 5.753e+02 7.174e+02 1.040e+03 2.181e+03, threshold=1.435e+03, percent-clipped=0.0 2023-06-25 00:43:31,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1869696.0, ans=0.125 2023-06-25 00:43:35,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1869696.0, ans=0.0 2023-06-25 00:43:35,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1869696.0, ans=0.0 2023-06-25 00:43:44,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1869756.0, ans=0.0 2023-06-25 00:44:15,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1869816.0, ans=0.0 2023-06-25 00:44:32,436 INFO [train.py:996] (3/4) Epoch 11, batch 6700, loss[loss=0.2487, simple_loss=0.3148, pruned_loss=0.09129, over 21474.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2925, pruned_loss=0.07338, over 4275391.69 frames. ], batch size: 509, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:44:52,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1869876.0, ans=0.125 2023-06-25 00:45:31,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1869996.0, ans=0.0 2023-06-25 00:46:08,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1870116.0, ans=0.0 2023-06-25 00:46:14,642 INFO [train.py:996] (3/4) Epoch 11, batch 6750, loss[loss=0.206, simple_loss=0.2724, pruned_loss=0.06983, over 21797.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2909, pruned_loss=0.07375, over 4266631.08 frames. ], batch size: 118, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:46:27,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1870176.0, ans=0.125 2023-06-25 00:46:30,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1870176.0, ans=0.07 2023-06-25 00:46:43,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1870236.0, ans=0.035 2023-06-25 00:46:46,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.208e+02 1.148e+03 1.600e+03 3.333e+03, threshold=2.296e+03, percent-clipped=33.0 2023-06-25 00:46:48,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1870236.0, ans=0.0 2023-06-25 00:47:41,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870416.0, ans=0.1 2023-06-25 00:47:59,240 INFO [train.py:996] (3/4) Epoch 11, batch 6800, loss[loss=0.2279, simple_loss=0.2919, pruned_loss=0.08192, over 21791.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2925, pruned_loss=0.07556, over 4261055.63 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:48:27,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=15.0 2023-06-25 00:48:56,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1870596.0, ans=0.0 2023-06-25 00:49:14,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1870656.0, ans=22.5 2023-06-25 00:49:37,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-25 00:49:44,480 INFO [train.py:996] (3/4) Epoch 11, batch 6850, loss[loss=0.2093, simple_loss=0.2677, pruned_loss=0.07547, over 21557.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2915, pruned_loss=0.07647, over 4267222.82 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:50:16,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 8.303e+02 1.235e+03 2.153e+03 3.729e+03, threshold=2.471e+03, percent-clipped=22.0 2023-06-25 00:50:16,609 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:50:36,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1870896.0, ans=0.0 2023-06-25 00:50:55,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1870956.0, ans=0.2 2023-06-25 00:50:59,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1870956.0, ans=0.125 2023-06-25 00:51:06,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1870956.0, ans=0.1 2023-06-25 00:51:31,180 INFO [train.py:996] (3/4) Epoch 11, batch 6900, loss[loss=0.2663, simple_loss=0.3384, pruned_loss=0.09716, over 21649.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2943, pruned_loss=0.07725, over 4267312.62 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:51:35,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-25 00:52:32,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1871196.0, ans=0.0 2023-06-25 00:52:37,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1871196.0, ans=0.125 2023-06-25 00:52:38,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1871196.0, ans=0.125 2023-06-25 00:53:27,048 INFO [train.py:996] (3/4) Epoch 11, batch 6950, loss[loss=0.303, simple_loss=0.3618, pruned_loss=0.1221, over 21305.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2974, pruned_loss=0.07492, over 4271661.07 frames. 
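
The train.py:996 records above report three numbers per batch: loss, simple_loss and pruned_loss from the pruned RNN-T objective. After the 2000-batch warm-up the logged loss is simple_loss_scale * simple_loss + pruned_loss with simple_loss_scale = 0.5, which the records confirm (e.g. 0.5 * 0.2915 + 0.07647 = 0.2222 for batch 6850). A minimal sketch of that combination; the warm-up ramp shape below is an assumption based on the usual icefall recipe, not something this log states:

    def combine_losses(simple_loss: float, pruned_loss: float,
                       batch_idx_train: int, warm_step: int = 2000,
                       simple_loss_scale: float = 0.5) -> float:
        """Combine the two pruned-RNN-T losses the way the log suggests."""
        if batch_idx_train >= warm_step:
            s, p = simple_loss_scale, 1.0
        else:
            # assumed ramp: lean on simple_loss early, pruned_loss later
            frac = batch_idx_train / warm_step
            s = 1.0 - frac * (1.0 - simple_loss_scale)
            p = 0.1 + 0.9 * frac
        return s * simple_loss + p * pruned_loss

    # reproduces the batch-6850 record: 0.5 * 0.2915 + 0.07647 ~= 0.2222
    print(combine_losses(0.2915, 0.07647, batch_idx_train=10_000))
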
], batch size: 507, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:53:45,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1871376.0, ans=0.0 2023-06-25 00:53:53,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 7.235e+02 1.015e+03 1.522e+03 6.325e+03, threshold=2.030e+03, percent-clipped=9.0 2023-06-25 00:54:09,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1871436.0, ans=0.125 2023-06-25 00:54:17,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1871496.0, ans=0.0 2023-06-25 00:54:22,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1871496.0, ans=0.125 2023-06-25 00:54:34,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1871556.0, ans=0.0 2023-06-25 00:55:15,832 INFO [train.py:996] (3/4) Epoch 11, batch 7000, loss[loss=0.2197, simple_loss=0.2915, pruned_loss=0.07399, over 21353.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2998, pruned_loss=0.07639, over 4274196.72 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:55:16,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1871676.0, ans=0.125 2023-06-25 00:55:35,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1871676.0, ans=0.125 2023-06-25 00:55:40,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1871736.0, ans=0.125 2023-06-25 00:56:05,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-25 00:56:49,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1871916.0, ans=0.1 2023-06-25 00:57:10,143 INFO [train.py:996] (3/4) Epoch 11, batch 7050, loss[loss=0.2189, simple_loss=0.3194, pruned_loss=0.05923, over 21270.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2967, pruned_loss=0.07488, over 4268717.91 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:57:37,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 8.822e+02 1.310e+03 1.745e+03 4.662e+03, threshold=2.619e+03, percent-clipped=19.0 2023-06-25 00:58:11,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1872096.0, ans=0.1 2023-06-25 00:58:20,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=8.0 2023-06-25 00:58:35,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1872156.0, ans=0.0 2023-06-25 00:59:02,532 INFO [train.py:996] (3/4) Epoch 11, batch 7100, loss[loss=0.1736, simple_loss=0.2484, pruned_loss=0.04937, over 21475.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3016, pruned_loss=0.07664, over 4272180.48 frames. 
], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:59:16,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1872276.0, ans=0.0 2023-06-25 01:00:07,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1872456.0, ans=0.125 2023-06-25 01:00:12,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872456.0, ans=0.1 2023-06-25 01:00:27,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1872456.0, ans=0.125 2023-06-25 01:00:51,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1872576.0, ans=0.125 2023-06-25 01:00:53,312 INFO [train.py:996] (3/4) Epoch 11, batch 7150, loss[loss=0.2502, simple_loss=0.3238, pruned_loss=0.08833, over 21978.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2988, pruned_loss=0.07461, over 4268634.98 frames. ], batch size: 317, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:01:10,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 01:01:25,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 7.662e+02 1.147e+03 1.671e+03 2.803e+03, threshold=2.294e+03, percent-clipped=2.0 2023-06-25 01:02:03,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1872756.0, ans=0.2 2023-06-25 01:02:21,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1872756.0, ans=0.0 2023-06-25 01:02:38,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1872816.0, ans=0.2 2023-06-25 01:02:51,307 INFO [train.py:996] (3/4) Epoch 11, batch 7200, loss[loss=0.2195, simple_loss=0.2811, pruned_loss=0.07898, over 21225.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.301, pruned_loss=0.07629, over 4266777.02 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 01:03:01,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1872876.0, ans=0.035 2023-06-25 01:03:30,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1872996.0, ans=0.0 2023-06-25 01:03:31,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1872996.0, ans=0.125 2023-06-25 01:03:49,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-25 01:03:53,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1873056.0, ans=0.125 2023-06-25 01:04:25,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873116.0, ans=0.1 2023-06-25 01:04:40,393 INFO [train.py:996] (3/4) Epoch 11, batch 7250, loss[loss=0.1973, simple_loss=0.2609, pruned_loss=0.06681, over 21525.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2969, pruned_loss=0.07592, over 4256800.14 frames. 
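
The optim.py:471 lines print five order statistics (min, first quartile, median, third quartile, max) of recently observed gradient norms; the clipping threshold is Clipping_scale times the median, which every record here satisfies (e.g. 2.0 * 1.147e+03 = 2.294e+03 just above), and percent-clipped is the share of recent norms that exceeded the threshold. A sketch of the diagnostic; the window size fed to it is an illustrative assumption:

    import numpy as np

    def clipping_report(grad_norms, clipping_scale=2.0):
        # five-number summary of the recent grad-norm window
        q = np.quantile(grad_norms, [0.0, 0.25, 0.5, 0.75, 1.0])
        threshold = clipping_scale * q[2]          # 2x the median
        pct = 100.0 * np.mean(np.asarray(grad_norms) > threshold)
        print("Clipping_scale=%.1f, grad-norm quartiles %s, "
              "threshold=%.3e, percent-clipped=%.1f"
              % (clipping_scale, " ".join("%.3e" % v for v in q),
                 threshold, pct))

    clipping_report(np.random.lognormal(mean=7.0, sigma=0.5, size=128))
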
], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:05:06,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.525e+02 1.021e+03 1.447e+03 2.035e+03 4.041e+03, threshold=2.893e+03, percent-clipped=18.0 2023-06-25 01:05:18,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1873296.0, ans=0.125 2023-06-25 01:05:18,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1873296.0, ans=0.0 2023-06-25 01:05:35,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1873296.0, ans=0.125 2023-06-25 01:05:52,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873356.0, ans=0.1 2023-06-25 01:06:00,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1873356.0, ans=0.0 2023-06-25 01:06:27,155 INFO [train.py:996] (3/4) Epoch 11, batch 7300, loss[loss=0.2112, simple_loss=0.2743, pruned_loss=0.074, over 21850.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2913, pruned_loss=0.07498, over 4267206.92 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:07:18,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1873596.0, ans=0.0 2023-06-25 01:07:35,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1873656.0, ans=0.0 2023-06-25 01:08:01,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873716.0, ans=0.1 2023-06-25 01:08:16,328 INFO [train.py:996] (3/4) Epoch 11, batch 7350, loss[loss=0.2756, simple_loss=0.3458, pruned_loss=0.1028, over 21495.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2911, pruned_loss=0.07609, over 4267302.93 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:08:16,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1873776.0, ans=0.0 2023-06-25 01:08:35,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873776.0, ans=0.1 2023-06-25 01:08:43,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.949e+02 8.143e+02 1.181e+03 1.694e+03 4.027e+03, threshold=2.361e+03, percent-clipped=4.0 2023-06-25 01:08:43,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1873836.0, ans=0.2 2023-06-25 01:09:01,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1873896.0, ans=0.125 2023-06-25 01:09:18,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1873896.0, ans=0.0 2023-06-25 01:10:11,694 INFO [train.py:996] (3/4) Epoch 11, batch 7400, loss[loss=0.2466, simple_loss=0.3413, pruned_loss=0.07595, over 21838.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2964, pruned_loss=0.07793, over 4267781.77 frames. 
], batch size: 372, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:11:10,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1874196.0, ans=0.125 2023-06-25 01:11:14,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1874196.0, ans=0.125 2023-06-25 01:11:28,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1874256.0, ans=0.125 2023-06-25 01:11:58,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1874316.0, ans=0.0 2023-06-25 01:12:03,184 INFO [train.py:996] (3/4) Epoch 11, batch 7450, loss[loss=0.1683, simple_loss=0.2174, pruned_loss=0.0596, over 20076.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2951, pruned_loss=0.07756, over 4272918.17 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:12:09,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-25 01:12:33,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 7.768e+02 1.010e+03 1.629e+03 4.953e+03, threshold=2.020e+03, percent-clipped=6.0 2023-06-25 01:12:58,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874496.0, ans=0.1 2023-06-25 01:13:54,434 INFO [train.py:996] (3/4) Epoch 11, batch 7500, loss[loss=0.1933, simple_loss=0.254, pruned_loss=0.06635, over 20836.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3004, pruned_loss=0.07915, over 4268372.28 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:14:27,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1874736.0, ans=0.2 2023-06-25 01:14:44,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1874796.0, ans=0.125 2023-06-25 01:15:23,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1874856.0, ans=0.2 2023-06-25 01:15:23,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1874856.0, ans=0.0 2023-06-25 01:15:28,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1874916.0, ans=0.125 2023-06-25 01:15:43,444 INFO [train.py:996] (3/4) Epoch 11, batch 7550, loss[loss=0.2194, simple_loss=0.2875, pruned_loss=0.0756, over 21218.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3091, pruned_loss=0.07866, over 4266837.27 frames. 
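
Most scaling.py:182 lines track a ScheduledFloat: a hyperparameter (dropout probability, skip rate, balancer limit, ...) whose value "ans" is a function of batch_count. A minimal sketch, assuming linear interpolation between (batch_count, value) breakpoints with clamping past the last one; the breakpoints below are invented for illustration:

    import bisect

    class ScheduledFloatSketch:
        def __init__(self, *points):
            # points: (batch_count, value) pairs, sorted by batch_count
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def value(self, batch_count: float) -> float:
            i = bisect.bisect_right(self.xs, batch_count)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout_p decaying from 0.3 to 0.1 over the first 20k batches:
    dropout_p = ScheduledFloatSketch((0, 0.3), (20_000, 0.1))
    print(dropout_p.value(1_876_116))  # -> 0.1, the plateau seen above
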
], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:16:15,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1875036.0, ans=0.125 2023-06-25 01:16:17,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 9.851e+02 1.650e+03 2.404e+03 5.031e+03, threshold=3.301e+03, percent-clipped=35.0 2023-06-25 01:16:49,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1875156.0, ans=0.5 2023-06-25 01:16:52,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1875156.0, ans=0.125 2023-06-25 01:17:29,901 INFO [train.py:996] (3/4) Epoch 11, batch 7600, loss[loss=0.2525, simple_loss=0.3144, pruned_loss=0.09527, over 21391.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3088, pruned_loss=0.07768, over 4276048.38 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 01:17:38,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1875276.0, ans=0.125 2023-06-25 01:18:03,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-06-25 01:18:18,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1875396.0, ans=0.0 2023-06-25 01:18:41,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875456.0, ans=0.1 2023-06-25 01:19:03,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-25 01:19:14,308 INFO [train.py:996] (3/4) Epoch 11, batch 7650, loss[loss=0.2387, simple_loss=0.3077, pruned_loss=0.0849, over 21485.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3084, pruned_loss=0.07932, over 4287542.64 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:19:20,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1875576.0, ans=0.125 2023-06-25 01:19:44,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.567e+02 1.161e+03 1.543e+03 3.222e+03, threshold=2.322e+03, percent-clipped=0.0 2023-06-25 01:19:58,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1875696.0, ans=0.0 2023-06-25 01:20:02,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-25 01:20:15,070 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:20:56,014 INFO [train.py:996] (3/4) Epoch 11, batch 7700, loss[loss=0.2162, simple_loss=0.2687, pruned_loss=0.08189, over 20779.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3119, pruned_loss=0.08185, over 4288816.56 frames. 
], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:22:21,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1876116.0, ans=0.0 2023-06-25 01:22:24,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876116.0, ans=0.1 2023-06-25 01:22:45,819 INFO [train.py:996] (3/4) Epoch 11, batch 7750, loss[loss=0.2885, simple_loss=0.4049, pruned_loss=0.08606, over 21198.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3194, pruned_loss=0.08225, over 4285759.85 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:22:57,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-25 01:23:10,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.928e+02 1.247e+03 1.821e+03 3.792e+03, threshold=2.494e+03, percent-clipped=12.0 2023-06-25 01:23:43,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876296.0, ans=0.1 2023-06-25 01:24:31,881 INFO [train.py:996] (3/4) Epoch 11, batch 7800, loss[loss=0.2354, simple_loss=0.3174, pruned_loss=0.0767, over 21748.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3208, pruned_loss=0.08254, over 4282168.18 frames. ], batch size: 391, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:24:39,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1876476.0, ans=0.0 2023-06-25 01:25:26,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=15.0 2023-06-25 01:25:27,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1876656.0, ans=0.5 2023-06-25 01:25:56,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1876716.0, ans=0.125 2023-06-25 01:26:15,679 INFO [train.py:996] (3/4) Epoch 11, batch 7850, loss[loss=0.1745, simple_loss=0.2304, pruned_loss=0.05929, over 20760.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3123, pruned_loss=0.08136, over 4273053.78 frames. 
], batch size: 609, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:26:19,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1876776.0, ans=0.125 2023-06-25 01:26:23,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1876776.0, ans=0.125 2023-06-25 01:26:29,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1876776.0, ans=0.125 2023-06-25 01:26:46,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.105e+02 1.212e+03 1.898e+03 4.667e+03, threshold=2.425e+03, percent-clipped=9.0 2023-06-25 01:26:56,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1876836.0, ans=0.125 2023-06-25 01:27:08,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1876896.0, ans=0.05 2023-06-25 01:27:10,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-25 01:28:06,253 INFO [train.py:996] (3/4) Epoch 11, batch 7900, loss[loss=0.1857, simple_loss=0.2484, pruned_loss=0.06153, over 21431.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3055, pruned_loss=0.07989, over 4261289.40 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:28:16,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1877076.0, ans=0.04949747468305833 2023-06-25 01:28:18,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1877076.0, ans=0.125 2023-06-25 01:28:46,700 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:28:53,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1877196.0, ans=0.125 2023-06-25 01:29:00,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1877196.0, ans=0.125 2023-06-25 01:29:03,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1877196.0, ans=0.125 2023-06-25 01:29:03,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877196.0, ans=0.1 2023-06-25 01:29:57,766 INFO [train.py:996] (3/4) Epoch 11, batch 7950, loss[loss=0.2449, simple_loss=0.3279, pruned_loss=0.08093, over 21881.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3106, pruned_loss=0.07972, over 4256343.77 frames. 
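
The learning rate drifts from 2.69e-03 to 2.68e-03 across these records because both terms of the scheduler decay slowly at this depth of training. A sketch, assuming the Eden schedule used by icefall's zipformer recipes (the 0.045 / 7500 / 1.5 defaults are this run's configured base_lr, lr_batches and lr_epochs):

    def eden_lr(base_lr: float, step: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
        batch_factor = ((step / lr_batches) ** 2 + 1) ** -0.25
        epoch_factor = ((epoch / lr_epochs) ** 2 + 1) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # for plausible cumulative step counts in epoch 11 this lands in the
    # ~2.6e-03 range seen above (the global step is not printed in the log)
    print(eden_lr(0.045, step=300_000, epoch=11))
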
], batch size: 316, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:30:32,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1877436.0, ans=0.0 2023-06-25 01:30:35,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.494e+02 9.486e+02 1.599e+03 2.410e+03 5.026e+03, threshold=3.197e+03, percent-clipped=23.0 2023-06-25 01:30:45,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1877496.0, ans=10.0 2023-06-25 01:31:28,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1877556.0, ans=0.125 2023-06-25 01:31:47,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1877616.0, ans=0.2 2023-06-25 01:31:47,884 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:32:03,969 INFO [train.py:996] (3/4) Epoch 11, batch 8000, loss[loss=0.3025, simple_loss=0.375, pruned_loss=0.115, over 21445.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3146, pruned_loss=0.08153, over 4259130.44 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:32:05,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-25 01:32:16,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1877676.0, ans=0.125 2023-06-25 01:33:26,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1877856.0, ans=0.0 2023-06-25 01:33:26,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:33:37,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1877856.0, ans=0.2 2023-06-25 01:33:47,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1877916.0, ans=0.95 2023-06-25 01:33:56,979 INFO [train.py:996] (3/4) Epoch 11, batch 8050, loss[loss=0.3234, simple_loss=0.4055, pruned_loss=0.1206, over 21561.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3186, pruned_loss=0.08224, over 4261285.28 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:34:02,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1877976.0, ans=0.015 2023-06-25 01:34:34,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 8.572e+02 1.267e+03 1.861e+03 4.173e+03, threshold=2.534e+03, percent-clipped=4.0 2023-06-25 01:35:45,695 INFO [train.py:996] (3/4) Epoch 11, batch 8100, loss[loss=0.2913, simple_loss=0.3428, pruned_loss=0.12, over 21699.00 frames. ], tot_loss[loss=0.241, simple_loss=0.317, pruned_loss=0.08253, over 4268434.14 frames. 
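
grad_scale in these records is the dynamic fp16 loss scale: it is halved when a step produces non-finite gradients and doubled again after a stretch of clean steps, which is why it bounces among 8.0, 16.0 and 32.0 in this section. A generic PyTorch AMP sketch of that mechanism (not icefall's exact training loop, which wraps the scaler in its own optimizer code):

    import torch

    scaler = torch.cuda.amp.GradScaler(
        init_scale=16.0, growth_factor=2.0,
        backoff_factor=0.5, growth_interval=2000)

    def amp_step(model, optimizer, loss_fn, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)     # skipped internally if grads overflowed
        scaler.update()            # halves or doubles the scale as needed
        return scaler.get_scale()  # the grad_scale value the log prints
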
], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:35:57,376 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:36:22,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1878336.0, ans=0.125 2023-06-25 01:37:11,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1878456.0, ans=0.125 2023-06-25 01:37:15,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1878456.0, ans=0.07 2023-06-25 01:37:48,339 INFO [train.py:996] (3/4) Epoch 11, batch 8150, loss[loss=0.2643, simple_loss=0.3811, pruned_loss=0.07378, over 21165.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3231, pruned_loss=0.08346, over 4260462.77 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:38:17,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.023e+02 7.751e+02 1.218e+03 2.122e+03 5.445e+03, threshold=2.437e+03, percent-clipped=16.0 2023-06-25 01:38:47,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1878696.0, ans=0.2 2023-06-25 01:38:49,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 01:39:05,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-25 01:39:06,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1878756.0, ans=0.1 2023-06-25 01:39:24,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1878816.0, ans=0.2 2023-06-25 01:39:39,798 INFO [train.py:996] (3/4) Epoch 11, batch 8200, loss[loss=0.2065, simple_loss=0.2598, pruned_loss=0.07659, over 21117.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3164, pruned_loss=0.08129, over 4264247.95 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:39:41,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1878876.0, ans=0.0 2023-06-25 01:40:05,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1878936.0, ans=0.125 2023-06-25 01:40:55,696 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:41:00,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-25 01:41:06,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1879116.0, ans=0.125 2023-06-25 01:41:28,943 INFO [train.py:996] (3/4) Epoch 11, batch 8250, loss[loss=0.2769, simple_loss=0.3699, pruned_loss=0.09189, over 21624.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3157, pruned_loss=0.08166, over 4265907.72 frames. 
], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:41:32,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879176.0, ans=0.1 2023-06-25 01:41:42,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1879176.0, ans=0.125 2023-06-25 01:41:53,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1879236.0, ans=0.0 2023-06-25 01:41:58,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1879236.0, ans=0.125 2023-06-25 01:42:00,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.306e+02 1.035e+03 1.633e+03 3.565e+03, threshold=2.069e+03, percent-clipped=11.0 2023-06-25 01:42:01,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-25 01:42:19,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879296.0, ans=0.125 2023-06-25 01:42:24,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1879296.0, ans=0.125 2023-06-25 01:42:50,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1879356.0, ans=0.0 2023-06-25 01:43:04,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879416.0, ans=0.1 2023-06-25 01:43:17,157 INFO [train.py:996] (3/4) Epoch 11, batch 8300, loss[loss=0.2173, simple_loss=0.3057, pruned_loss=0.06448, over 21631.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3132, pruned_loss=0.07892, over 4270405.65 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:43:54,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-25 01:44:09,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-25 01:44:55,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 01:45:04,799 INFO [train.py:996] (3/4) Epoch 11, batch 8350, loss[loss=0.2384, simple_loss=0.3009, pruned_loss=0.08793, over 21746.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3127, pruned_loss=0.07778, over 4265664.97 frames. 
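
tot_loss is not a plain epoch average: its frame total sits near 4.27e+06 while single batches carry only ~2e+04 frames, which is consistent with an exponentially decayed, frame-weighted sum whose accumulators shrink by (1 - 1/reset_interval) each batch (reset_interval = 200 gives a plateau of about 200 * ~21k frames, matching these records). The exact bookkeeping is an assumption inferred from those totals:

    class DecayedLossTracker:
        def __init__(self, reset_interval: int = 200):
            self.decay = 1.0 - 1.0 / reset_interval
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> float:
            self.loss_sum = self.loss_sum * self.decay + loss * num_frames
            self.frames = self.frames * self.decay + num_frames
            return self.loss_sum / self.frames  # the logged tot_loss

    tracker = DecayedLossTracker()
    for _ in range(1000):  # frame total plateaus near 200 * 21400 ~= 4.28e6
        tracker.update(0.23, 21_400.0)
    print(tracker.frames)
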
], batch size: 102, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:45:44,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 7.785e+02 1.165e+03 1.706e+03 3.630e+03, threshold=2.331e+03, percent-clipped=15.0 2023-06-25 01:45:53,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1879896.0, ans=0.125 2023-06-25 01:46:21,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1879956.0, ans=0.125 2023-06-25 01:46:47,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-25 01:46:52,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1880076.0, ans=0.125 2023-06-25 01:46:53,163 INFO [train.py:996] (3/4) Epoch 11, batch 8400, loss[loss=0.2397, simple_loss=0.3215, pruned_loss=0.079, over 20784.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3091, pruned_loss=0.0744, over 4268728.25 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:47:53,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1880196.0, ans=0.0 2023-06-25 01:48:20,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1880256.0, ans=0.0 2023-06-25 01:48:36,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-25 01:48:41,832 INFO [train.py:996] (3/4) Epoch 11, batch 8450, loss[loss=0.2085, simple_loss=0.291, pruned_loss=0.06303, over 21503.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.307, pruned_loss=0.07303, over 4275207.92 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:48:58,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-25 01:49:20,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.433e+02 1.170e+03 1.916e+03 4.574e+03, threshold=2.341e+03, percent-clipped=17.0 2023-06-25 01:49:22,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1880436.0, ans=0.0 2023-06-25 01:49:52,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1880556.0, ans=0.0 2023-06-25 01:50:13,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880616.0, ans=0.1 2023-06-25 01:50:30,024 INFO [train.py:996] (3/4) Epoch 11, batch 8500, loss[loss=0.2529, simple_loss=0.3127, pruned_loss=0.09658, over 21264.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3024, pruned_loss=0.0741, over 4275705.94 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:51:08,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1880736.0, ans=0.0 2023-06-25 01:51:55,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.03 vs. 
limit=15.0 2023-06-25 01:51:57,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-25 01:52:18,588 INFO [train.py:996] (3/4) Epoch 11, batch 8550, loss[loss=0.2525, simple_loss=0.3557, pruned_loss=0.07467, over 20686.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3062, pruned_loss=0.0766, over 4279811.62 frames. ], batch size: 607, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:52:55,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1881036.0, ans=0.125 2023-06-25 01:52:56,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 6.860e+02 9.469e+02 1.395e+03 3.551e+03, threshold=1.894e+03, percent-clipped=10.0 2023-06-25 01:53:14,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1881096.0, ans=0.0 2023-06-25 01:53:32,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1881156.0, ans=0.125 2023-06-25 01:54:20,724 INFO [train.py:996] (3/4) Epoch 11, batch 8600, loss[loss=0.3103, simple_loss=0.3765, pruned_loss=0.1221, over 21536.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3125, pruned_loss=0.07869, over 4280587.76 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:54:23,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.27 vs. limit=22.5 2023-06-25 01:54:31,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1881276.0, ans=0.125 2023-06-25 01:54:48,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881336.0, ans=0.1 2023-06-25 01:54:53,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881336.0, ans=0.1 2023-06-25 01:54:57,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=22.5 2023-06-25 01:55:16,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881396.0, ans=0.1 2023-06-25 01:55:33,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1881456.0, ans=6.0 2023-06-25 01:55:35,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1881456.0, ans=0.125 2023-06-25 01:55:56,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1881516.0, ans=0.2 2023-06-25 01:56:09,645 INFO [train.py:996] (3/4) Epoch 11, batch 8650, loss[loss=0.1349, simple_loss=0.199, pruned_loss=0.03538, over 16541.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3176, pruned_loss=0.07906, over 4276974.41 frames. 
], batch size: 60, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:56:23,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881576.0, ans=0.1 2023-06-25 01:56:42,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1881636.0, ans=0.125 2023-06-25 01:56:43,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 8.530e+02 1.308e+03 2.199e+03 5.345e+03, threshold=2.615e+03, percent-clipped=30.0 2023-06-25 01:57:45,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881816.0, ans=0.1 2023-06-25 01:57:47,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1881816.0, ans=0.125 2023-06-25 01:57:52,020 INFO [train.py:996] (3/4) Epoch 11, batch 8700, loss[loss=0.2024, simple_loss=0.2765, pruned_loss=0.06416, over 21443.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3131, pruned_loss=0.07743, over 4274444.82 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:58:01,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-25 01:58:37,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1881996.0, ans=0.0 2023-06-25 01:59:10,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-25 01:59:36,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-25 01:59:38,921 INFO [train.py:996] (3/4) Epoch 11, batch 8750, loss[loss=0.2229, simple_loss=0.2949, pruned_loss=0.07545, over 21249.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3107, pruned_loss=0.07841, over 4274223.95 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:59:49,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-25 01:59:57,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1882176.0, ans=0.125 2023-06-25 02:00:16,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1882236.0, ans=0.0 2023-06-25 02:00:25,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.552e+02 1.572e+03 2.395e+03 4.841e+03, threshold=3.145e+03, percent-clipped=19.0 2023-06-25 02:01:31,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1882476.0, ans=0.0 2023-06-25 02:01:32,850 INFO [train.py:996] (3/4) Epoch 11, batch 8800, loss[loss=0.3404, simple_loss=0.4078, pruned_loss=0.1365, over 21442.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.319, pruned_loss=0.08067, over 4278561.76 frames. 
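
The scaling.py:962 Whitening lines compare a spread statistic of a module's channel covariance against a limit, nudging activations toward a white (isotropic) covariance when the metric exceeds it. One plausible proxy for such a metric, assumed for illustration and not necessarily the exact statistic in scaling.py: mean squared eigenvalue over squared mean eigenvalue, which is 1.0 for perfectly white features and grows with anisotropy:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return float((eigs ** 2).mean() / eigs.mean() ** 2)

    x = torch.randn(1000, 256) * torch.linspace(0.5, 2.0, 256)
    print("metric=%.2f vs. limit=15.0" % whitening_metric(x))
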
], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:01:42,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1882476.0, ans=0.05 2023-06-25 02:01:46,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1882476.0, ans=0.015 2023-06-25 02:02:02,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1882536.0, ans=0.125 2023-06-25 02:02:14,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882536.0, ans=0.1 2023-06-25 02:02:31,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882596.0, ans=0.1 2023-06-25 02:02:51,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1882656.0, ans=0.0 2023-06-25 02:02:53,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1882656.0, ans=0.2 2023-06-25 02:02:56,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1882716.0, ans=0.2 2023-06-25 02:03:25,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-25 02:03:27,962 INFO [train.py:996] (3/4) Epoch 11, batch 8850, loss[loss=0.2224, simple_loss=0.3102, pruned_loss=0.06727, over 21158.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3243, pruned_loss=0.08237, over 4278605.10 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:03:43,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1882836.0, ans=0.125 2023-06-25 02:03:53,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-25 02:04:04,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 8.533e+02 1.157e+03 2.147e+03 4.267e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-25 02:04:11,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1882896.0, ans=0.125 2023-06-25 02:04:31,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-25 02:04:40,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-25 02:04:43,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1882956.0, ans=0.125 2023-06-25 02:04:48,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1883016.0, ans=6.0 2023-06-25 02:05:17,223 INFO [train.py:996] (3/4) Epoch 11, batch 8900, loss[loss=0.234, simple_loss=0.3194, pruned_loss=0.07433, over 21860.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.318, pruned_loss=0.08077, over 4267851.88 frames. 
], batch size: 372, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:06:49,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1883316.0, ans=0.125 2023-06-25 02:07:13,523 INFO [train.py:996] (3/4) Epoch 11, batch 8950, loss[loss=0.229, simple_loss=0.319, pruned_loss=0.0695, over 21717.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.316, pruned_loss=0.07986, over 4272235.66 frames. ], batch size: 351, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:07:27,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1883376.0, ans=0.125 2023-06-25 02:07:36,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1883436.0, ans=0.125 2023-06-25 02:07:36,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1883436.0, ans=0.0 2023-06-25 02:07:47,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1883436.0, ans=0.1 2023-06-25 02:07:48,283 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.956e+02 1.198e+03 2.154e+03 4.592e+03, threshold=2.397e+03, percent-clipped=22.0 2023-06-25 02:08:35,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-25 02:08:38,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1883616.0, ans=0.0 2023-06-25 02:08:55,069 INFO [train.py:996] (3/4) Epoch 11, batch 9000, loss[loss=0.2001, simple_loss=0.2652, pruned_loss=0.06747, over 21704.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3111, pruned_loss=0.0795, over 4273059.78 frames. ], batch size: 333, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:08:55,070 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 02:09:12,568 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2589, simple_loss=0.3526, pruned_loss=0.08262, over 1796401.00 frames. 2023-06-25 02:09:12,568 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 02:09:29,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-25 02:09:33,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1883736.0, ans=0.0 2023-06-25 02:10:42,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1883916.0, ans=0.125 2023-06-25 02:10:55,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1883916.0, ans=0.0 2023-06-25 02:11:00,336 INFO [train.py:996] (3/4) Epoch 11, batch 9050, loss[loss=0.2786, simple_loss=0.348, pruned_loss=0.1046, over 21760.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3082, pruned_loss=0.07612, over 4268826.24 frames. 
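
The validation block above (train.py:1019/1028/1029) pauses training every valid_interval batches, evaluates a frame-weighted loss over the fixed 1796401-frame dev set, and reports the CUDA high-water mark. A sketch of that bracket; the helper names are illustrative, while torch.cuda.max_memory_allocated is the real API behind the "Maximum memory allocated" line:

    import torch

    def compute_validation_loss(model, dev_loader, loss_fn, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = loss_fn(model, batch)
                tot_loss += float(loss) * num_frames
                tot_frames += num_frames
        model.train()
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print("validation: loss=%.4f, over %.2f frames." %
              (tot_loss / tot_frames, tot_frames))
        print("Maximum memory allocated so far is %dMB" % mem_mb)
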
], batch size: 441, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:11:44,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.917e+02 7.004e+02 1.025e+03 1.804e+03 4.936e+03, threshold=2.049e+03, percent-clipped=10.0 2023-06-25 02:12:17,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1884156.0, ans=0.2 2023-06-25 02:12:25,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1884156.0, ans=0.125 2023-06-25 02:12:50,832 INFO [train.py:996] (3/4) Epoch 11, batch 9100, loss[loss=0.1984, simple_loss=0.3017, pruned_loss=0.04757, over 21899.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.315, pruned_loss=0.07915, over 4273109.23 frames. ], batch size: 317, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:14:38,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1884576.0, ans=0.125 2023-06-25 02:14:40,242 INFO [train.py:996] (3/4) Epoch 11, batch 9150, loss[loss=0.3272, simple_loss=0.4104, pruned_loss=0.122, over 21518.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.316, pruned_loss=0.0765, over 4270897.53 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:14:57,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-25 02:15:21,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.034e+03 1.434e+03 2.123e+03 3.847e+03, threshold=2.868e+03, percent-clipped=26.0 2023-06-25 02:15:44,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1884696.0, ans=0.125 2023-06-25 02:16:21,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1884816.0, ans=0.125 2023-06-25 02:16:33,367 INFO [train.py:996] (3/4) Epoch 11, batch 9200, loss[loss=0.2583, simple_loss=0.3504, pruned_loss=0.08309, over 21214.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3168, pruned_loss=0.07544, over 4275373.14 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:18:05,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1885116.0, ans=10.0 2023-06-25 02:18:15,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1885116.0, ans=0.125 2023-06-25 02:18:20,283 INFO [train.py:996] (3/4) Epoch 11, batch 9250, loss[loss=0.2804, simple_loss=0.3185, pruned_loss=0.1212, over 21423.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3206, pruned_loss=0.07825, over 4276098.80 frames. ], batch size: 510, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:18:56,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.312e+02 1.043e+03 1.613e+03 4.110e+03, threshold=2.085e+03, percent-clipped=7.0 2023-06-25 02:19:24,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1885356.0, ans=0.125 2023-06-25 02:20:14,357 INFO [train.py:996] (3/4) Epoch 11, batch 9300, loss[loss=0.2534, simple_loss=0.3577, pruned_loss=0.07453, over 21240.00 frames. 
], tot_loss[loss=0.2347, simple_loss=0.3138, pruned_loss=0.07784, over 4275449.52 frames. ], batch size: 549, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:20:30,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=8.0 2023-06-25 02:20:57,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1885596.0, ans=0.0 2023-06-25 02:22:02,525 INFO [train.py:996] (3/4) Epoch 11, batch 9350, loss[loss=0.2404, simple_loss=0.3242, pruned_loss=0.07827, over 21874.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3194, pruned_loss=0.07914, over 4270260.68 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:22:24,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1885776.0, ans=0.125 2023-06-25 02:22:26,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1885836.0, ans=0.1 2023-06-25 02:22:31,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 02:22:41,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.377e+02 8.449e+02 1.377e+03 2.044e+03 3.190e+03, threshold=2.753e+03, percent-clipped=23.0 2023-06-25 02:23:38,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1886016.0, ans=0.2 2023-06-25 02:23:52,734 INFO [train.py:996] (3/4) Epoch 11, batch 9400, loss[loss=0.2525, simple_loss=0.3126, pruned_loss=0.0962, over 21620.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.323, pruned_loss=0.07974, over 4272432.08 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:23:53,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1886076.0, ans=0.125 2023-06-25 02:24:16,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1886136.0, ans=0.0 2023-06-25 02:24:46,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1886196.0, ans=0.125 2023-06-25 02:24:48,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1886196.0, ans=0.0 2023-06-25 02:24:57,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-25 02:25:04,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1886256.0, ans=0.125 2023-06-25 02:25:44,603 INFO [train.py:996] (3/4) Epoch 11, batch 9450, loss[loss=0.2204, simple_loss=0.2797, pruned_loss=0.08054, over 21609.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3157, pruned_loss=0.0787, over 4256908.59 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:25:56,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.92 vs. 
2023-06-25 02:25:57,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1886376.0, ans=0.125
2023-06-25 02:26:03,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1886436.0, ans=0.0
2023-06-25 02:26:20,761 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 9.191e+02 1.408e+03 2.175e+03 4.648e+03, threshold=2.816e+03, percent-clipped=10.0
2023-06-25 02:26:49,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886556.0, ans=0.1
2023-06-25 02:27:08,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1886556.0, ans=0.125
2023-06-25 02:27:33,355 INFO [train.py:996] (3/4) Epoch 11, batch 9500, loss[loss=0.1827, simple_loss=0.2729, pruned_loss=0.04624, over 21748.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3077, pruned_loss=0.07692, over 4250185.06 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:27:51,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0
2023-06-25 02:28:08,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5
2023-06-25 02:28:12,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886796.0, ans=0.1
2023-06-25 02:28:34,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1886856.0, ans=0.125
2023-06-25 02:28:45,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1886856.0, ans=0.125
2023-06-25 02:28:59,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.15 vs. limit=22.5
2023-06-25 02:29:22,384 INFO [train.py:996] (3/4) Epoch 11, batch 9550, loss[loss=0.2393, simple_loss=0.3304, pruned_loss=0.07414, over 21669.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3105, pruned_loss=0.07939, over 4252236.57 frames. ], batch size: 263, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:29:57,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 8.918e+02 1.397e+03 2.020e+03 4.656e+03, threshold=2.794e+03, percent-clipped=11.0
2023-06-25 02:30:44,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1887156.0, ans=0.125
2023-06-25 02:31:08,639 INFO [train.py:996] (3/4) Epoch 11, batch 9600, loss[loss=0.2279, simple_loss=0.3023, pruned_loss=0.0768, over 21773.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3128, pruned_loss=0.0802, over 4263328.64 frames. ], batch size: 112, lr: 2.68e-03, grad_scale: 32.0
2023-06-25 02:31:15,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1887276.0, ans=0.125
2023-06-25 02:31:56,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1887396.0, ans=0.1
2023-06-25 02:32:06,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0
2023-06-25 02:32:48,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1887516.0, ans=0.125
2023-06-25 02:32:52,524 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-25 02:32:56,738 INFO [train.py:996] (3/4) Epoch 11, batch 9650, loss[loss=0.2634, simple_loss=0.3323, pruned_loss=0.09724, over 21319.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.313, pruned_loss=0.0804, over 4266126.08 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:33:23,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1887636.0, ans=0.125
2023-06-25 02:33:34,592 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.589e+02 1.260e+03 1.923e+03 2.986e+03, threshold=2.520e+03, percent-clipped=3.0
2023-06-25 02:33:34,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1887696.0, ans=0.125
2023-06-25 02:34:35,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1887816.0, ans=0.0
2023-06-25 02:34:45,479 INFO [train.py:996] (3/4) Epoch 11, batch 9700, loss[loss=0.2398, simple_loss=0.3172, pruned_loss=0.08118, over 21353.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.315, pruned_loss=0.08035, over 4259055.19 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:35:15,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=8.0
2023-06-25 02:35:25,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1887936.0, ans=0.125
2023-06-25 02:35:45,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1887996.0, ans=0.1
2023-06-25 02:35:55,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1888056.0, ans=0.125
2023-06-25 02:36:02,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1888056.0, ans=0.125
2023-06-25 02:36:34,150 INFO [train.py:996] (3/4) Epoch 11, batch 9750, loss[loss=0.2193, simple_loss=0.2776, pruned_loss=0.08049, over 21593.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3073, pruned_loss=0.07872, over 4263993.24 frames. ], batch size: 415, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:36:52,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1888176.0, ans=0.0
2023-06-25 02:37:09,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 8.182e+02 1.091e+03 1.675e+03 6.818e+03, threshold=2.183e+03, percent-clipped=8.0
2023-06-25 02:37:32,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1888296.0, ans=10.0
2023-06-25 02:37:39,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1888356.0, ans=0.05
2023-06-25 02:38:04,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1888416.0, ans=0.125
2023-06-25 02:38:17,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0
2023-06-25 02:38:19,327 INFO [train.py:996] (3/4) Epoch 11, batch 9800, loss[loss=0.2212, simple_loss=0.297, pruned_loss=0.07268, over 21798.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3083, pruned_loss=0.07897, over 4267816.38 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0
2023-06-25 02:38:25,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0
2023-06-25 02:38:49,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1888536.0, ans=0.0
2023-06-25 02:38:49,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1888536.0, ans=0.5
2023-06-25 02:39:27,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1888656.0, ans=0.125
2023-06-25 02:39:38,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1888656.0, ans=0.125
2023-06-25 02:39:55,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1888716.0, ans=0.0
2023-06-25 02:40:05,027 INFO [train.py:996] (3/4) Epoch 11, batch 9850, loss[loss=0.2247, simple_loss=0.2759, pruned_loss=0.08681, over 21331.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3058, pruned_loss=0.07983, over 4273418.62 frames. ], batch size: 473, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:40:41,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 6.727e+02 9.053e+02 1.353e+03 2.861e+03, threshold=1.811e+03, percent-clipped=2.0
2023-06-25 02:40:57,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1888896.0, ans=0.125
2023-06-25 02:41:03,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1888896.0, ans=0.125
2023-06-25 02:41:45,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1889016.0, ans=0.0
2023-06-25 02:41:53,306 INFO [train.py:996] (3/4) Epoch 11, batch 9900, loss[loss=0.2032, simple_loss=0.27, pruned_loss=0.06819, over 21910.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3018, pruned_loss=0.07895, over 4279885.48 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:42:00,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1889076.0, ans=0.125
2023-06-25 02:42:14,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1889136.0, ans=0.125
2023-06-25 02:43:31,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1889316.0, ans=0.0
2023-06-25 02:43:38,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889376.0, ans=0.1
2023-06-25 02:43:40,112 INFO [train.py:996] (3/4) Epoch 11, batch 9950, loss[loss=0.2024, simple_loss=0.2622, pruned_loss=0.07132, over 21566.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3033, pruned_loss=0.08082, over 4275308.29 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:43:42,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0
2023-06-25 02:44:22,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1889436.0, ans=0.0
2023-06-25 02:44:23,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.849e+02 1.088e+03 1.572e+03 3.841e+03, threshold=2.175e+03, percent-clipped=17.0
2023-06-25 02:45:20,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5
2023-06-25 02:45:21,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1889616.0, ans=0.125
2023-06-25 02:45:36,538 INFO [train.py:996] (3/4) Epoch 11, batch 10000, loss[loss=0.2155, simple_loss=0.2908, pruned_loss=0.07007, over 21531.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3002, pruned_loss=0.08014, over 4274686.77 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 32.0
2023-06-25 02:46:24,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0
2023-06-25 02:47:25,703 INFO [train.py:996] (3/4) Epoch 11, batch 10050, loss[loss=0.2952, simple_loss=0.4098, pruned_loss=0.09023, over 19767.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3035, pruned_loss=0.08052, over 4270015.06 frames. ], batch size: 703, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:48:13,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.178e+02 7.347e+02 1.195e+03 1.566e+03 3.839e+03, threshold=2.391e+03, percent-clipped=10.0
2023-06-25 02:48:20,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1890096.0, ans=0.125
2023-06-25 02:48:40,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1890156.0, ans=0.0
2023-06-25 02:48:43,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5
2023-06-25 02:49:12,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1890216.0, ans=0.125
2023-06-25 02:49:16,345 INFO [train.py:996] (3/4) Epoch 11, batch 10100, loss[loss=0.2408, simple_loss=0.3061, pruned_loss=0.08775, over 21611.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3023, pruned_loss=0.07949, over 4268525.66 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:49:18,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1890276.0, ans=0.125
2023-06-25 02:49:50,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890336.0, ans=0.1
2023-06-25 02:50:51,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1890516.0, ans=0.125
2023-06-25 02:50:51,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1890516.0, ans=0.125
2023-06-25 02:50:55,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1890516.0, ans=0.125
2023-06-25 02:51:12,549 INFO [train.py:996] (3/4) Epoch 11, batch 10150, loss[loss=0.232, simple_loss=0.31, pruned_loss=0.07702, over 21671.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3073, pruned_loss=0.08135, over 4271414.89 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:51:32,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1890576.0, ans=10.0
2023-06-25 02:51:59,199 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.141e+02 7.484e+02 1.008e+03 1.435e+03 3.139e+03, threshold=2.017e+03, percent-clipped=8.0
2023-06-25 02:52:08,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1890696.0, ans=0.125
2023-06-25 02:52:52,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890816.0, ans=0.1
2023-06-25 02:52:54,884 INFO [train.py:996] (3/4) Epoch 11, batch 10200, loss[loss=0.2112, simple_loss=0.2807, pruned_loss=0.07085, over 21722.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3064, pruned_loss=0.07903, over 4278576.60 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:53:15,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1890876.0, ans=0.125
2023-06-25 02:53:44,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1890996.0, ans=0.2
2023-06-25 02:53:58,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1890996.0, ans=0.07
2023-06-25 02:54:24,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1891116.0, ans=0.0
2023-06-25 02:54:25,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1891116.0, ans=0.125
2023-06-25 02:54:44,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0
2023-06-25 02:54:47,842 INFO [train.py:996] (3/4) Epoch 11, batch 10250, loss[loss=0.1763, simple_loss=0.2489, pruned_loss=0.05189, over 17304.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.301, pruned_loss=0.07395, over 4267221.48 frames. ], batch size: 62, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:55:08,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891176.0, ans=0.1
2023-06-25 02:55:29,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1891236.0, ans=0.0
2023-06-25 02:55:38,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 8.201e+02 1.215e+03 1.712e+03 3.588e+03, threshold=2.431e+03, percent-clipped=17.0
2023-06-25 02:55:45,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5
2023-06-25 02:56:46,808 INFO [train.py:996] (3/4) Epoch 11, batch 10300, loss[loss=0.2915, simple_loss=0.3903, pruned_loss=0.09639, over 19855.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3053, pruned_loss=0.0753, over 4262986.25 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 02:56:48,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1891476.0, ans=0.0
2023-06-25 02:57:01,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0
2023-06-25 02:57:52,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1891656.0, ans=0.125
2023-06-25 02:57:52,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1891656.0, ans=0.125
2023-06-25 02:58:21,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1891716.0, ans=0.0
2023-06-25 02:58:34,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891716.0, ans=0.1
2023-06-25 02:58:38,055 INFO [train.py:996] (3/4) Epoch 11, batch 10350, loss[loss=0.2201, simple_loss=0.3176, pruned_loss=0.06128, over 21202.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3073, pruned_loss=0.07564, over 4265484.48 frames. ], batch size: 549, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 02:58:40,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1891776.0, ans=0.125
2023-06-25 02:59:25,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.545e+02 9.168e+02 1.323e+03 1.995e+03 3.228e+03, threshold=2.646e+03, percent-clipped=12.0
2023-06-25 02:59:31,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1891896.0, ans=0.2
2023-06-25 02:59:52,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1891956.0, ans=0.0
2023-06-25 03:00:14,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1892016.0, ans=0.125
2023-06-25 03:00:32,824 INFO [train.py:996] (3/4) Epoch 11, batch 10400, loss[loss=0.243, simple_loss=0.3191, pruned_loss=0.08346, over 21735.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3002, pruned_loss=0.07405, over 4262672.87 frames. ], batch size: 415, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:00:45,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892076.0, ans=0.125
2023-06-25 03:01:20,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1892196.0, ans=0.0
2023-06-25 03:01:37,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0
2023-06-25 03:01:51,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1892256.0, ans=0.125
2023-06-25 03:02:23,350 INFO [train.py:996] (3/4) Epoch 11, batch 10450, loss[loss=0.3614, simple_loss=0.4275, pruned_loss=0.1476, over 21389.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3079, pruned_loss=0.07773, over 4264876.60 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:02:48,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0
2023-06-25 03:02:56,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0
2023-06-25 03:03:04,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 9.525e+02 1.455e+03 2.411e+03 5.571e+03, threshold=2.910e+03, percent-clipped=19.0
2023-06-25 03:04:10,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1892676.0, ans=0.0
2023-06-25 03:04:11,704 INFO [train.py:996] (3/4) Epoch 11, batch 10500, loss[loss=0.2177, simple_loss=0.2794, pruned_loss=0.07796, over 21545.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3061, pruned_loss=0.07634, over 4266440.01 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:04:13,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1892676.0, ans=0.125
2023-06-25 03:04:53,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0
2023-06-25 03:05:18,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1892856.0, ans=0.125
2023-06-25 03:05:26,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1892856.0, ans=0.125
2023-06-25 03:05:27,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0
2023-06-25 03:05:44,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1892916.0, ans=0.0
2023-06-25 03:05:57,582 INFO [train.py:996] (3/4) Epoch 11, batch 10550, loss[loss=0.1907, simple_loss=0.2576, pruned_loss=0.06192, over 21576.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3006, pruned_loss=0.07534, over 4269553.97 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:06:32,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0
2023-06-25 03:06:39,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 7.309e+02 9.989e+02 1.510e+03 3.276e+03, threshold=1.998e+03, percent-clipped=4.0
2023-06-25 03:07:51,000 INFO [train.py:996] (3/4) Epoch 11, batch 10600, loss[loss=0.1948, simple_loss=0.2719, pruned_loss=0.05879, over 21265.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2961, pruned_loss=0.07433, over 4268335.36 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:07:51,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1893276.0, ans=0.0
2023-06-25 03:08:22,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1893336.0, ans=0.04949747468305833
2023-06-25 03:09:03,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0
2023-06-25 03:09:39,375 INFO [train.py:996] (3/4) Epoch 11, batch 10650, loss[loss=0.172, simple_loss=0.2473, pruned_loss=0.04837, over 21358.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2977, pruned_loss=0.07214, over 4268309.69 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:09:41,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1893576.0, ans=0.125
2023-06-25 03:09:48,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=22.5
2023-06-25 03:09:52,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893576.0, ans=0.1
2023-06-25 03:10:16,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1893636.0, ans=15.0
2023-06-25 03:10:23,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 7.785e+02 1.184e+03 1.890e+03 4.480e+03, threshold=2.368e+03, percent-clipped=23.0
2023-06-25 03:10:38,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1893696.0, ans=0.125
2023-06-25 03:10:53,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1893756.0, ans=0.0
2023-06-25 03:11:09,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893816.0, ans=0.1
2023-06-25 03:11:22,657 INFO [train.py:996] (3/4) Epoch 11, batch 10700, loss[loss=0.2682, simple_loss=0.3342, pruned_loss=0.1011, over 21222.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2989, pruned_loss=0.07308, over 4264589.17 frames. ], batch size: 143, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:12:05,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1893936.0, ans=0.125
2023-06-25 03:12:10,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893996.0, ans=0.1
2023-06-25 03:12:35,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1894056.0, ans=0.125
2023-06-25 03:12:45,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0
2023-06-25 03:13:09,964 INFO [train.py:996] (3/4) Epoch 11, batch 10750, loss[loss=0.2247, simple_loss=0.3403, pruned_loss=0.05452, over 21303.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3085, pruned_loss=0.07695, over 4267083.24 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:13:25,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1894176.0, ans=0.1
2023-06-25 03:13:36,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1894176.0, ans=0.0
2023-06-25 03:13:59,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0
2023-06-25 03:14:04,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1894296.0, ans=0.04949747468305833
2023-06-25 03:14:06,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.572e+02 8.726e+02 1.242e+03 1.937e+03 5.296e+03, threshold=2.484e+03, percent-clipped=18.0
2023-06-25 03:14:13,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894296.0, ans=0.1
2023-06-25 03:14:30,240 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-25 03:14:40,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1894416.0, ans=0.125
2023-06-25 03:15:07,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894416.0, ans=0.1
2023-06-25 03:15:11,136 INFO [train.py:996] (3/4) Epoch 11, batch 10800, loss[loss=0.282, simple_loss=0.3473, pruned_loss=0.1084, over 21250.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3134, pruned_loss=0.07805, over 4266748.16 frames. ], batch size: 143, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:15:22,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1894476.0, ans=0.0
2023-06-25 03:15:24,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1894476.0, ans=0.1
2023-06-25 03:15:25,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1894476.0, ans=0.0
2023-06-25 03:15:51,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0
2023-06-25 03:16:06,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894596.0, ans=0.1
2023-06-25 03:16:59,868 INFO [train.py:996] (3/4) Epoch 11, batch 10850, loss[loss=0.2056, simple_loss=0.2749, pruned_loss=0.06813, over 21214.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3138, pruned_loss=0.0781, over 4263871.38 frames. ], batch size: 549, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:17:40,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1894836.0, ans=0.0
2023-06-25 03:17:48,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.535e+02 7.569e+02 9.387e+02 1.863e+03 6.222e+03, threshold=1.877e+03, percent-clipped=9.0
2023-06-25 03:18:17,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1894956.0, ans=0.07
2023-06-25 03:18:49,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895076.0, ans=0.1
2023-06-25 03:18:50,672 INFO [train.py:996] (3/4) Epoch 11, batch 10900, loss[loss=0.1965, simple_loss=0.2713, pruned_loss=0.0609, over 15783.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3061, pruned_loss=0.07594, over 4254831.16 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:20:06,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1895256.0, ans=0.125
2023-06-25 03:20:16,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5
2023-06-25 03:20:37,853 INFO [train.py:996] (3/4) Epoch 11, batch 10950, loss[loss=0.1975, simple_loss=0.2682, pruned_loss=0.0634, over 21647.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2999, pruned_loss=0.07349, over 4257842.78 frames. ], batch size: 333, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:20:47,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1895376.0, ans=0.125
2023-06-25 03:21:12,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1895436.0, ans=0.07
2023-06-25 03:21:26,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.087e+02 9.989e+02 1.560e+03 2.958e+03, threshold=1.998e+03, percent-clipped=15.0
2023-06-25 03:21:38,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1895556.0, ans=0.0
2023-06-25 03:22:06,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1895616.0, ans=0.125
2023-06-25 03:22:25,899 INFO [train.py:996] (3/4) Epoch 11, batch 11000, loss[loss=0.2306, simple_loss=0.3028, pruned_loss=0.07923, over 21524.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2999, pruned_loss=0.07452, over 4263226.26 frames. ], batch size: 131, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:23:07,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895736.0, ans=0.1
2023-06-25 03:23:18,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1895796.0, ans=0.125
2023-06-25 03:23:23,351 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 03:23:28,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1895856.0, ans=0.125
2023-06-25 03:23:54,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1895916.0, ans=0.125
2023-06-25 03:24:05,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1895916.0, ans=0.2
2023-06-25 03:24:12,470 INFO [train.py:996] (3/4) Epoch 11, batch 11050, loss[loss=0.2068, simple_loss=0.2656, pruned_loss=0.074, over 21373.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2982, pruned_loss=0.07612, over 4259701.90 frames. ], batch size: 144, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:24:46,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1896036.0, ans=0.125
2023-06-25 03:24:57,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 7.118e+02 9.875e+02 1.339e+03 2.675e+03, threshold=1.975e+03, percent-clipped=6.0
2023-06-25 03:25:09,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1896156.0, ans=0.125
2023-06-25 03:25:17,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1896156.0, ans=0.125
2023-06-25 03:25:54,919 INFO [train.py:996] (3/4) Epoch 11, batch 11100, loss[loss=0.243, simple_loss=0.3005, pruned_loss=0.09274, over 14919.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2973, pruned_loss=0.07704, over 4255595.69 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:26:41,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1896396.0, ans=0.0
2023-06-25 03:26:59,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1896456.0, ans=0.025
2023-06-25 03:27:17,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1896516.0, ans=0.125
2023-06-25 03:27:41,684 INFO [train.py:996] (3/4) Epoch 11, batch 11150, loss[loss=0.2204, simple_loss=0.3234, pruned_loss=0.0587, over 21746.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2957, pruned_loss=0.0767, over 4249927.38 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:28:12,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1896636.0, ans=0.04949747468305833
2023-06-25 03:28:14,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1896636.0, ans=0.125
2023-06-25 03:28:31,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.186e+02 9.135e+02 1.372e+03 3.865e+03, threshold=1.827e+03, percent-clipped=12.0
2023-06-25 03:29:09,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896756.0, ans=0.1
2023-06-25 03:29:31,019 INFO [train.py:996] (3/4) Epoch 11, batch 11200, loss[loss=0.2229, simple_loss=0.2748, pruned_loss=0.0855, over 21752.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2941, pruned_loss=0.07593, over 4246413.54 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 32.0
2023-06-25 03:31:01,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1897116.0, ans=0.0
2023-06-25 03:31:04,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1897116.0, ans=0.125
2023-06-25 03:31:15,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1897116.0, ans=0.0
2023-06-25 03:31:15,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0
2023-06-25 03:31:19,609 INFO [train.py:996] (3/4) Epoch 11, batch 11250, loss[loss=0.2043, simple_loss=0.2841, pruned_loss=0.06226, over 21563.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2952, pruned_loss=0.07631, over 4257404.80 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 32.0
2023-06-25 03:32:07,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 7.485e+02 1.049e+03 1.491e+03 3.670e+03, threshold=2.098e+03, percent-clipped=11.0
2023-06-25 03:33:07,306 INFO [train.py:996] (3/4) Epoch 11, batch 11300, loss[loss=0.2004, simple_loss=0.2823, pruned_loss=0.05932, over 21750.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2959, pruned_loss=0.07649, over 4269269.55 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 32.0
2023-06-25 03:33:19,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1897476.0, ans=0.0
2023-06-25 03:33:38,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1897536.0, ans=0.125
2023-06-25 03:33:48,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897536.0, ans=0.1
2023-06-25 03:33:56,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1897596.0, ans=0.0
2023-06-25 03:34:24,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1897656.0, ans=0.2
2023-06-25 03:34:38,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0
2023-06-25 03:34:48,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1897716.0, ans=0.5
2023-06-25 03:34:54,359 INFO [train.py:996] (3/4) Epoch 11, batch 11350, loss[loss=0.2885, simple_loss=0.3592, pruned_loss=0.1089, over 21380.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2969, pruned_loss=0.07618, over 4270999.94 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:35:07,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1897776.0, ans=0.125
2023-06-25 03:35:47,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.865e+02 1.156e+03 1.769e+03 3.739e+03, threshold=2.312e+03, percent-clipped=14.0
2023-06-25 03:36:23,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=12.0
2023-06-25 03:36:26,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0
2023-06-25 03:36:51,917 INFO [train.py:996] (3/4) Epoch 11, batch 11400, loss[loss=0.219, simple_loss=0.2948, pruned_loss=0.07158, over 20020.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3015, pruned_loss=0.07762, over 4277936.41 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:36:52,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1898076.0, ans=0.125
2023-06-25 03:36:55,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1898076.0, ans=0.125
2023-06-25 03:37:10,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1898076.0, ans=0.125
2023-06-25 03:37:56,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898256.0, ans=0.1
2023-06-25 03:37:57,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1898256.0, ans=0.125
2023-06-25 03:38:21,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0
2023-06-25 03:38:39,459 INFO [train.py:996] (3/4) Epoch 11, batch 11450, loss[loss=0.209, simple_loss=0.2729, pruned_loss=0.0725, over 21214.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3006, pruned_loss=0.07507, over 4270609.73 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:39:02,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1898436.0, ans=0.125
2023-06-25 03:39:10,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1898436.0, ans=0.125
2023-06-25 03:39:16,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1898436.0, ans=15.0
2023-06-25 03:39:33,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.188e+02 7.976e+02 1.094e+03 1.671e+03 3.367e+03, threshold=2.188e+03, percent-clipped=9.0
2023-06-25 03:40:16,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898616.0, ans=0.1
2023-06-25 03:40:29,725 INFO [train.py:996] (3/4) Epoch 11, batch 11500, loss[loss=0.1778, simple_loss=0.2314, pruned_loss=0.06208, over 20885.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3039, pruned_loss=0.07641, over 4271200.74 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:41:15,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1898796.0, ans=0.0
2023-06-25 03:41:38,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1898856.0, ans=0.0
2023-06-25 03:41:47,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1898856.0, ans=0.125
2023-06-25 03:42:25,753 INFO [train.py:996] (3/4) Epoch 11, batch 11550, loss[loss=0.2957, simple_loss=0.4243, pruned_loss=0.08354, over 21181.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3116, pruned_loss=0.07707, over 4271833.81 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 03:42:50,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1899036.0, ans=0.2
2023-06-25 03:43:21,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.334e+02 7.903e+02 1.066e+03 1.850e+03 4.952e+03, threshold=2.132e+03, percent-clipped=19.0
2023-06-25 03:43:36,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0
2023-06-25 03:44:11,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1899216.0, ans=0.125
2023-06-25 03:44:16,553 INFO [train.py:996] (3/4) Epoch 11, batch 11600, loss[loss=0.2392, simple_loss=0.3391, pruned_loss=0.06961, over 21571.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3245, pruned_loss=0.07842, over 4266394.71 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:44:37,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899336.0, ans=0.1
2023-06-25 03:44:55,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1899336.0, ans=0.125
2023-06-25 03:45:39,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899456.0, ans=0.0
2023-06-25 03:45:59,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5
2023-06-25 03:46:03,265 INFO [train.py:996] (3/4) Epoch 11, batch 11650, loss[loss=0.2497, simple_loss=0.3229, pruned_loss=0.08827, over 21846.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3317, pruned_loss=0.07998, over 4262803.99 frames. ], batch size: 372, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:46:11,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1899576.0, ans=0.2
2023-06-25 03:46:13,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0
2023-06-25 03:46:30,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1899636.0, ans=0.125
2023-06-25 03:47:00,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1899696.0, ans=0.0
2023-06-25 03:47:01,632 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.300e+02 9.276e+02 1.301e+03 2.293e+03 3.963e+03, threshold=2.603e+03, percent-clipped=26.0
2023-06-25 03:47:30,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1899756.0, ans=0.0
2023-06-25 03:47:55,802 INFO [train.py:996] (3/4) Epoch 11, batch 11700, loss[loss=0.1968, simple_loss=0.2591, pruned_loss=0.06724, over 21588.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3227, pruned_loss=0.07994, over 4255126.50 frames. ], batch size: 213, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:47:58,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1899876.0, ans=0.1
2023-06-25 03:47:59,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899876.0, ans=0.1
2023-06-25 03:48:09,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1899876.0, ans=0.125
2023-06-25 03:48:40,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899996.0, ans=0.0
2023-06-25 03:48:51,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1899996.0, ans=15.0
2023-06-25 03:49:06,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1900056.0, ans=0.125
2023-06-25 03:49:12,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0
2023-06-25 03:49:14,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1900056.0, ans=0.0
2023-06-25 03:49:42,892 INFO [train.py:996] (3/4) Epoch 11, batch 11750, loss[loss=0.2179, simple_loss=0.2907, pruned_loss=0.07253, over 21745.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3139, pruned_loss=0.07913, over 4254156.08 frames. ], batch size: 282, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:50:19,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1900236.0, ans=0.2
2023-06-25 03:50:36,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 8.041e+02 1.029e+03 1.302e+03 3.025e+03, threshold=2.058e+03, percent-clipped=2.0
2023-06-25 03:51:00,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1900356.0, ans=0.125
2023-06-25 03:51:03,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1900356.0, ans=0.125
2023-06-25 03:51:31,734 INFO [train.py:996] (3/4) Epoch 11, batch 11800, loss[loss=0.2273, simple_loss=0.3302, pruned_loss=0.06214, over 21891.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3155, pruned_loss=0.0814, over 4261928.15 frames. ], batch size: 372, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:53:19,747 INFO [train.py:996] (3/4) Epoch 11, batch 11850, loss[loss=0.2401, simple_loss=0.321, pruned_loss=0.07957, over 21763.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3168, pruned_loss=0.08063, over 4271468.72 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:54:17,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 7.061e+02 9.969e+02 1.583e+03 3.889e+03, threshold=1.994e+03, percent-clipped=10.0
2023-06-25 03:54:32,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0
2023-06-25 03:54:45,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1901016.0, ans=0.0
2023-06-25 03:55:03,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1901016.0, ans=0.125
2023-06-25 03:55:15,576 INFO [train.py:996] (3/4) Epoch 11, batch 11900, loss[loss=0.2255, simple_loss=0.3228, pruned_loss=0.06413, over 21216.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3164, pruned_loss=0.07796, over 4265100.07 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:56:03,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1901196.0, ans=0.125
2023-06-25 03:56:40,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1901316.0, ans=0.125
2023-06-25 03:56:44,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1901316.0, ans=0.2
2023-06-25 03:56:57,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1901316.0, ans=0.125
2023-06-25 03:57:11,026 INFO [train.py:996] (3/4) Epoch 11, batch 11950, loss[loss=0.2349, simple_loss=0.331, pruned_loss=0.06941, over 21723.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3174, pruned_loss=0.07495, over 4265602.54 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 16.0
2023-06-25 03:57:53,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1901496.0, ans=0.0
2023-06-25 03:57:56,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.868e+02 8.366e+02 1.305e+03 1.891e+03 4.761e+03, threshold=2.610e+03, percent-clipped=24.0
2023-06-25 03:58:23,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1901556.0, ans=0.1
2023-06-25 03:58:24,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5
2023-06-25 03:58:42,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0
2023-06-25 03:58:53,115 INFO [train.py:996] (3/4) Epoch 11, batch 12000, loss[loss=0.2084, simple_loss=0.2756, pruned_loss=0.07055, over 21508.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3113, pruned_loss=0.0734, over 4261987.93 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 32.0
2023-06-25 03:58:53,116 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-25 03:59:08,724 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9058, 3.3535, 1.5735, 1.6770], device='cuda:3')
2023-06-25 03:59:11,383 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2587, simple_loss=0.3514, pruned_loss=0.08303, over 1796401.00 frames.
2023-06-25 03:59:11,384 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-25 03:59:20,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1901676.0, ans=0.125
2023-06-25 03:59:29,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1901676.0, ans=0.125
2023-06-25 03:59:44,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1901736.0, ans=0.125
2023-06-25 03:59:46,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0
2023-06-25 04:00:05,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1901796.0, ans=0.0
2023-06-25 04:00:50,843 INFO [train.py:996] (3/4) Epoch 11, batch 12050, loss[loss=0.2201, simple_loss=0.2857, pruned_loss=0.07729, over 21914.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3084, pruned_loss=0.07563, over 4271421.74 frames. ], batch size: 316, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 04:01:06,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1901976.0, ans=0.125
2023-06-25 04:01:44,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 7.721e+02 1.099e+03 1.708e+03 2.830e+03, threshold=2.199e+03, percent-clipped=2.0
2023-06-25 04:02:06,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.53 vs. limit=22.5
2023-06-25 04:02:29,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0
2023-06-25 04:02:41,830 INFO [train.py:996] (3/4) Epoch 11, batch 12100, loss[loss=0.3666, simple_loss=0.4102, pruned_loss=0.1615, over 21366.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3119, pruned_loss=0.07907, over 4271083.79 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 8.0
2023-06-25 04:02:42,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1902276.0, ans=0.125
2023-06-25 04:03:01,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1902336.0, ans=0.125
2023-06-25 04:03:05,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1902336.0, ans=0.125
2023-06-25 04:04:17,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1902516.0, ans=0.2
2023-06-25 04:04:25,625 INFO [train.py:996] (3/4) Epoch 11, batch 12150, loss[loss=0.196, simple_loss=0.3081, pruned_loss=0.04199, over 20831.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3155, pruned_loss=0.07814, over 4269613.95 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 8.0
], batch size: 607, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:04:51,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1902636.0, ans=0.2 2023-06-25 04:05:22,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1902696.0, ans=10.0 2023-06-25 04:05:23,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1902696.0, ans=0.125 2023-06-25 04:05:28,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.672e+02 1.025e+03 1.712e+03 2.364e+03 4.484e+03, threshold=3.424e+03, percent-clipped=30.0 2023-06-25 04:05:48,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1902756.0, ans=0.0 2023-06-25 04:06:08,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1902816.0, ans=0.125 2023-06-25 04:06:10,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1902816.0, ans=0.2 2023-06-25 04:06:12,839 INFO [train.py:996] (3/4) Epoch 11, batch 12200, loss[loss=0.2401, simple_loss=0.2919, pruned_loss=0.09414, over 21566.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3123, pruned_loss=0.07741, over 4262205.53 frames. ], batch size: 231, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:07:58,634 INFO [train.py:996] (3/4) Epoch 11, batch 12250, loss[loss=0.1821, simple_loss=0.2648, pruned_loss=0.04968, over 21654.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3042, pruned_loss=0.07457, over 4263755.65 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:08:43,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1903296.0, ans=0.0 2023-06-25 04:08:59,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.457e+02 7.371e+02 1.190e+03 1.577e+03 4.141e+03, threshold=2.380e+03, percent-clipped=2.0 2023-06-25 04:09:11,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1903356.0, ans=0.0 2023-06-25 04:09:11,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1903356.0, ans=0.07 2023-06-25 04:09:18,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903356.0, ans=0.125 2023-06-25 04:09:26,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1903416.0, ans=0.0 2023-06-25 04:09:30,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-25 04:09:34,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1903416.0, ans=0.125 2023-06-25 04:09:44,645 INFO [train.py:996] (3/4) Epoch 11, batch 12300, loss[loss=0.2402, simple_loss=0.3355, pruned_loss=0.0725, over 21751.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2968, pruned_loss=0.06911, over 4267655.70 frames. 
], batch size: 351, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:10:47,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1903596.0, ans=0.125 2023-06-25 04:11:01,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1903656.0, ans=15.0 2023-06-25 04:11:18,129 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:11:21,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1903716.0, ans=0.04949747468305833 2023-06-25 04:11:24,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1903716.0, ans=0.0 2023-06-25 04:11:30,622 INFO [train.py:996] (3/4) Epoch 11, batch 12350, loss[loss=0.2194, simple_loss=0.2977, pruned_loss=0.07056, over 21602.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3022, pruned_loss=0.0697, over 4272614.50 frames. ], batch size: 195, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:11:31,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1903776.0, ans=0.2 2023-06-25 04:11:52,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1903836.0, ans=22.5 2023-06-25 04:11:58,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903836.0, ans=0.125 2023-06-25 04:12:31,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.670e+02 1.217e+03 1.964e+03 4.834e+03, threshold=2.433e+03, percent-clipped=16.0 2023-06-25 04:13:08,113 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:13:16,309 INFO [train.py:996] (3/4) Epoch 11, batch 12400, loss[loss=0.2524, simple_loss=0.3196, pruned_loss=0.09264, over 21731.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3044, pruned_loss=0.07338, over 4280563.68 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:13:31,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1904076.0, ans=0.125 2023-06-25 04:14:12,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-25 04:14:33,407 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:14:49,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1904316.0, ans=0.1 2023-06-25 04:14:57,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 04:15:07,761 INFO [train.py:996] (3/4) Epoch 11, batch 12450, loss[loss=0.2818, simple_loss=0.3523, pruned_loss=0.1056, over 21343.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3082, pruned_loss=0.07662, over 4278821.27 frames. 
], batch size: 143, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:15:18,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1904376.0, ans=0.125 2023-06-25 04:15:33,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1904436.0, ans=0.125 2023-06-25 04:16:10,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 6.913e+02 8.483e+02 1.165e+03 2.704e+03, threshold=1.697e+03, percent-clipped=3.0 2023-06-25 04:16:23,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1904556.0, ans=0.1 2023-06-25 04:16:59,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1904616.0, ans=0.04949747468305833 2023-06-25 04:17:03,375 INFO [train.py:996] (3/4) Epoch 11, batch 12500, loss[loss=0.281, simple_loss=0.3683, pruned_loss=0.09685, over 21942.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3211, pruned_loss=0.08041, over 4277203.94 frames. ], batch size: 317, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:17:03,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904676.0, ans=0.125 2023-06-25 04:17:35,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1904736.0, ans=0.2 2023-06-25 04:18:30,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-25 04:18:52,651 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:19:02,301 INFO [train.py:996] (3/4) Epoch 11, batch 12550, loss[loss=0.227, simple_loss=0.3055, pruned_loss=0.0742, over 21336.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3245, pruned_loss=0.08204, over 4281330.67 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:19:30,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1905036.0, ans=0.125 2023-06-25 04:19:34,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1905036.0, ans=0.2 2023-06-25 04:19:46,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1905096.0, ans=0.0 2023-06-25 04:20:07,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.343e+02 7.503e+02 1.080e+03 1.641e+03 3.839e+03, threshold=2.159e+03, percent-clipped=20.0 2023-06-25 04:20:47,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1905216.0, ans=0.125 2023-06-25 04:20:52,627 INFO [train.py:996] (3/4) Epoch 11, batch 12600, loss[loss=0.2267, simple_loss=0.3061, pruned_loss=0.07372, over 21507.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.324, pruned_loss=0.08021, over 4274040.61 frames. 
], batch size: 195, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:20:58,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1905276.0, ans=0.2 2023-06-25 04:21:09,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1905276.0, ans=0.0 2023-06-25 04:21:27,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905336.0, ans=0.1 2023-06-25 04:22:33,088 INFO [train.py:996] (3/4) Epoch 11, batch 12650, loss[loss=0.2633, simple_loss=0.3211, pruned_loss=0.1027, over 21860.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3173, pruned_loss=0.07699, over 4274186.32 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:22:38,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1905576.0, ans=0.125 2023-06-25 04:22:52,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-25 04:22:55,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-25 04:23:37,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.454e+02 1.042e+03 1.689e+03 3.142e+03, threshold=2.085e+03, percent-clipped=12.0 2023-06-25 04:23:41,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-25 04:24:22,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1905816.0, ans=0.0 2023-06-25 04:24:28,093 INFO [train.py:996] (3/4) Epoch 11, batch 12700, loss[loss=0.263, simple_loss=0.3346, pruned_loss=0.09566, over 21467.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3154, pruned_loss=0.07922, over 4279940.30 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:26:13,869 INFO [train.py:996] (3/4) Epoch 11, batch 12750, loss[loss=0.2599, simple_loss=0.3254, pruned_loss=0.0972, over 21890.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3155, pruned_loss=0.07912, over 4278809.21 frames. ], batch size: 107, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:26:36,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1906236.0, ans=0.125 2023-06-25 04:26:38,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1906236.0, ans=0.125 2023-06-25 04:27:09,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.849e+02 1.051e+03 1.343e+03 1.949e+03 4.528e+03, threshold=2.685e+03, percent-clipped=20.0 2023-06-25 04:27:52,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-25 04:28:00,575 INFO [train.py:996] (3/4) Epoch 11, batch 12800, loss[loss=0.2441, simple_loss=0.3196, pruned_loss=0.08434, over 21676.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3158, pruned_loss=0.08021, over 4288948.59 frames. 
], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:28:08,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1906476.0, ans=0.125 2023-06-25 04:28:27,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-25 04:28:51,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1906596.0, ans=0.2 2023-06-25 04:29:15,649 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:29:39,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1906716.0, ans=0.125 2023-06-25 04:29:50,499 INFO [train.py:996] (3/4) Epoch 11, batch 12850, loss[loss=0.2086, simple_loss=0.3043, pruned_loss=0.05647, over 21853.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3167, pruned_loss=0.08059, over 4289206.74 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:30:26,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1906836.0, ans=0.125 2023-06-25 04:30:53,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.548e+02 7.415e+02 1.066e+03 1.369e+03 3.330e+03, threshold=2.132e+03, percent-clipped=6.0 2023-06-25 04:31:43,313 INFO [train.py:996] (3/4) Epoch 11, batch 12900, loss[loss=0.2572, simple_loss=0.3534, pruned_loss=0.08046, over 21140.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3148, pruned_loss=0.07724, over 4284004.62 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:31:52,206 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:32:09,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1907136.0, ans=0.0 2023-06-25 04:33:33,454 INFO [train.py:996] (3/4) Epoch 11, batch 12950, loss[loss=0.1923, simple_loss=0.2705, pruned_loss=0.057, over 21227.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3107, pruned_loss=0.07487, over 4280052.96 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:34:31,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 6.927e+02 1.132e+03 1.522e+03 3.743e+03, threshold=2.263e+03, percent-clipped=8.0 2023-06-25 04:35:20,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1907676.0, ans=0.04949747468305833 2023-06-25 04:35:21,568 INFO [train.py:996] (3/4) Epoch 11, batch 13000, loss[loss=0.1674, simple_loss=0.2508, pruned_loss=0.04197, over 21198.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3116, pruned_loss=0.07621, over 4279042.54 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:35:30,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1907676.0, ans=0.2 2023-06-25 04:36:10,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. 
limit=15.0 2023-06-25 04:36:13,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 04:37:07,499 INFO [train.py:996] (3/4) Epoch 11, batch 13050, loss[loss=0.2099, simple_loss=0.2784, pruned_loss=0.07071, over 21855.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3072, pruned_loss=0.07417, over 4275276.19 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:37:23,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-25 04:37:26,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1908036.0, ans=0.0 2023-06-25 04:38:00,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1908096.0, ans=0.04949747468305833 2023-06-25 04:38:05,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.302e+02 9.567e+02 1.329e+03 2.389e+03, threshold=1.913e+03, percent-clipped=1.0 2023-06-25 04:38:52,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-25 04:38:55,888 INFO [train.py:996] (3/4) Epoch 11, batch 13100, loss[loss=0.32, simple_loss=0.4293, pruned_loss=0.1053, over 19773.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3103, pruned_loss=0.07431, over 4274227.68 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:39:49,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1908396.0, ans=0.025 2023-06-25 04:40:07,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1908456.0, ans=0.0 2023-06-25 04:40:30,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1908516.0, ans=0.0 2023-06-25 04:40:45,493 INFO [train.py:996] (3/4) Epoch 11, batch 13150, loss[loss=0.2558, simple_loss=0.4112, pruned_loss=0.05018, over 19709.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3154, pruned_loss=0.07762, over 4266055.28 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:40:55,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1908576.0, ans=0.125 2023-06-25 04:41:02,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1908576.0, ans=0.125 2023-06-25 04:41:04,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-25 04:41:05,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1908576.0, ans=0.125 2023-06-25 04:41:27,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=12.0 2023-06-25 04:41:38,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1908696.0, ans=0.1 2023-06-25 04:41:55,003 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.569e+02 1.234e+03 1.722e+03 3.917e+03, threshold=2.467e+03, percent-clipped=21.0 2023-06-25 04:42:13,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908756.0, ans=0.1 2023-06-25 04:42:14,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1908756.0, ans=0.2 2023-06-25 04:42:18,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1908816.0, ans=0.125 2023-06-25 04:42:46,104 INFO [train.py:996] (3/4) Epoch 11, batch 13200, loss[loss=0.285, simple_loss=0.3485, pruned_loss=0.1107, over 21569.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3134, pruned_loss=0.0777, over 4263997.21 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:43:04,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1908936.0, ans=0.0 2023-06-25 04:43:19,393 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:43:36,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1908996.0, ans=0.025 2023-06-25 04:43:38,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1908996.0, ans=0.0 2023-06-25 04:43:45,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1908996.0, ans=0.1 2023-06-25 04:44:30,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.85 vs. limit=10.0 2023-06-25 04:44:34,058 INFO [train.py:996] (3/4) Epoch 11, batch 13250, loss[loss=0.2277, simple_loss=0.31, pruned_loss=0.07268, over 21654.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3141, pruned_loss=0.08043, over 4264167.41 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:45:19,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909296.0, ans=0.1 2023-06-25 04:45:32,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 1.027e+03 1.488e+03 2.200e+03 4.599e+03, threshold=2.975e+03, percent-clipped=16.0 2023-06-25 04:46:21,094 INFO [train.py:996] (3/4) Epoch 11, batch 13300, loss[loss=0.2415, simple_loss=0.3241, pruned_loss=0.07939, over 21594.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3157, pruned_loss=0.07959, over 4267026.56 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:46:43,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-25 04:48:02,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. 
limit=6.0 2023-06-25 04:48:09,240 INFO [train.py:996] (3/4) Epoch 11, batch 13350, loss[loss=0.3057, simple_loss=0.3802, pruned_loss=0.1156, over 21743.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3198, pruned_loss=0.08242, over 4272019.52 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:48:11,495 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:48:56,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 04:49:08,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.321e+02 8.273e+02 1.155e+03 1.760e+03 3.459e+03, threshold=2.310e+03, percent-clipped=3.0 2023-06-25 04:49:10,420 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:49:35,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1910016.0, ans=0.125 2023-06-25 04:49:38,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1910016.0, ans=0.2 2023-06-25 04:49:52,101 INFO [train.py:996] (3/4) Epoch 11, batch 13400, loss[loss=0.2524, simple_loss=0.3209, pruned_loss=0.09195, over 21825.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3198, pruned_loss=0.08339, over 4279801.00 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:50:10,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1910076.0, ans=0.0 2023-06-25 04:50:25,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1910136.0, ans=0.0 2023-06-25 04:50:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1910196.0, ans=0.0 2023-06-25 04:51:05,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1910256.0, ans=0.125 2023-06-25 04:51:14,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1910316.0, ans=0.0 2023-06-25 04:51:16,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1910316.0, ans=0.125 2023-06-25 04:51:39,476 INFO [train.py:996] (3/4) Epoch 11, batch 13450, loss[loss=0.2334, simple_loss=0.2977, pruned_loss=0.08452, over 21722.00 frames. ], tot_loss[loss=0.246, simple_loss=0.322, pruned_loss=0.085, over 4281579.53 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:52:27,515 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:52:36,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 8.174e+02 1.187e+03 1.780e+03 3.541e+03, threshold=2.373e+03, percent-clipped=13.0 2023-06-25 04:52:44,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1910556.0, ans=0.125 2023-06-25 04:52:52,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. 
limit=15.0 2023-06-25 04:53:16,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1910616.0, ans=0.025 2023-06-25 04:53:23,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1910616.0, ans=0.125 2023-06-25 04:53:26,282 INFO [train.py:996] (3/4) Epoch 11, batch 13500, loss[loss=0.2044, simple_loss=0.2666, pruned_loss=0.07108, over 21596.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3123, pruned_loss=0.08166, over 4272873.90 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:53:30,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1910676.0, ans=0.0 2023-06-25 04:53:40,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1910676.0, ans=0.0 2023-06-25 04:54:29,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1910856.0, ans=0.0 2023-06-25 04:55:12,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1910976.0, ans=0.125 2023-06-25 04:55:13,726 INFO [train.py:996] (3/4) Epoch 11, batch 13550, loss[loss=0.2405, simple_loss=0.3233, pruned_loss=0.07891, over 21419.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3173, pruned_loss=0.08186, over 4264558.62 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:55:51,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911036.0, ans=0.125 2023-06-25 04:56:11,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.777e+02 1.227e+03 1.710e+03 3.921e+03, threshold=2.454e+03, percent-clipped=8.0 2023-06-25 04:56:40,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911156.0, ans=0.1 2023-06-25 04:56:54,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911216.0, ans=0.1 2023-06-25 04:56:54,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911216.0, ans=0.125 2023-06-25 04:56:55,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1911216.0, ans=0.0 2023-06-25 04:57:01,035 INFO [train.py:996] (3/4) Epoch 11, batch 13600, loss[loss=0.2473, simple_loss=0.3209, pruned_loss=0.08683, over 21834.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3171, pruned_loss=0.08214, over 4276067.36 frames. 
], batch size: 124, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:57:35,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1911336.0, ans=0.0 2023-06-25 04:57:37,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1911336.0, ans=0.125 2023-06-25 04:58:05,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1911456.0, ans=0.125 2023-06-25 04:58:15,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1911456.0, ans=0.2 2023-06-25 04:58:42,926 INFO [train.py:996] (3/4) Epoch 11, batch 13650, loss[loss=0.2054, simple_loss=0.2712, pruned_loss=0.06975, over 21662.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3126, pruned_loss=0.07973, over 4273909.49 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:59:05,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1911576.0, ans=0.05 2023-06-25 04:59:48,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.510e+02 1.024e+03 1.563e+03 2.533e+03, threshold=2.048e+03, percent-clipped=2.0 2023-06-25 05:00:35,830 INFO [train.py:996] (3/4) Epoch 11, batch 13700, loss[loss=0.3529, simple_loss=0.4106, pruned_loss=0.1476, over 21507.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3118, pruned_loss=0.07952, over 4264127.32 frames. ], batch size: 508, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:01:09,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1911936.0, ans=0.125 2023-06-25 05:01:16,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1911996.0, ans=0.125 2023-06-25 05:01:44,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1912056.0, ans=0.09899494936611666 2023-06-25 05:01:56,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-25 05:02:30,024 INFO [train.py:996] (3/4) Epoch 11, batch 13750, loss[loss=0.2615, simple_loss=0.3524, pruned_loss=0.08529, over 21166.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3084, pruned_loss=0.07808, over 4263978.08 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:02:31,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.09 vs. 
limit=15.0 2023-06-25 05:02:48,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1912236.0, ans=0.125 2023-06-25 05:03:33,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 9.228e+02 1.294e+03 2.214e+03 4.699e+03, threshold=2.588e+03, percent-clipped=28.0 2023-06-25 05:03:54,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1912356.0, ans=0.125 2023-06-25 05:04:04,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1912416.0, ans=0.0 2023-06-25 05:04:14,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1912416.0, ans=0.125 2023-06-25 05:04:15,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=12.0 2023-06-25 05:04:20,964 INFO [train.py:996] (3/4) Epoch 11, batch 13800, loss[loss=0.2547, simple_loss=0.3565, pruned_loss=0.0764, over 21704.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3103, pruned_loss=0.07629, over 4257239.17 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:05:02,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1912536.0, ans=0.0 2023-06-25 05:05:33,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1912656.0, ans=0.0 2023-06-25 05:05:47,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1912656.0, ans=0.0 2023-06-25 05:05:49,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1912716.0, ans=0.125 2023-06-25 05:05:56,472 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=22.5 2023-06-25 05:06:01,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=12.0 2023-06-25 05:06:07,151 INFO [train.py:996] (3/4) Epoch 11, batch 13850, loss[loss=0.2644, simple_loss=0.3483, pruned_loss=0.0903, over 21348.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3188, pruned_loss=0.07732, over 4257006.97 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:06:12,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1912776.0, ans=0.0 2023-06-25 05:06:32,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1912776.0, ans=0.0 2023-06-25 05:06:48,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1912836.0, ans=0.0 2023-06-25 05:07:06,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. 
limit=15.0 2023-06-25 05:07:13,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.739e+02 1.067e+03 1.553e+03 4.213e+03, threshold=2.133e+03, percent-clipped=6.0 2023-06-25 05:07:13,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1912956.0, ans=0.125 2023-06-25 05:07:52,240 INFO [train.py:996] (3/4) Epoch 11, batch 13900, loss[loss=0.254, simple_loss=0.3258, pruned_loss=0.09103, over 21449.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3227, pruned_loss=0.08102, over 4259032.84 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:08:04,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1913076.0, ans=0.2 2023-06-25 05:08:36,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1913196.0, ans=0.125 2023-06-25 05:08:48,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1913196.0, ans=0.0 2023-06-25 05:09:13,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1913256.0, ans=0.125 2023-06-25 05:09:28,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1913316.0, ans=0.125 2023-06-25 05:09:28,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 05:09:32,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 05:09:45,234 INFO [train.py:996] (3/4) Epoch 11, batch 13950, loss[loss=0.2742, simple_loss=0.3557, pruned_loss=0.09639, over 21882.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3219, pruned_loss=0.08316, over 4270832.99 frames. ], batch size: 107, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:09:57,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1913376.0, ans=0.0 2023-06-25 05:10:09,011 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=15.0 2023-06-25 05:10:29,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1913496.0, ans=0.125 2023-06-25 05:10:36,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. 
limit=6.0 2023-06-25 05:10:49,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1913556.0, ans=0.125 2023-06-25 05:10:49,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1913556.0, ans=0.2 2023-06-25 05:10:50,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.156e+02 8.731e+02 1.156e+03 1.746e+03 2.860e+03, threshold=2.312e+03, percent-clipped=13.0 2023-06-25 05:10:54,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1913556.0, ans=0.125 2023-06-25 05:11:29,094 INFO [train.py:996] (3/4) Epoch 11, batch 14000, loss[loss=0.203, simple_loss=0.3046, pruned_loss=0.05072, over 21793.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3186, pruned_loss=0.08144, over 4270514.63 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:11:36,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1913676.0, ans=0.125 2023-06-25 05:11:52,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1913736.0, ans=0.1 2023-06-25 05:12:27,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1913796.0, ans=0.025 2023-06-25 05:13:16,590 INFO [train.py:996] (3/4) Epoch 11, batch 14050, loss[loss=0.2297, simple_loss=0.2963, pruned_loss=0.08159, over 21155.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3123, pruned_loss=0.07732, over 4269112.57 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:13:26,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1913976.0, ans=0.2 2023-06-25 05:14:24,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 7.745e+02 1.137e+03 1.921e+03 3.840e+03, threshold=2.273e+03, percent-clipped=15.0 2023-06-25 05:14:52,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1914216.0, ans=0.0 2023-06-25 05:15:04,507 INFO [train.py:996] (3/4) Epoch 11, batch 14100, loss[loss=0.251, simple_loss=0.3096, pruned_loss=0.0962, over 21233.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3061, pruned_loss=0.0772, over 4262579.80 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:16:42,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1914516.0, ans=0.2 2023-06-25 05:16:46,910 INFO [train.py:996] (3/4) Epoch 11, batch 14150, loss[loss=0.2197, simple_loss=0.3038, pruned_loss=0.06784, over 15625.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3097, pruned_loss=0.07817, over 4261265.63 frames. ], batch size: 60, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:17:09,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.15 vs. 
limit=12.0 2023-06-25 05:17:23,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1914636.0, ans=0.2 2023-06-25 05:17:30,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1914696.0, ans=0.125 2023-06-25 05:17:36,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-25 05:17:50,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1914756.0, ans=0.2 2023-06-25 05:17:51,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 7.327e+02 9.497e+02 1.308e+03 3.394e+03, threshold=1.899e+03, percent-clipped=3.0 2023-06-25 05:18:29,894 INFO [train.py:996] (3/4) Epoch 11, batch 14200, loss[loss=0.2093, simple_loss=0.2913, pruned_loss=0.06361, over 21845.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3078, pruned_loss=0.07702, over 4261519.30 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:19:02,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1914936.0, ans=0.0 2023-06-25 05:19:38,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1915056.0, ans=0.125 2023-06-25 05:19:39,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-25 05:19:50,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915116.0, ans=0.1 2023-06-25 05:20:09,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1915116.0, ans=0.125 2023-06-25 05:20:09,939 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:20:14,308 INFO [train.py:996] (3/4) Epoch 11, batch 14250, loss[loss=0.1984, simple_loss=0.2624, pruned_loss=0.06718, over 21364.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3034, pruned_loss=0.07713, over 4271435.72 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:20:27,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1915176.0, ans=0.0 2023-06-25 05:20:59,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1915236.0, ans=0.125 2023-06-25 05:21:24,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.052e+02 9.633e+02 1.519e+03 2.693e+03, threshold=1.927e+03, percent-clipped=14.0 2023-06-25 05:22:03,005 INFO [train.py:996] (3/4) Epoch 11, batch 14300, loss[loss=0.2359, simple_loss=0.3038, pruned_loss=0.08401, over 15847.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3065, pruned_loss=0.07724, over 4256026.63 frames. 
], batch size: 66, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:22:31,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1915536.0, ans=0.0 2023-06-25 05:22:32,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-25 05:22:35,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1915536.0, ans=0.1 2023-06-25 05:22:37,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-25 05:23:02,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1915596.0, ans=0.2 2023-06-25 05:23:02,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1915596.0, ans=0.125 2023-06-25 05:23:02,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1915596.0, ans=0.0 2023-06-25 05:23:49,069 INFO [train.py:996] (3/4) Epoch 11, batch 14350, loss[loss=0.2612, simple_loss=0.3473, pruned_loss=0.08754, over 21614.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3122, pruned_loss=0.07789, over 4260127.32 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:23:51,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1915776.0, ans=0.125 2023-06-25 05:24:38,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915896.0, ans=0.1 2023-06-25 05:24:56,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.843e+02 8.086e+02 1.263e+03 2.324e+03 6.942e+03, threshold=2.526e+03, percent-clipped=29.0 2023-06-25 05:25:15,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1915956.0, ans=0.0 2023-06-25 05:25:18,444 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:25:31,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1916016.0, ans=0.0 2023-06-25 05:25:31,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1916016.0, ans=0.125 2023-06-25 05:25:33,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1916076.0, ans=0.125 2023-06-25 05:25:34,852 INFO [train.py:996] (3/4) Epoch 11, batch 14400, loss[loss=0.2186, simple_loss=0.2855, pruned_loss=0.07584, over 21475.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3114, pruned_loss=0.07891, over 4260896.64 frames. 
], batch size: 548, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 05:25:53,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1916076.0, ans=0.125 2023-06-25 05:27:12,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1916316.0, ans=0.0 2023-06-25 05:27:15,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1916316.0, ans=0.0 2023-06-25 05:27:29,121 INFO [train.py:996] (3/4) Epoch 11, batch 14450, loss[loss=0.1866, simple_loss=0.2576, pruned_loss=0.05782, over 21710.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3057, pruned_loss=0.07868, over 4266919.74 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:27:51,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1916436.0, ans=0.2 2023-06-25 05:27:53,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. limit=10.0 2023-06-25 05:28:30,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 8.129e+02 1.231e+03 1.648e+03 3.274e+03, threshold=2.462e+03, percent-clipped=7.0 2023-06-25 05:28:56,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1916616.0, ans=0.0 2023-06-25 05:29:07,843 INFO [train.py:996] (3/4) Epoch 11, batch 14500, loss[loss=0.2283, simple_loss=0.3139, pruned_loss=0.0714, over 21780.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3008, pruned_loss=0.07782, over 4275727.45 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:29:23,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-25 05:30:00,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1916796.0, ans=0.125 2023-06-25 05:31:01,905 INFO [train.py:996] (3/4) Epoch 11, batch 14550, loss[loss=0.291, simple_loss=0.3538, pruned_loss=0.1141, over 21418.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3066, pruned_loss=0.07915, over 4274651.81 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:31:24,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-25 05:31:46,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-25 05:31:53,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1917096.0, ans=0.125 2023-06-25 05:32:13,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.818e+02 8.368e+02 1.257e+03 1.782e+03 3.337e+03, threshold=2.514e+03, percent-clipped=4.0 2023-06-25 05:32:39,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-25 05:32:56,086 INFO [train.py:996] (3/4) Epoch 11, batch 14600, loss[loss=0.2081, simple_loss=0.2661, pruned_loss=0.07509, over 20951.00 frames. 
], tot_loss[loss=0.2386, simple_loss=0.3127, pruned_loss=0.08226, over 4273737.30 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:32:58,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1917276.0, ans=0.125 2023-06-25 05:32:58,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1917276.0, ans=0.125 2023-06-25 05:33:02,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-25 05:33:57,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1917456.0, ans=0.09899494936611666 2023-06-25 05:34:06,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1917456.0, ans=0.125 2023-06-25 05:34:39,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1917516.0, ans=0.125 2023-06-25 05:34:44,120 INFO [train.py:996] (3/4) Epoch 11, batch 14650, loss[loss=0.2897, simple_loss=0.3568, pruned_loss=0.1113, over 21367.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3136, pruned_loss=0.08105, over 4274985.93 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:35:02,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-25 05:35:21,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1917696.0, ans=0.1 2023-06-25 05:35:50,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 8.792e+02 1.262e+03 1.854e+03 3.152e+03, threshold=2.525e+03, percent-clipped=6.0 2023-06-25 05:35:52,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1917756.0, ans=0.0 2023-06-25 05:35:58,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1917756.0, ans=0.125 2023-06-25 05:36:12,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1917816.0, ans=0.125 2023-06-25 05:36:32,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1917876.0, ans=0.125 2023-06-25 05:36:33,368 INFO [train.py:996] (3/4) Epoch 11, batch 14700, loss[loss=0.2382, simple_loss=0.3374, pruned_loss=0.06951, over 21714.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3076, pruned_loss=0.07558, over 4269419.71 frames. 
], batch size: 351, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:36:36,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1917876.0, ans=0.0 2023-06-25 05:36:56,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1917936.0, ans=0.125 2023-06-25 05:37:23,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1917996.0, ans=0.125 2023-06-25 05:37:23,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1917996.0, ans=0.0 2023-06-25 05:37:35,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-25 05:37:56,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1918056.0, ans=0.2 2023-06-25 05:38:07,360 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:38:07,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-25 05:38:22,502 INFO [train.py:996] (3/4) Epoch 11, batch 14750, loss[loss=0.3011, simple_loss=0.3689, pruned_loss=0.1167, over 21624.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.312, pruned_loss=0.07848, over 4254338.53 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:38:44,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1918236.0, ans=0.0 2023-06-25 05:39:30,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918296.0, ans=0.1 2023-06-25 05:39:38,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1918356.0, ans=0.125 2023-06-25 05:39:42,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.909e+02 1.142e+03 1.631e+03 3.263e+03, threshold=2.283e+03, percent-clipped=2.0 2023-06-25 05:40:10,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.38 vs. limit=15.0 2023-06-25 05:40:16,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1918476.0, ans=0.07 2023-06-25 05:40:17,890 INFO [train.py:996] (3/4) Epoch 11, batch 14800, loss[loss=0.2513, simple_loss=0.3235, pruned_loss=0.08951, over 21572.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3232, pruned_loss=0.08357, over 4254496.31 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:40:37,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1918476.0, ans=0.125 2023-06-25 05:40:40,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.91 vs. 
limit=15.0 2023-06-25 05:41:26,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1918656.0, ans=0.2 2023-06-25 05:41:31,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1918656.0, ans=0.2 2023-06-25 05:42:13,907 INFO [train.py:996] (3/4) Epoch 11, batch 14850, loss[loss=0.2042, simple_loss=0.2699, pruned_loss=0.06926, over 21536.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3169, pruned_loss=0.08294, over 4257073.77 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:43:25,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 9.409e+02 1.250e+03 2.186e+03 4.588e+03, threshold=2.500e+03, percent-clipped=20.0 2023-06-25 05:44:03,306 INFO [train.py:996] (3/4) Epoch 11, batch 14900, loss[loss=0.3024, simple_loss=0.367, pruned_loss=0.1189, over 21470.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3209, pruned_loss=0.08487, over 4264983.31 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:44:10,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1919076.0, ans=0.1 2023-06-25 05:44:27,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1919136.0, ans=0.0 2023-06-25 05:44:36,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1919136.0, ans=0.0 2023-06-25 05:44:36,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1919136.0, ans=22.5 2023-06-25 05:44:41,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1919136.0, ans=0.125 2023-06-25 05:45:10,195 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:45:37,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1919316.0, ans=0.125 2023-06-25 05:45:44,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1919316.0, ans=0.015 2023-06-25 05:45:50,687 INFO [train.py:996] (3/4) Epoch 11, batch 14950, loss[loss=0.2072, simple_loss=0.299, pruned_loss=0.05773, over 21865.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3227, pruned_loss=0.08482, over 4261339.09 frames. ], batch size: 372, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:46:31,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-25 05:46:58,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1919556.0, ans=0.125 2023-06-25 05:47:03,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.346e+02 1.154e+03 1.605e+03 2.804e+03, threshold=2.309e+03, percent-clipped=2.0 2023-06-25 05:47:29,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1919616.0, ans=0.125 2023-06-25 05:47:39,726 INFO [train.py:996] (3/4) Epoch 11, batch 15000, loss[loss=0.2407, simple_loss=0.3018, pruned_loss=0.08981, over 21804.00 frames. 
], tot_loss[loss=0.2493, simple_loss=0.3246, pruned_loss=0.08697, over 4259077.50 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:47:39,726 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 05:48:02,330 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2537, simple_loss=0.3474, pruned_loss=0.08002, over 1796401.00 frames. 2023-06-25 05:48:02,331 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 05:48:02,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1919676.0, ans=0.1 2023-06-25 05:49:33,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1919916.0, ans=0.0 2023-06-25 05:49:50,628 INFO [train.py:996] (3/4) Epoch 11, batch 15050, loss[loss=0.3029, simple_loss=0.3952, pruned_loss=0.1053, over 21261.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.326, pruned_loss=0.08753, over 4266504.27 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:50:21,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1920036.0, ans=0.0 2023-06-25 05:50:50,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-25 05:50:57,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.512e+02 8.828e+02 1.154e+03 1.761e+03 2.876e+03, threshold=2.308e+03, percent-clipped=7.0 2023-06-25 05:51:39,390 INFO [train.py:996] (3/4) Epoch 11, batch 15100, loss[loss=0.2448, simple_loss=0.326, pruned_loss=0.08178, over 21263.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3251, pruned_loss=0.08621, over 4259963.57 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:51:45,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1920276.0, ans=0.0 2023-06-25 05:51:48,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1920276.0, ans=0.1 2023-06-25 05:52:03,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 05:52:18,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1920336.0, ans=0.125 2023-06-25 05:53:18,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-25 05:53:26,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1920516.0, ans=0.0 2023-06-25 05:53:29,024 INFO [train.py:996] (3/4) Epoch 11, batch 15150, loss[loss=0.1994, simple_loss=0.2691, pruned_loss=0.06484, over 21742.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3225, pruned_loss=0.0866, over 4252297.16 frames. 
], batch size: 334, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:53:39,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1920576.0, ans=0.0 2023-06-25 05:54:31,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1920756.0, ans=0.0 2023-06-25 05:54:37,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.77 vs. limit=5.0 2023-06-25 05:54:44,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.134e+02 1.396e+03 2.248e+03 4.445e+03, threshold=2.791e+03, percent-clipped=24.0 2023-06-25 05:55:18,866 INFO [train.py:996] (3/4) Epoch 11, batch 15200, loss[loss=0.2427, simple_loss=0.345, pruned_loss=0.07015, over 20786.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3132, pruned_loss=0.08227, over 4261856.60 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:55:40,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1920936.0, ans=0.125 2023-06-25 05:57:06,644 INFO [train.py:996] (3/4) Epoch 11, batch 15250, loss[loss=0.2821, simple_loss=0.3284, pruned_loss=0.1179, over 21231.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3073, pruned_loss=0.08108, over 4252389.52 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:57:33,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-25 05:57:56,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1921296.0, ans=0.0 2023-06-25 05:58:19,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.800e+02 7.712e+02 1.026e+03 1.486e+03 3.458e+03, threshold=2.053e+03, percent-clipped=2.0 2023-06-25 05:58:45,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1921416.0, ans=0.2 2023-06-25 05:58:53,092 INFO [train.py:996] (3/4) Epoch 11, batch 15300, loss[loss=0.2884, simple_loss=0.3397, pruned_loss=0.1185, over 21301.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.308, pruned_loss=0.08277, over 4259136.75 frames. ], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:59:23,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1921536.0, ans=0.0 2023-06-25 06:00:03,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1921656.0, ans=0.0 2023-06-25 06:00:48,342 INFO [train.py:996] (3/4) Epoch 11, batch 15350, loss[loss=0.2434, simple_loss=0.3439, pruned_loss=0.07146, over 21634.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3147, pruned_loss=0.08579, over 4264195.04 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:01:37,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.87 vs. 
limit=15.0 2023-06-25 06:01:53,579 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.974e+02 1.016e+03 1.491e+03 3.012e+03, threshold=2.032e+03, percent-clipped=10.0 2023-06-25 06:02:16,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922016.0, ans=0.1 2023-06-25 06:02:27,010 INFO [train.py:996] (3/4) Epoch 11, batch 15400, loss[loss=0.2445, simple_loss=0.3188, pruned_loss=0.08508, over 21833.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3155, pruned_loss=0.08434, over 4269383.50 frames. ], batch size: 414, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:02:27,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1922076.0, ans=0.09899494936611666 2023-06-25 06:02:36,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1922076.0, ans=0.125 2023-06-25 06:02:40,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1922076.0, ans=0.2 2023-06-25 06:02:41,725 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:02:57,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1922136.0, ans=0.95 2023-06-25 06:03:14,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1922196.0, ans=0.125 2023-06-25 06:03:26,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1922256.0, ans=0.0 2023-06-25 06:03:37,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1922256.0, ans=0.0 2023-06-25 06:04:11,467 INFO [train.py:996] (3/4) Epoch 11, batch 15450, loss[loss=0.2337, simple_loss=0.3291, pruned_loss=0.06912, over 21860.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3125, pruned_loss=0.08304, over 4277602.02 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:05:15,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1922496.0, ans=0.0 2023-06-25 06:05:15,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922496.0, ans=0.1 2023-06-25 06:05:23,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1922556.0, ans=0.125 2023-06-25 06:05:25,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 7.354e+02 9.513e+02 1.338e+03 2.588e+03, threshold=1.903e+03, percent-clipped=5.0 2023-06-25 06:05:43,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1922616.0, ans=0.0 2023-06-25 06:06:04,768 INFO [train.py:996] (3/4) Epoch 11, batch 15500, loss[loss=0.2437, simple_loss=0.3176, pruned_loss=0.08491, over 21107.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3162, pruned_loss=0.0823, over 4269358.21 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:06:05,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=10.0 2023-06-25 06:06:15,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1922676.0, ans=0.125 2023-06-25 06:07:28,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1922916.0, ans=0.05 2023-06-25 06:07:54,177 INFO [train.py:996] (3/4) Epoch 11, batch 15550, loss[loss=0.207, simple_loss=0.303, pruned_loss=0.05545, over 21789.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3139, pruned_loss=0.07958, over 4255327.93 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:09:07,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.113e+02 7.965e+02 1.145e+03 1.833e+03 5.244e+03, threshold=2.290e+03, percent-clipped=21.0 2023-06-25 06:09:09,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1923156.0, ans=0.125 2023-06-25 06:09:18,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:09:18,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1923216.0, ans=0.04949747468305833 2023-06-25 06:09:38,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1923216.0, ans=0.125 2023-06-25 06:09:41,624 INFO [train.py:996] (3/4) Epoch 11, batch 15600, loss[loss=0.212, simple_loss=0.2909, pruned_loss=0.06653, over 21612.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3075, pruned_loss=0.07795, over 4254642.00 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:10:04,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1923336.0, ans=0.125 2023-06-25 06:10:15,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1923336.0, ans=0.0 2023-06-25 06:10:16,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1923336.0, ans=0.2 2023-06-25 06:11:33,794 INFO [train.py:996] (3/4) Epoch 11, batch 15650, loss[loss=0.2665, simple_loss=0.3105, pruned_loss=0.1112, over 21345.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3075, pruned_loss=0.07827, over 4255448.80 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:12:15,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1923696.0, ans=0.0 2023-06-25 06:12:35,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-25 06:12:43,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.231e+02 1.048e+03 1.538e+03 3.677e+03, threshold=2.096e+03, percent-clipped=8.0 2023-06-25 06:12:45,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1923756.0, ans=0.2 2023-06-25 06:13:23,062 INFO [train.py:996] (3/4) Epoch 11, batch 15700, loss[loss=0.246, simple_loss=0.3076, pruned_loss=0.09222, over 21445.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.304, pruned_loss=0.07719, over 4247175.80 frames. 
], batch size: 441, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:14:22,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1924056.0, ans=0.0 2023-06-25 06:15:08,059 INFO [train.py:996] (3/4) Epoch 11, batch 15750, loss[loss=0.2178, simple_loss=0.2819, pruned_loss=0.07681, over 21783.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3003, pruned_loss=0.0772, over 4249669.53 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:15:08,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1924176.0, ans=0.1 2023-06-25 06:15:08,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1924176.0, ans=0.125 2023-06-25 06:16:14,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1924356.0, ans=0.125 2023-06-25 06:16:18,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1924356.0, ans=0.0 2023-06-25 06:16:19,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.466e+02 7.452e+02 1.136e+03 1.633e+03 2.643e+03, threshold=2.272e+03, percent-clipped=11.0 2023-06-25 06:16:55,691 INFO [train.py:996] (3/4) Epoch 11, batch 15800, loss[loss=0.2113, simple_loss=0.275, pruned_loss=0.0738, over 21615.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2957, pruned_loss=0.07641, over 4260211.99 frames. ], batch size: 332, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:17:03,004 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:17:34,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1924536.0, ans=0.2 2023-06-25 06:18:37,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1924716.0, ans=0.125 2023-06-25 06:18:45,278 INFO [train.py:996] (3/4) Epoch 11, batch 15850, loss[loss=0.2689, simple_loss=0.3305, pruned_loss=0.1036, over 21703.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2983, pruned_loss=0.07886, over 4262923.94 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:19:44,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1924896.0, ans=0.125 2023-06-25 06:19:54,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1924956.0, ans=0.0 2023-06-25 06:19:57,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 6.760e+02 9.766e+02 1.376e+03 2.542e+03, threshold=1.953e+03, percent-clipped=1.0 2023-06-25 06:20:17,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925016.0, ans=0.1 2023-06-25 06:20:18,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1925016.0, ans=0.0 2023-06-25 06:20:34,246 INFO [train.py:996] (3/4) Epoch 11, batch 15900, loss[loss=0.1933, simple_loss=0.2623, pruned_loss=0.06218, over 21521.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2953, pruned_loss=0.07884, over 4262091.17 frames. 
], batch size: 230, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:21:33,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-25 06:21:35,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925256.0, ans=0.1 2023-06-25 06:21:37,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-25 06:21:49,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-25 06:21:59,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925316.0, ans=0.1 2023-06-25 06:22:22,452 INFO [train.py:996] (3/4) Epoch 11, batch 15950, loss[loss=0.2441, simple_loss=0.3218, pruned_loss=0.08322, over 21376.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2955, pruned_loss=0.07729, over 4249485.66 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:22:40,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1925436.0, ans=0.035 2023-06-25 06:22:43,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1925436.0, ans=0.125 2023-06-25 06:23:27,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1925556.0, ans=0.0 2023-06-25 06:23:31,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1925556.0, ans=0.025 2023-06-25 06:23:33,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925556.0, ans=0.1 2023-06-25 06:23:35,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.240e+02 8.208e+02 1.106e+03 1.560e+03 3.108e+03, threshold=2.211e+03, percent-clipped=12.0 2023-06-25 06:23:40,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925556.0, ans=0.1 2023-06-25 06:24:08,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1925616.0, ans=0.0 2023-06-25 06:24:12,223 INFO [train.py:996] (3/4) Epoch 11, batch 16000, loss[loss=0.2116, simple_loss=0.3066, pruned_loss=0.05833, over 21757.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2962, pruned_loss=0.074, over 4254812.18 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:24:12,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1925676.0, ans=0.125 2023-06-25 06:24:19,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1925676.0, ans=0.0 2023-06-25 06:25:03,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. 
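limit=15.0

Most of the scaling.py:182 lines in this log print a ScheduledFloat: a scalar hyperparameter (a dropout p, a skip rate, a balancer probability, a whitening limit) whose current value ans is a function of batch_count. The sketch below assumes a piecewise-linear schedule between (batch_count, value) breakpoints, clamped outside the breakpoint range; that assumption is consistent with the ans values here holding steady, since by batch_count around 1.9e6 every such schedule would long since have reached its final breakpoint. The breakpoints in the example are made up for illustration.

from bisect import bisect_right

# Sketch of a ScheduledFloat-style schedule (an illustration, not
# icefall's scaling.py): piecewise-linear in batch_count, clamped to the
# first/last breakpoint values outside the given range.
class ScheduledFloatSketch:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count):
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)  # first breakpoint past batch_count
        t = (batch_count - self.xs[i - 1]) / (self.xs[i] - self.xs[i - 1])
        return self.ys[i - 1] + t * (self.ys[i] - self.ys[i - 1])

# Hypothetical dropout decaying from 0.3 to 0.1 over the first 20k batches:
# at the batch counts logged in this section it is pinned at its final value.
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
assert dropout_p(1925976.0) == 0.1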
2023-06-25 06:25:05,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-25 06:25:35,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1925916.0, ans=0.125 2023-06-25 06:25:58,377 INFO [train.py:996] (3/4) Epoch 11, batch 16050, loss[loss=0.2149, simple_loss=0.3093, pruned_loss=0.06023, over 21713.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3009, pruned_loss=0.07256, over 4264990.47 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:25:58,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1925976.0, ans=0.0 2023-06-25 06:26:04,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-25 06:26:13,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1926036.0, ans=0.125 2023-06-25 06:26:58,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1926156.0, ans=0.2 2023-06-25 06:27:00,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926156.0, ans=0.1 2023-06-25 06:27:06,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.209e+02 1.010e+03 1.605e+03 2.461e+03 5.413e+03, threshold=3.210e+03, percent-clipped=30.0 2023-06-25 06:27:14,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-25 06:27:35,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1926276.0, ans=0.2 2023-06-25 06:27:36,708 INFO [train.py:996] (3/4) Epoch 11, batch 16100, loss[loss=0.2855, simple_loss=0.3383, pruned_loss=0.1163, over 21802.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3049, pruned_loss=0.07378, over 4272750.41 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:28:50,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1926456.0, ans=0.0 2023-06-25 06:29:17,216 INFO [train.py:996] (3/4) Epoch 11, batch 16150, loss[loss=0.2192, simple_loss=0.2916, pruned_loss=0.07343, over 20058.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3046, pruned_loss=0.07569, over 4281412.50 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:29:42,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-25 06:29:50,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1926636.0, ans=0.0 2023-06-25 06:30:20,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1926696.0, ans=0.0 2023-06-25 06:30:27,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.21 vs.
limit=15.0 2023-06-25 06:30:40,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.824e+02 1.229e+03 1.712e+03 3.510e+03, threshold=2.459e+03, percent-clipped=5.0 2023-06-25 06:30:42,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926756.0, ans=0.1 2023-06-25 06:30:54,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 06:31:16,326 INFO [train.py:996] (3/4) Epoch 11, batch 16200, loss[loss=0.2755, simple_loss=0.3504, pruned_loss=0.1003, over 21475.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3088, pruned_loss=0.07737, over 4283774.18 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:31:25,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-25 06:33:02,352 INFO [train.py:996] (3/4) Epoch 11, batch 16250, loss[loss=0.2259, simple_loss=0.2978, pruned_loss=0.07702, over 21683.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3084, pruned_loss=0.07783, over 4279344.61 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:33:11,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-25 06:34:11,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.190e+02 1.044e+03 1.433e+03 2.783e+03, threshold=2.088e+03, percent-clipped=4.0 2023-06-25 06:34:19,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-25 06:34:39,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1927416.0, ans=0.125 2023-06-25 06:34:49,044 INFO [train.py:996] (3/4) Epoch 11, batch 16300, loss[loss=0.2281, simple_loss=0.3161, pruned_loss=0.07005, over 21208.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3038, pruned_loss=0.07524, over 4269266.12 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:34:53,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.17 vs. limit=10.0 2023-06-25 06:35:25,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1927536.0, ans=0.125 2023-06-25 06:35:25,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1927536.0, ans=0.0 2023-06-25 06:35:31,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1927536.0, ans=0.125 2023-06-25 06:35:42,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-25 06:36:30,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1927716.0, ans=0.125 2023-06-25 06:36:36,567 INFO [train.py:996] (3/4) Epoch 11, batch 16350, loss[loss=0.2417, simple_loss=0.331, pruned_loss=0.07619, over 20791.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.3033, pruned_loss=0.07566, over 4269164.56 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:37:05,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1927836.0, ans=0.1 2023-06-25 06:37:27,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1927896.0, ans=0.09899494936611666 2023-06-25 06:37:52,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 7.101e+02 1.051e+03 1.461e+03 2.820e+03, threshold=2.102e+03, percent-clipped=5.0 2023-06-25 06:38:07,981 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:38:24,467 INFO [train.py:996] (3/4) Epoch 11, batch 16400, loss[loss=0.2058, simple_loss=0.282, pruned_loss=0.06476, over 21883.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3083, pruned_loss=0.07752, over 4274896.65 frames. ], batch size: 107, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:38:54,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1928136.0, ans=0.0 2023-06-25 06:39:08,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1928196.0, ans=0.0 2023-06-25 06:39:12,137 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:39:23,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1928256.0, ans=0.0 2023-06-25 06:40:09,723 INFO [train.py:996] (3/4) Epoch 11, batch 16450, loss[loss=0.3251, simple_loss=0.3714, pruned_loss=0.1394, over 21754.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3082, pruned_loss=0.07774, over 4274217.07 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:40:14,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1928376.0, ans=0.05 2023-06-25 06:40:16,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1928376.0, ans=0.125 2023-06-25 06:40:30,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1928376.0, ans=22.5 2023-06-25 06:40:47,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1928436.0, ans=0.125 2023-06-25 06:40:50,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1928496.0, ans=0.125 2023-06-25 06:40:52,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1928496.0, ans=0.2 2023-06-25 06:40:56,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=22.5 2023-06-25 06:41:17,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1928556.0, ans=0.125 2023-06-25 06:41:22,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.919e+02 9.825e+02 1.554e+03 3.786e+03, threshold=1.965e+03, percent-clipped=13.0 2023-06-25 06:41:26,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1928556.0, ans=0.1 2023-06-25 06:41:40,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1928616.0, ans=0.125 2023-06-25 06:41:53,328 INFO [train.py:996] (3/4) Epoch 11, batch 16500, loss[loss=0.3109, simple_loss=0.3891, pruned_loss=0.1164, over 20007.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3075, pruned_loss=0.0781, over 4276420.12 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:41:55,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1928676.0, ans=0.125 2023-06-25 06:42:02,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1928676.0, ans=0.05 2023-06-25 06:42:23,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-25 06:43:13,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1928856.0, ans=0.0 2023-06-25 06:43:44,080 INFO [train.py:996] (3/4) Epoch 11, batch 16550, loss[loss=0.2077, simple_loss=0.2703, pruned_loss=0.07259, over 21834.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.307, pruned_loss=0.07623, over 4281275.87 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:43:53,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1928976.0, ans=0.0 2023-06-25 06:44:05,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1929036.0, ans=0.0 2023-06-25 06:44:05,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-25 06:45:00,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1929156.0, ans=0.125 2023-06-25 06:45:07,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.454e+02 9.180e+02 1.462e+03 2.154e+03 5.250e+03, threshold=2.924e+03, percent-clipped=28.0 2023-06-25 06:45:16,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1929216.0, ans=0.0 2023-06-25 06:45:31,235 INFO [train.py:996] (3/4) Epoch 11, batch 16600, loss[loss=0.2722, simple_loss=0.3959, pruned_loss=0.07426, over 19690.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3145, pruned_loss=0.07892, over 4277938.15 frames. 
], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:46:04,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1929336.0, ans=0.125 2023-06-25 06:46:04,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1929336.0, ans=0.0 2023-06-25 06:46:16,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1929396.0, ans=0.1 2023-06-25 06:46:22,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1929396.0, ans=0.2 2023-06-25 06:47:21,350 INFO [train.py:996] (3/4) Epoch 11, batch 16650, loss[loss=0.2636, simple_loss=0.3399, pruned_loss=0.09367, over 21384.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3243, pruned_loss=0.08133, over 4275130.01 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:47:24,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1929576.0, ans=0.1 2023-06-25 06:47:45,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1929636.0, ans=0.125 2023-06-25 06:47:46,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1929636.0, ans=0.125 2023-06-25 06:48:25,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1929696.0, ans=0.1 2023-06-25 06:48:41,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1929756.0, ans=0.125 2023-06-25 06:48:48,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.385e+02 1.061e+03 1.516e+03 3.591e+03, threshold=2.122e+03, percent-clipped=0.0 2023-06-25 06:48:50,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1929756.0, ans=0.125 2023-06-25 06:48:57,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1929816.0, ans=0.1 2023-06-25 06:49:00,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1929816.0, ans=0.04949747468305833 2023-06-25 06:49:15,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-25 06:49:18,267 INFO [train.py:996] (3/4) Epoch 11, batch 16700, loss[loss=0.1859, simple_loss=0.246, pruned_loss=0.06289, over 21812.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3257, pruned_loss=0.08232, over 4278510.61 frames. 
], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:50:05,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1929936.0, ans=0.125 2023-06-25 06:50:30,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1930056.0, ans=0.125 2023-06-25 06:50:44,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1930056.0, ans=0.0 2023-06-25 06:50:48,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-25 06:51:05,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1930116.0, ans=0.125 2023-06-25 06:51:19,107 INFO [train.py:996] (3/4) Epoch 11, batch 16750, loss[loss=0.2418, simple_loss=0.3542, pruned_loss=0.06468, over 20806.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3291, pruned_loss=0.08554, over 4276497.69 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:51:47,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-06-25 06:51:48,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930236.0, ans=0.1 2023-06-25 06:52:20,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1930296.0, ans=0.125 2023-06-25 06:52:31,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1930356.0, ans=0.0 2023-06-25 06:52:39,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.774e+02 8.238e+02 1.096e+03 1.590e+03 4.377e+03, threshold=2.192e+03, percent-clipped=15.0 2023-06-25 06:52:57,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1930416.0, ans=0.125 2023-06-25 06:53:10,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1930416.0, ans=0.09899494936611666 2023-06-25 06:53:14,799 INFO [train.py:996] (3/4) Epoch 11, batch 16800, loss[loss=0.2525, simple_loss=0.38, pruned_loss=0.06253, over 20738.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3334, pruned_loss=0.08537, over 4269337.05 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:53:15,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1930476.0, ans=0.0 2023-06-25 06:53:33,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-25 06:54:02,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1930596.0, ans=0.125 2023-06-25 06:54:10,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=22.5 2023-06-25 06:54:34,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1930656.0, ans=0.0 2023-06-25 06:54:41,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-25 06:54:42,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1930716.0, ans=0.1 2023-06-25 06:54:59,753 INFO [train.py:996] (3/4) Epoch 11, batch 16850, loss[loss=0.2433, simple_loss=0.355, pruned_loss=0.0658, over 20876.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3314, pruned_loss=0.08519, over 4273207.67 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:55:08,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930776.0, ans=0.1 2023-06-25 06:55:48,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1930896.0, ans=0.125 2023-06-25 06:56:12,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.019e+02 7.924e+02 1.109e+03 1.823e+03 3.367e+03, threshold=2.218e+03, percent-clipped=14.0 2023-06-25 06:56:40,166 INFO [train.py:996] (3/4) Epoch 11, batch 16900, loss[loss=0.1774, simple_loss=0.2546, pruned_loss=0.05007, over 21533.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3252, pruned_loss=0.08405, over 4277489.96 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:56:45,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-25 06:57:37,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1931256.0, ans=15.0 2023-06-25 06:58:00,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1931316.0, ans=0.125 2023-06-25 06:58:23,598 INFO [train.py:996] (3/4) Epoch 11, batch 16950, loss[loss=0.2095, simple_loss=0.2773, pruned_loss=0.07088, over 21427.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3174, pruned_loss=0.08226, over 4274480.56 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:59:07,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-25 06:59:18,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1931556.0, ans=0.0 2023-06-25 06:59:21,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1931556.0, ans=0.2 2023-06-25 06:59:41,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.539e+02 6.475e+02 7.581e+02 1.089e+03 2.288e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-25 07:00:09,554 INFO [train.py:996] (3/4) Epoch 11, batch 17000, loss[loss=0.2254, simple_loss=0.3027, pruned_loss=0.07403, over 21849.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3138, pruned_loss=0.0824, over 4285564.73 frames. 
], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:00:44,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1931736.0, ans=0.2 2023-06-25 07:01:16,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1931856.0, ans=0.0 2023-06-25 07:01:56,033 INFO [train.py:996] (3/4) Epoch 11, batch 17050, loss[loss=0.2682, simple_loss=0.3453, pruned_loss=0.09556, over 21446.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3206, pruned_loss=0.08477, over 4284123.74 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:02:07,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1931976.0, ans=0.2 2023-06-25 07:02:19,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1932036.0, ans=0.125 2023-06-25 07:02:32,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1932036.0, ans=0.125 2023-06-25 07:03:02,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1932156.0, ans=0.125 2023-06-25 07:03:22,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 8.949e+02 1.142e+03 1.744e+03 3.951e+03, threshold=2.284e+03, percent-clipped=33.0 2023-06-25 07:03:42,195 INFO [train.py:996] (3/4) Epoch 11, batch 17100, loss[loss=0.2632, simple_loss=0.3287, pruned_loss=0.09886, over 21804.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3181, pruned_loss=0.08486, over 4289215.13 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:04:35,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1932396.0, ans=0.0 2023-06-25 07:05:29,195 INFO [train.py:996] (3/4) Epoch 11, batch 17150, loss[loss=0.1951, simple_loss=0.2757, pruned_loss=0.05724, over 21736.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.315, pruned_loss=0.08433, over 4290789.38 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:06:11,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932696.0, ans=0.1 2023-06-25 07:06:50,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-25 07:06:55,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 7.399e+02 1.011e+03 1.479e+03 2.669e+03, threshold=2.021e+03, percent-clipped=4.0 2023-06-25 07:07:06,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1932816.0, ans=0.125 2023-06-25 07:07:16,406 INFO [train.py:996] (3/4) Epoch 11, batch 17200, loss[loss=0.2579, simple_loss=0.3329, pruned_loss=0.09148, over 21273.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3141, pruned_loss=0.08426, over 4293696.73 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:07:29,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. 
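limit=15.0

The scaling.py:962 Whitening lines, like the one completed just above, compare a per-module metric against a limit. One plausible form of such a metric, sketched below as an assumption rather than lifted from scaling.py, is an anisotropy measure of the feature covariance C: num_channels * trace(C^2) / trace(C)^2, which equals 1.0 when the features are perfectly white (C proportional to the identity) and num_channels when all variance collapses into a single direction. Under that reading, "metric=5.50 vs. limit=15.0" means the activations are only mildly anisotropic and no whitening penalty is applied.

import torch

# Sketch of a whitening metric (an assumed form, not scaling.py's code).
def whitening_metric(x):
    # x: (num_frames, num_channels) activations from one module
    x = x - x.mean(dim=0)
    cov = (x.t() @ x) / x.shape[0]  # covariance estimate, shape (C, C)
    num_channels = cov.shape[0]
    # 1.0 <= metric <= num_channels; 1.0 means fully whitened features
    return num_channels * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2

def whiten_penalty(x, limit):
    # Zero (hence no gradient) while the metric stays under the limit,
    # matching the many reports above where metric < limit.
    return torch.relu(whitening_metric(x) - limit)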
2023-06-25 07:08:21,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1932996.0, ans=0.0 2023-06-25 07:08:43,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1933056.0, ans=0.2 2023-06-25 07:09:10,152 INFO [train.py:996] (3/4) Epoch 11, batch 17250, loss[loss=0.2334, simple_loss=0.3128, pruned_loss=0.07697, over 21895.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3168, pruned_loss=0.08544, over 4292212.21 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:09:35,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1933236.0, ans=0.0 2023-06-25 07:09:37,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-25 07:10:31,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 7.674e+02 1.037e+03 1.511e+03 3.569e+03, threshold=2.074e+03, percent-clipped=11.0 2023-06-25 07:10:56,480 INFO [train.py:996] (3/4) Epoch 11, batch 17300, loss[loss=0.2978, simple_loss=0.3622, pruned_loss=0.1167, over 21301.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3255, pruned_loss=0.08872, over 4288721.63 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:11:33,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1933536.0, ans=0.0 2023-06-25 07:11:34,712 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:12:02,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1933596.0, ans=0.2 2023-06-25 07:12:05,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1933656.0, ans=0.0 2023-06-25 07:12:50,942 INFO [train.py:996] (3/4) Epoch 11, batch 17350, loss[loss=0.193, simple_loss=0.2762, pruned_loss=0.05493, over 21400.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.327, pruned_loss=0.08876, over 4282327.88 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:13:01,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-25 07:13:10,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-25 07:13:11,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1933776.0, ans=0.0 2023-06-25 07:13:57,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1933956.0, ans=0.95 2023-06-25 07:14:08,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.102e+02 8.837e+02 1.250e+03 1.745e+03 4.253e+03, threshold=2.500e+03, percent-clipped=18.0 2023-06-25 07:14:46,086 INFO [train.py:996] (3/4) Epoch 11, batch 17400, loss[loss=0.219, simple_loss=0.2832, pruned_loss=0.07745, over 20145.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3233, pruned_loss=0.08505, over 4282814.00 frames.
], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:16:17,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1934316.0, ans=0.125 2023-06-25 07:16:26,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934316.0, ans=0.1 2023-06-25 07:16:33,112 INFO [train.py:996] (3/4) Epoch 11, batch 17450, loss[loss=0.2117, simple_loss=0.3035, pruned_loss=0.05999, over 21701.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3186, pruned_loss=0.082, over 4268051.23 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:16:38,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1934376.0, ans=0.5 2023-06-25 07:17:06,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1934436.0, ans=0.125 2023-06-25 07:17:21,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1934496.0, ans=0.0 2023-06-25 07:17:48,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1934556.0, ans=0.125 2023-06-25 07:18:04,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.727e+02 1.188e+03 2.165e+03 4.981e+03, threshold=2.376e+03, percent-clipped=19.0 2023-06-25 07:18:05,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1934616.0, ans=0.09899494936611666 2023-06-25 07:18:13,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-25 07:18:22,110 INFO [train.py:996] (3/4) Epoch 11, batch 17500, loss[loss=0.2481, simple_loss=0.3187, pruned_loss=0.08872, over 21393.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.315, pruned_loss=0.07944, over 4273608.06 frames. ], batch size: 131, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:18:23,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-25 07:18:39,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1934736.0, ans=0.0 2023-06-25 07:19:05,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1934796.0, ans=0.125 2023-06-25 07:20:05,460 INFO [train.py:996] (3/4) Epoch 11, batch 17550, loss[loss=0.2239, simple_loss=0.3168, pruned_loss=0.06551, over 21462.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3136, pruned_loss=0.07765, over 4267484.70 frames. 
], batch size: 194, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:20:17,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1934976.0, ans=0.05 2023-06-25 07:21:29,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.993e+02 7.228e+02 9.267e+02 1.344e+03 3.002e+03, threshold=1.853e+03, percent-clipped=5.0 2023-06-25 07:21:37,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1935216.0, ans=0.0 2023-06-25 07:21:37,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1935216.0, ans=0.2 2023-06-25 07:21:49,261 INFO [train.py:996] (3/4) Epoch 11, batch 17600, loss[loss=0.2335, simple_loss=0.3092, pruned_loss=0.07894, over 20622.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3165, pruned_loss=0.07845, over 4259815.22 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:22:07,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1935336.0, ans=0.2 2023-06-25 07:22:39,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1935396.0, ans=0.125 2023-06-25 07:23:43,420 INFO [train.py:996] (3/4) Epoch 11, batch 17650, loss[loss=0.1778, simple_loss=0.2481, pruned_loss=0.05371, over 21645.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3147, pruned_loss=0.07881, over 4269750.94 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:24:12,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-25 07:24:23,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1935696.0, ans=0.2 2023-06-25 07:24:25,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1935696.0, ans=0.0 2023-06-25 07:25:04,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1935756.0, ans=0.0 2023-06-25 07:25:12,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 8.324e+02 1.406e+03 1.795e+03 4.059e+03, threshold=2.812e+03, percent-clipped=23.0 2023-06-25 07:25:21,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1935816.0, ans=0.125 2023-06-25 07:25:30,958 INFO [train.py:996] (3/4) Epoch 11, batch 17700, loss[loss=0.229, simple_loss=0.3181, pruned_loss=0.06992, over 21565.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3068, pruned_loss=0.07557, over 4254280.74 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:25:54,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1935936.0, ans=0.125 2023-06-25 07:27:21,736 INFO [train.py:996] (3/4) Epoch 11, batch 17750, loss[loss=0.2736, simple_loss=0.3465, pruned_loss=0.1004, over 21624.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3157, pruned_loss=0.07912, over 4262621.77 frames. 
], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:27:30,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1936176.0, ans=0.0 2023-06-25 07:27:39,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-06-25 07:27:47,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1936236.0, ans=0.125 2023-06-25 07:27:47,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1936236.0, ans=0.05 2023-06-25 07:27:49,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1936236.0, ans=0.125 2023-06-25 07:28:35,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1936356.0, ans=0.1 2023-06-25 07:28:50,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 6.855e+02 8.351e+02 1.068e+03 2.757e+03, threshold=1.670e+03, percent-clipped=0.0 2023-06-25 07:28:57,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-25 07:29:09,867 INFO [train.py:996] (3/4) Epoch 11, batch 17800, loss[loss=0.2433, simple_loss=0.3241, pruned_loss=0.08122, over 19942.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3148, pruned_loss=0.07845, over 4264177.79 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:30:09,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1936596.0, ans=0.0 2023-06-25 07:30:28,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1936656.0, ans=0.125 2023-06-25 07:30:28,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-25 07:30:38,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-25 07:30:41,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-25 07:30:49,418 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:30:57,465 INFO [train.py:996] (3/4) Epoch 11, batch 17850, loss[loss=0.2865, simple_loss=0.3606, pruned_loss=0.1062, over 21363.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3158, pruned_loss=0.07859, over 4266317.99 frames. ], batch size: 549, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:31:19,006 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. 
limit=15.0 2023-06-25 07:31:43,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1936896.0, ans=0.125 2023-06-25 07:31:59,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1936896.0, ans=0.2 2023-06-25 07:31:59,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1936896.0, ans=0.0 2023-06-25 07:32:22,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 9.470e+02 1.328e+03 1.940e+03 3.459e+03, threshold=2.655e+03, percent-clipped=37.0 2023-06-25 07:32:29,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 07:32:30,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=12.0 2023-06-25 07:32:39,656 INFO [train.py:996] (3/4) Epoch 11, batch 17900, loss[loss=0.2511, simple_loss=0.3257, pruned_loss=0.08828, over 21102.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3202, pruned_loss=0.08074, over 4263012.99 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:33:12,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1937136.0, ans=0.125 2023-06-25 07:34:00,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. limit=10.0 2023-06-25 07:34:01,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937256.0, ans=0.1 2023-06-25 07:34:41,340 INFO [train.py:996] (3/4) Epoch 11, batch 17950, loss[loss=0.1838, simple_loss=0.2797, pruned_loss=0.04395, over 21758.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3193, pruned_loss=0.0776, over 4256174.12 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:34:41,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1937376.0, ans=0.125 2023-06-25 07:35:35,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1937556.0, ans=0.125 2023-06-25 07:35:37,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-25 07:35:56,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1937556.0, ans=0.125 2023-06-25 07:35:59,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.777e+02 1.188e+03 1.807e+03 3.395e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-25 07:36:27,819 INFO [train.py:996] (3/4) Epoch 11, batch 18000, loss[loss=0.2081, simple_loss=0.2628, pruned_loss=0.07669, over 20670.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.312, pruned_loss=0.07562, over 4261635.41 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:36:27,820 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 07:36:44,997 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2562, simple_loss=0.3557, pruned_loss=0.07833, over 1796401.00 frames. 
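The [optim.py:471] entries above are worth decoding: the five "grad-norm quartiles" values appear to be the min/25%/median/75%/max of recent per-batch gradient norms, and in every entry here the reported threshold equals Clipping_scale (2.0) times the middle value (e.g. 2.0 x 1.188e+03 = 2.376e+03), with percent-clipped giving the share of recent batches whose gradient norm exceeded that threshold. The sketch below is a minimal, illustrative parser for pulling those numbers, plus the per-batch tot_loss from the [train.py:996] summaries, out of a log like this one for plotting or monitoring. It is not part of icefall; the names (BATCH_RE, CLIP_RE, parse, and the path argument) are made up for this sketch, and it assumes one log record per line with exactly the field layout shown in the entries above.

    import re

    # Matches training-batch summaries such as:
    #   ... INFO [train.py:996] (3/4) Epoch 11, batch 19000, loss[...],
    #   tot_loss[loss=0.2259, simple_loss=0.3019, pruned_loss=0.07492, ...]
    # Validation lines ("Epoch 11, validation: loss=...") lack "batch N,"
    # and are deliberately not matched.
    BATCH_RE = re.compile(
        r"\[train\.py:\d+\] \(\d+/\d+\) Epoch (\d+), batch (\d+), .*"
        r"tot_loss\[loss=([0-9.]+), simple_loss=([0-9.]+), pruned_loss=([0-9.]+)"
    )
    # Matches gradient-clipping summaries such as:
    #   ... INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles
    #   4.432e+02 7.727e+02 1.188e+03 2.165e+03 4.981e+03,
    #   threshold=2.376e+03, percent-clipped=19.0
    CLIP_RE = re.compile(
        r"\[optim\.py:\d+\].*grad-norm quartiles ((?:[0-9.e+]+ ){4}[0-9.e+]+), "
        r"threshold=([0-9.e+]+), percent-clipped=([0-9.]+)"
    )

    def parse(path):
        """Return (losses, clips): (epoch, batch, tot_loss) points and
        (quartiles, threshold, percent_clipped) clipping stats."""
        losses, clips = [], []
        with open(path) as f:
            for line in f:
                m = BATCH_RE.search(line)
                if m:
                    epoch, batch = int(m.group(1)), int(m.group(2))
                    losses.append((epoch, batch, float(m.group(3))))
                    continue
                m = CLIP_RE.search(line)
                if m:
                    quartiles = [float(x) for x in m.group(1).split()]
                    # In this log, threshold == 2.0 * quartiles[2] (the median).
                    clips.append((quartiles, float(m.group(2)), float(m.group(3))))
        return losses, clips

Run against a log with one record per line, losses gives (epoch, batch, tot_loss) triples suitable for plotting the training curve, and clips lets one confirm the threshold-vs-median relation or flag spikes in clipping pressure, such as the percent-clipped=37.0 entry at 07:32:22 above.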
2023-06-25 07:36:44,998 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 07:37:04,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1937676.0, ans=0.125 2023-06-25 07:37:08,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1937736.0, ans=0.2 2023-06-25 07:38:11,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937916.0, ans=0.1 2023-06-25 07:38:33,156 INFO [train.py:996] (3/4) Epoch 11, batch 18050, loss[loss=0.2143, simple_loss=0.2813, pruned_loss=0.07363, over 20727.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3059, pruned_loss=0.07491, over 4252266.89 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:38:49,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1938036.0, ans=0.2 2023-06-25 07:38:55,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-25 07:39:04,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1938036.0, ans=0.125 2023-06-25 07:39:59,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.782e+02 7.777e+02 1.078e+03 1.586e+03 2.998e+03, threshold=2.156e+03, percent-clipped=7.0 2023-06-25 07:40:21,745 INFO [train.py:996] (3/4) Epoch 11, batch 18100, loss[loss=0.2365, simple_loss=0.3308, pruned_loss=0.07112, over 21692.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3112, pruned_loss=0.07761, over 4257945.57 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:40:45,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-06-25 07:41:21,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-25 07:42:02,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1938516.0, ans=0.125 2023-06-25 07:42:08,815 INFO [train.py:996] (3/4) Epoch 11, batch 18150, loss[loss=0.2375, simple_loss=0.3063, pruned_loss=0.08435, over 21763.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3122, pruned_loss=0.07687, over 4267520.23 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:42:37,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1938636.0, ans=0.0 2023-06-25 07:42:41,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1938696.0, ans=0.1 2023-06-25 07:43:23,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=12.0 2023-06-25 07:43:31,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=15.0 2023-06-25 07:43:31,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 7.455e+02 1.236e+03 1.816e+03 3.616e+03, threshold=2.471e+03, percent-clipped=14.0 2023-06-25 07:43:32,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1938816.0, ans=0.125 2023-06-25 07:43:54,188 INFO [train.py:996] (3/4) Epoch 11, batch 18200, loss[loss=0.2245, simple_loss=0.2868, pruned_loss=0.08113, over 21819.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3064, pruned_loss=0.07662, over 4273744.31 frames. ], batch size: 98, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:44:15,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1938936.0, ans=0.125 2023-06-25 07:44:42,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939056.0, ans=0.1 2023-06-25 07:45:26,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1939116.0, ans=0.125 2023-06-25 07:45:26,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1939116.0, ans=0.0 2023-06-25 07:45:33,100 INFO [train.py:996] (3/4) Epoch 11, batch 18250, loss[loss=0.2131, simple_loss=0.2822, pruned_loss=0.07201, over 21796.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2989, pruned_loss=0.07422, over 4275572.29 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:45:51,805 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:46:03,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1939236.0, ans=0.2 2023-06-25 07:46:04,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1939236.0, ans=0.0 2023-06-25 07:46:10,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-25 07:46:26,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939296.0, ans=0.1 2023-06-25 07:46:31,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1939356.0, ans=0.04949747468305833 2023-06-25 07:46:32,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1939356.0, ans=0.05 2023-06-25 07:46:56,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.451e+02 6.639e+02 9.483e+02 1.514e+03 2.544e+03, threshold=1.897e+03, percent-clipped=1.0 2023-06-25 07:47:09,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 07:47:18,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-25 07:47:21,080 INFO [train.py:996] (3/4) Epoch 11, batch 18300, loss[loss=0.2658, simple_loss=0.3508, pruned_loss=0.09043, over 21712.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2972, pruned_loss=0.07441, over 4268491.71 frames. 
], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:47:48,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-25 07:48:19,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1939656.0, ans=0.0 2023-06-25 07:48:31,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-25 07:48:52,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1939716.0, ans=0.09899494936611666 2023-06-25 07:49:01,547 INFO [train.py:996] (3/4) Epoch 11, batch 18350, loss[loss=0.2335, simple_loss=0.2943, pruned_loss=0.08632, over 21118.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3045, pruned_loss=0.07495, over 4267371.44 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:49:14,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1939776.0, ans=0.2 2023-06-25 07:49:45,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1939896.0, ans=0.0 2023-06-25 07:50:32,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.196e+02 8.074e+02 1.390e+03 1.835e+03 4.417e+03, threshold=2.780e+03, percent-clipped=23.0 2023-06-25 07:50:42,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 07:50:52,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1940016.0, ans=0.035 2023-06-25 07:50:55,371 INFO [train.py:996] (3/4) Epoch 11, batch 18400, loss[loss=0.1768, simple_loss=0.2557, pruned_loss=0.04895, over 21532.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3004, pruned_loss=0.07408, over 4252584.08 frames. ], batch size: 195, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 07:51:09,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1940076.0, ans=0.1 2023-06-25 07:51:16,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1940136.0, ans=0.125 2023-06-25 07:51:18,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1940136.0, ans=0.04949747468305833 2023-06-25 07:51:30,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2023-06-25 07:52:24,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1940316.0, ans=0.125 2023-06-25 07:52:43,118 INFO [train.py:996] (3/4) Epoch 11, batch 18450, loss[loss=0.2104, simple_loss=0.299, pruned_loss=0.06093, over 21498.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2975, pruned_loss=0.0707, over 4251836.35 frames. 
], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:53:10,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1940436.0, ans=0.125 2023-06-25 07:53:42,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-25 07:54:04,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.992e+02 1.032e+03 1.619e+03 3.807e+03, threshold=2.064e+03, percent-clipped=5.0 2023-06-25 07:54:25,066 INFO [train.py:996] (3/4) Epoch 11, batch 18500, loss[loss=0.2188, simple_loss=0.2956, pruned_loss=0.07099, over 21241.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2943, pruned_loss=0.07002, over 4242661.03 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:55:14,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1940796.0, ans=0.0 2023-06-25 07:56:07,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1940916.0, ans=0.0 2023-06-25 07:56:15,406 INFO [train.py:996] (3/4) Epoch 11, batch 18550, loss[loss=0.2085, simple_loss=0.287, pruned_loss=0.06502, over 20080.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2939, pruned_loss=0.06947, over 4241352.10 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:56:16,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-25 07:56:20,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-25 07:56:20,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-25 07:57:46,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1941216.0, ans=0.125 2023-06-25 07:57:49,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 7.256e+02 1.032e+03 1.520e+03 3.767e+03, threshold=2.064e+03, percent-clipped=11.0 2023-06-25 07:58:04,459 INFO [train.py:996] (3/4) Epoch 11, batch 18600, loss[loss=0.177, simple_loss=0.2536, pruned_loss=0.05022, over 21351.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2923, pruned_loss=0.07032, over 4242174.82 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:58:14,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1941276.0, ans=0.2 2023-06-25 07:58:58,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1941396.0, ans=0.1 2023-06-25 07:59:50,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1941576.0, ans=0.125 2023-06-25 07:59:51,198 INFO [train.py:996] (3/4) Epoch 11, batch 18650, loss[loss=0.1879, simple_loss=0.2589, pruned_loss=0.05843, over 20011.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2912, pruned_loss=0.07038, over 4229431.84 frames. 
], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:00:16,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1941636.0, ans=0.125 2023-06-25 08:00:18,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941636.0, ans=0.1 2023-06-25 08:01:02,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1941756.0, ans=0.125 2023-06-25 08:01:21,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.720e+02 7.139e+02 9.409e+02 1.577e+03 2.753e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-25 08:01:35,901 INFO [train.py:996] (3/4) Epoch 11, batch 18700, loss[loss=0.2238, simple_loss=0.2895, pruned_loss=0.07902, over 15503.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2894, pruned_loss=0.07186, over 4228826.26 frames. ], batch size: 60, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:02:02,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1941936.0, ans=0.125 2023-06-25 08:02:06,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941936.0, ans=0.1 2023-06-25 08:02:11,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1941996.0, ans=0.125 2023-06-25 08:02:21,947 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:02:37,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1942056.0, ans=0.035 2023-06-25 08:02:39,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-25 08:03:05,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1942056.0, ans=0.125 2023-06-25 08:03:20,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1942116.0, ans=0.2 2023-06-25 08:03:24,892 INFO [train.py:996] (3/4) Epoch 11, batch 18750, loss[loss=0.2134, simple_loss=0.2767, pruned_loss=0.07506, over 21779.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2931, pruned_loss=0.07511, over 4238030.01 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:04:04,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1942296.0, ans=0.0 2023-06-25 08:04:50,020 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.399e+02 1.249e+03 1.994e+03 4.167e+03, threshold=2.497e+03, percent-clipped=25.0 2023-06-25 08:05:08,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1942416.0, ans=0.125 2023-06-25 08:05:08,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1942416.0, ans=0.125 2023-06-25 08:05:11,241 INFO [train.py:996] (3/4) Epoch 11, batch 18800, loss[loss=0.2358, simple_loss=0.3261, pruned_loss=0.07279, over 21842.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.2989, pruned_loss=0.07556, over 4249019.21 frames. ], batch size: 316, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 08:05:12,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.16 vs. limit=10.0 2023-06-25 08:06:02,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1942596.0, ans=0.0 2023-06-25 08:06:23,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1942656.0, ans=0.2 2023-06-25 08:06:46,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1942716.0, ans=0.0 2023-06-25 08:06:56,509 INFO [train.py:996] (3/4) Epoch 11, batch 18850, loss[loss=0.1936, simple_loss=0.2566, pruned_loss=0.0653, over 21235.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2953, pruned_loss=0.07202, over 4238589.23 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:07:16,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1942836.0, ans=0.125 2023-06-25 08:07:28,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1942896.0, ans=0.07 2023-06-25 08:07:31,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-25 08:07:34,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942896.0, ans=0.1 2023-06-25 08:07:57,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942956.0, ans=0.1 2023-06-25 08:08:15,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1942956.0, ans=0.125 2023-06-25 08:08:21,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.140e+02 8.289e+02 1.259e+03 4.459e+03, threshold=1.658e+03, percent-clipped=10.0 2023-06-25 08:08:34,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1943016.0, ans=0.5 2023-06-25 08:08:40,570 INFO [train.py:996] (3/4) Epoch 11, batch 18900, loss[loss=0.2018, simple_loss=0.2598, pruned_loss=0.07184, over 20979.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2906, pruned_loss=0.07133, over 4245442.42 frames. ], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:09:15,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-25 08:09:30,704 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-25 08:10:27,777 INFO [train.py:996] (3/4) Epoch 11, batch 18950, loss[loss=0.1947, simple_loss=0.2692, pruned_loss=0.06008, over 21641.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.292, pruned_loss=0.07271, over 4262644.15 frames. 
], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:10:57,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1943436.0, ans=0.0 2023-06-25 08:11:40,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-25 08:11:43,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943556.0, ans=0.125 2023-06-25 08:11:57,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-25 08:12:02,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 8.258e+02 1.054e+03 1.529e+03 3.478e+03, threshold=2.107e+03, percent-clipped=19.0 2023-06-25 08:12:15,293 INFO [train.py:996] (3/4) Epoch 11, batch 19000, loss[loss=0.2319, simple_loss=0.3101, pruned_loss=0.07683, over 21598.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3019, pruned_loss=0.07492, over 4270804.61 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:12:47,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943796.0, ans=0.1 2023-06-25 08:13:19,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1943796.0, ans=0.125 2023-06-25 08:13:38,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1943916.0, ans=0.125 2023-06-25 08:14:01,773 INFO [train.py:996] (3/4) Epoch 11, batch 19050, loss[loss=0.2293, simple_loss=0.2963, pruned_loss=0.08121, over 21654.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3054, pruned_loss=0.07797, over 4281236.47 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:14:02,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1943976.0, ans=0.125 2023-06-25 08:14:22,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-25 08:14:23,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1944036.0, ans=0.125 2023-06-25 08:15:11,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1944156.0, ans=0.0 2023-06-25 08:15:33,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.658e+02 1.037e+03 1.522e+03 3.485e+03, threshold=2.073e+03, percent-clipped=12.0 2023-06-25 08:15:48,079 INFO [train.py:996] (3/4) Epoch 11, batch 19100, loss[loss=0.2688, simple_loss=0.3096, pruned_loss=0.1141, over 21407.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.305, pruned_loss=0.07992, over 4286076.50 frames. ], batch size: 509, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:16:41,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1944396.0, ans=0.0 2023-06-25 08:17:35,893 INFO [train.py:996] (3/4) Epoch 11, batch 19150, loss[loss=0.2571, simple_loss=0.3563, pruned_loss=0.07896, over 21155.00 frames. 
], tot_loss[loss=0.2329, simple_loss=0.3059, pruned_loss=0.07998, over 4288162.54 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:17:37,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-25 08:18:29,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1944696.0, ans=0.125 2023-06-25 08:19:11,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1944816.0, ans=0.2 2023-06-25 08:19:14,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.349e+02 9.714e+02 1.394e+03 2.160e+03 4.455e+03, threshold=2.788e+03, percent-clipped=28.0 2023-06-25 08:19:26,268 INFO [train.py:996] (3/4) Epoch 11, batch 19200, loss[loss=0.1508, simple_loss=0.2344, pruned_loss=0.03363, over 16327.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.314, pruned_loss=0.08013, over 4276883.22 frames. ], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:19:26,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1944876.0, ans=0.0 2023-06-25 08:19:38,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1944876.0, ans=0.0 2023-06-25 08:19:41,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-25 08:20:02,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1944936.0, ans=0.1 2023-06-25 08:20:37,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1945056.0, ans=0.07 2023-06-25 08:20:47,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-25 08:21:08,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1945116.0, ans=0.125 2023-06-25 08:21:11,480 INFO [train.py:996] (3/4) Epoch 11, batch 19250, loss[loss=0.1918, simple_loss=0.2834, pruned_loss=0.05013, over 21647.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3161, pruned_loss=0.07571, over 4269249.68 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:21:57,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-25 08:22:14,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.36 vs. limit=10.0 2023-06-25 08:22:20,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1945296.0, ans=0.125 2023-06-25 08:22:35,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. 
limit=15.0 2023-06-25 08:22:38,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1945356.0, ans=0.0 2023-06-25 08:22:46,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.037e+02 6.757e+02 9.006e+02 1.219e+03 2.409e+03, threshold=1.801e+03, percent-clipped=0.0 2023-06-25 08:22:57,448 INFO [train.py:996] (3/4) Epoch 11, batch 19300, loss[loss=0.2102, simple_loss=0.2913, pruned_loss=0.06456, over 21715.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3126, pruned_loss=0.07516, over 4281450.75 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:23:33,670 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-25 08:24:07,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1945596.0, ans=0.125 2023-06-25 08:24:09,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-25 08:24:12,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-25 08:24:52,178 INFO [train.py:996] (3/4) Epoch 11, batch 19350, loss[loss=0.2515, simple_loss=0.3188, pruned_loss=0.09204, over 21345.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3067, pruned_loss=0.07134, over 4284224.94 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:25:05,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1945776.0, ans=0.0 2023-06-25 08:25:05,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1945776.0, ans=0.125 2023-06-25 08:25:37,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 08:26:17,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1946016.0, ans=0.125 2023-06-25 08:26:18,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.662e+02 8.721e+02 1.407e+03 2.132e+03 4.703e+03, threshold=2.815e+03, percent-clipped=33.0 2023-06-25 08:26:36,747 INFO [train.py:996] (3/4) Epoch 11, batch 19400, loss[loss=0.277, simple_loss=0.3889, pruned_loss=0.08261, over 19757.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.305, pruned_loss=0.07095, over 4283038.34 frames. 
], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:27:22,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1946196.0, ans=0.0 2023-06-25 08:27:46,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1946256.0, ans=0.0 2023-06-25 08:27:49,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1946256.0, ans=0.125 2023-06-25 08:28:10,848 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.697e-03 2023-06-25 08:28:22,706 INFO [train.py:996] (3/4) Epoch 11, batch 19450, loss[loss=0.2275, simple_loss=0.2867, pruned_loss=0.08414, over 21595.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3028, pruned_loss=0.07248, over 4286624.43 frames. ], batch size: 414, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:28:24,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.63 vs. limit=15.0 2023-06-25 08:29:28,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1946556.0, ans=0.125 2023-06-25 08:29:53,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.635e+02 8.363e+02 1.164e+03 1.702e+03 3.020e+03, threshold=2.327e+03, percent-clipped=5.0 2023-06-25 08:30:08,974 INFO [train.py:996] (3/4) Epoch 11, batch 19500, loss[loss=0.2289, simple_loss=0.3052, pruned_loss=0.0763, over 21801.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2986, pruned_loss=0.07395, over 4277885.72 frames. ], batch size: 372, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:30:31,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946676.0, ans=0.1 2023-06-25 08:31:08,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1946796.0, ans=0.0 2023-06-25 08:31:18,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1946856.0, ans=0.125 2023-06-25 08:31:49,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1946916.0, ans=0.5 2023-06-25 08:31:50,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1946916.0, ans=0.125 2023-06-25 08:31:57,008 INFO [train.py:996] (3/4) Epoch 11, batch 19550, loss[loss=0.1387, simple_loss=0.2089, pruned_loss=0.03423, over 21223.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2936, pruned_loss=0.07249, over 4272312.22 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:33:31,260 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.971e+02 1.072e+03 1.636e+03 3.226e+03, threshold=2.144e+03, percent-clipped=9.0 2023-06-25 08:33:41,362 INFO [train.py:996] (3/4) Epoch 11, batch 19600, loss[loss=0.1878, simple_loss=0.2565, pruned_loss=0.05954, over 21191.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2951, pruned_loss=0.07284, over 4279586.36 frames. 
], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:34:16,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.82 vs. limit=10.0 2023-06-25 08:34:34,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1947396.0, ans=0.0 2023-06-25 08:35:24,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1947516.0, ans=0.1 2023-06-25 08:35:36,924 INFO [train.py:996] (3/4) Epoch 11, batch 19650, loss[loss=0.2063, simple_loss=0.2796, pruned_loss=0.06651, over 21863.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2999, pruned_loss=0.07658, over 4285161.09 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:35:40,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1947576.0, ans=0.5 2023-06-25 08:35:43,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1947576.0, ans=0.2 2023-06-25 08:36:14,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1947696.0, ans=0.0 2023-06-25 08:36:48,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1947756.0, ans=0.125 2023-06-25 08:37:02,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1947816.0, ans=0.0 2023-06-25 08:37:15,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.249e+02 7.644e+02 9.843e+02 1.375e+03 3.676e+03, threshold=1.969e+03, percent-clipped=9.0 2023-06-25 08:37:30,371 INFO [train.py:996] (3/4) Epoch 11, batch 19700, loss[loss=0.2994, simple_loss=0.3785, pruned_loss=0.1101, over 21490.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3048, pruned_loss=0.07819, over 4281210.63 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:37:36,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1947876.0, ans=0.125 2023-06-25 08:37:52,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1947936.0, ans=0.125 2023-06-25 08:38:07,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1947996.0, ans=0.2 2023-06-25 08:38:19,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1947996.0, ans=0.2 2023-06-25 08:38:28,409 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:39:12,076 INFO [train.py:996] (3/4) Epoch 11, batch 19750, loss[loss=0.2447, simple_loss=0.3185, pruned_loss=0.08544, over 21270.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3149, pruned_loss=0.07921, over 4283701.53 frames. 
], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:39:25,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1948176.0, ans=0.0 2023-06-25 08:40:26,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1948356.0, ans=0.0 2023-06-25 08:40:49,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.288e+02 1.013e+03 1.397e+03 2.237e+03 5.539e+03, threshold=2.794e+03, percent-clipped=30.0 2023-06-25 08:40:59,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1948476.0, ans=0.125 2023-06-25 08:41:00,478 INFO [train.py:996] (3/4) Epoch 11, batch 19800, loss[loss=0.2051, simple_loss=0.2814, pruned_loss=0.06441, over 21798.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3159, pruned_loss=0.08017, over 4284893.15 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:41:16,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1948536.0, ans=0.0 2023-06-25 08:41:40,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-25 08:41:41,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1948596.0, ans=0.04949747468305833 2023-06-25 08:42:28,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-25 08:42:28,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.44 vs. limit=6.0 2023-06-25 08:42:47,225 INFO [train.py:996] (3/4) Epoch 11, batch 19850, loss[loss=0.1401, simple_loss=0.1972, pruned_loss=0.04153, over 16679.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.308, pruned_loss=0.07524, over 4273711.00 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:42:47,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1948776.0, ans=0.2 2023-06-25 08:42:52,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1948776.0, ans=0.0 2023-06-25 08:43:56,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1948956.0, ans=0.09899494936611666 2023-06-25 08:44:23,840 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.609e+02 1.066e+03 1.634e+03 3.345e+03, threshold=2.132e+03, percent-clipped=4.0 2023-06-25 08:44:32,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1949076.0, ans=0.125 2023-06-25 08:44:33,353 INFO [train.py:996] (3/4) Epoch 11, batch 19900, loss[loss=0.21, simple_loss=0.3054, pruned_loss=0.05735, over 21796.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3061, pruned_loss=0.0721, over 4283159.26 frames. 
], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:45:21,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1949196.0, ans=0.0 2023-06-25 08:45:59,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1949256.0, ans=0.125 2023-06-25 08:46:19,629 INFO [train.py:996] (3/4) Epoch 11, batch 19950, loss[loss=0.1886, simple_loss=0.2539, pruned_loss=0.06166, over 20714.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3011, pruned_loss=0.07194, over 4270914.47 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:46:57,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-25 08:47:48,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1949616.0, ans=0.07 2023-06-25 08:47:53,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 7.247e+02 1.068e+03 1.569e+03 2.873e+03, threshold=2.135e+03, percent-clipped=11.0 2023-06-25 08:48:01,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1949616.0, ans=0.125 2023-06-25 08:48:03,736 INFO [train.py:996] (3/4) Epoch 11, batch 20000, loss[loss=0.2407, simple_loss=0.3173, pruned_loss=0.08203, over 21778.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3031, pruned_loss=0.07296, over 4281172.88 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:48:04,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1949676.0, ans=15.0 2023-06-25 08:48:05,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1949676.0, ans=0.125 2023-06-25 08:49:45,789 INFO [train.py:996] (3/4) Epoch 11, batch 20050, loss[loss=0.2418, simple_loss=0.3082, pruned_loss=0.08773, over 21812.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3047, pruned_loss=0.07514, over 4284079.75 frames. ], batch size: 298, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:51:01,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1950156.0, ans=0.1 2023-06-25 08:51:06,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1950156.0, ans=0.07 2023-06-25 08:51:23,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 8.021e+02 1.064e+03 1.748e+03 3.117e+03, threshold=2.127e+03, percent-clipped=13.0 2023-06-25 08:51:33,720 INFO [train.py:996] (3/4) Epoch 11, batch 20100, loss[loss=0.244, simple_loss=0.3365, pruned_loss=0.07576, over 21852.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3071, pruned_loss=0.07755, over 4285696.78 frames. 
], batch size: 332, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:52:11,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1950336.0, ans=0.125 2023-06-25 08:52:44,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1950396.0, ans=0.1 2023-06-25 08:52:52,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1950456.0, ans=0.1 2023-06-25 08:53:28,316 INFO [train.py:996] (3/4) Epoch 11, batch 20150, loss[loss=0.2801, simple_loss=0.3441, pruned_loss=0.108, over 21335.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3172, pruned_loss=0.08159, over 4282305.39 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:53:44,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1950576.0, ans=0.0 2023-06-25 08:53:46,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-25 08:53:51,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-25 08:55:17,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.413e+02 8.388e+02 1.067e+03 1.531e+03 4.094e+03, threshold=2.133e+03, percent-clipped=12.0 2023-06-25 08:55:18,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1950816.0, ans=0.125 2023-06-25 08:55:25,365 INFO [train.py:996] (3/4) Epoch 11, batch 20200, loss[loss=0.2111, simple_loss=0.2988, pruned_loss=0.06172, over 21277.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.323, pruned_loss=0.08415, over 4280702.96 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:55:28,869 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:55:44,185 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:55:55,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1950936.0, ans=0.2 2023-06-25 08:56:40,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1951056.0, ans=0.125 2023-06-25 08:57:06,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1951116.0, ans=0.05 2023-06-25 08:57:08,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1951116.0, ans=0.125 2023-06-25 08:57:11,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1951176.0, ans=0.0 2023-06-25 08:57:12,749 INFO [train.py:996] (3/4) Epoch 11, batch 20250, loss[loss=0.1844, simple_loss=0.2907, pruned_loss=0.03908, over 19694.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3227, pruned_loss=0.08195, over 4275557.35 frames. 
], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:57:28,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1951176.0, ans=0.125 2023-06-25 08:57:33,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1951176.0, ans=0.0 2023-06-25 08:58:15,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-25 08:58:41,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1951356.0, ans=0.125 2023-06-25 08:58:43,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1951416.0, ans=0.125 2023-06-25 08:58:47,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1951416.0, ans=0.0 2023-06-25 08:58:52,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 7.038e+02 1.016e+03 1.334e+03 4.106e+03, threshold=2.032e+03, percent-clipped=11.0 2023-06-25 08:59:05,792 INFO [train.py:996] (3/4) Epoch 11, batch 20300, loss[loss=0.261, simple_loss=0.3616, pruned_loss=0.08026, over 20853.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3213, pruned_loss=0.07956, over 4266158.48 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:59:27,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1951536.0, ans=0.2 2023-06-25 08:59:38,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-25 08:59:44,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-25 08:59:57,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1951596.0, ans=0.125 2023-06-25 09:00:06,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1951656.0, ans=0.2 2023-06-25 09:00:46,141 INFO [train.py:996] (3/4) Epoch 11, batch 20350, loss[loss=0.192, simple_loss=0.2667, pruned_loss=0.0586, over 20015.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3212, pruned_loss=0.07999, over 4258894.53 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:00:48,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1951776.0, ans=10.0 2023-06-25 09:01:14,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1951836.0, ans=0.0 2023-06-25 09:01:20,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. 
limit=15.0 2023-06-25 09:02:20,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1952016.0, ans=0.125 2023-06-25 09:02:24,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.498e+02 1.071e+03 1.543e+03 3.638e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-25 09:02:31,877 INFO [train.py:996] (3/4) Epoch 11, batch 20400, loss[loss=0.2708, simple_loss=0.3381, pruned_loss=0.1018, over 21670.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3233, pruned_loss=0.08281, over 4256416.69 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:02:48,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1952076.0, ans=0.125 2023-06-25 09:02:56,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-25 09:02:57,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1952136.0, ans=0.125 2023-06-25 09:03:06,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1952136.0, ans=0.0 2023-06-25 09:04:03,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952316.0, ans=0.1 2023-06-25 09:04:11,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 09:04:11,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-25 09:04:16,688 INFO [train.py:996] (3/4) Epoch 11, batch 20450, loss[loss=0.2639, simple_loss=0.3277, pruned_loss=0.1, over 21552.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3232, pruned_loss=0.08415, over 4242522.00 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:04:38,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1952436.0, ans=0.125 2023-06-25 09:04:53,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1952496.0, ans=0.5 2023-06-25 09:04:55,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. 
2023-06-25 09:05:01,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1952496.0, ans=0.125
2023-06-25 09:05:19,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1952556.0, ans=0.07
2023-06-25 09:05:21,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952556.0, ans=0.125
2023-06-25 09:05:40,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1952616.0, ans=0.125
2023-06-25 09:05:55,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.393e+02 8.343e+02 1.181e+03 1.747e+03 3.039e+03, threshold=2.362e+03, percent-clipped=12.0
2023-06-25 09:06:02,394 INFO [train.py:996] (3/4) Epoch 11, batch 20500, loss[loss=0.2292, simple_loss=0.2929, pruned_loss=0.08272, over 21843.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3184, pruned_loss=0.08446, over 4246350.72 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:06:29,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1952736.0, ans=0.125
2023-06-25 09:06:35,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1952736.0, ans=0.0
2023-06-25 09:07:13,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952856.0, ans=0.1
2023-06-25 09:07:48,660 INFO [train.py:996] (3/4) Epoch 11, batch 20550, loss[loss=0.2677, simple_loss=0.3483, pruned_loss=0.09357, over 21565.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3106, pruned_loss=0.0827, over 4244114.91 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:09:19,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1953216.0, ans=0.125
2023-06-25 09:09:28,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 8.204e+02 1.449e+03 2.191e+03 5.725e+03, threshold=2.898e+03, percent-clipped=18.0
2023-06-25 09:09:31,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0
2023-06-25 09:09:32,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953216.0, ans=0.1
2023-06-25 09:09:40,412 INFO [train.py:996] (3/4) Epoch 11, batch 20600, loss[loss=0.2441, simple_loss=0.3052, pruned_loss=0.09152, over 21526.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3132, pruned_loss=0.08043, over 4242797.99 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:09:59,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1953336.0, ans=0.0
2023-06-25 09:10:07,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1953336.0, ans=0.2
2023-06-25 09:10:08,953 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 09:10:12,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953396.0, ans=0.1
2023-06-25 09:10:14,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1953396.0, ans=0.2
2023-06-25 09:10:43,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953456.0, ans=0.1
2023-06-25 09:11:26,268 INFO [train.py:996] (3/4) Epoch 11, batch 20650, loss[loss=0.203, simple_loss=0.2763, pruned_loss=0.06488, over 21738.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3098, pruned_loss=0.08068, over 4252271.16 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:11:30,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1953576.0, ans=0.125
2023-06-25 09:11:30,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0
2023-06-25 09:11:56,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1953636.0, ans=0.0
2023-06-25 09:12:34,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1953756.0, ans=0.125
2023-06-25 09:12:54,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1953816.0, ans=0.07
2023-06-25 09:13:04,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.039e+02 6.601e+02 8.640e+02 1.224e+03 2.485e+03, threshold=1.728e+03, percent-clipped=0.0
2023-06-25 09:13:16,526 INFO [train.py:996] (3/4) Epoch 11, batch 20700, loss[loss=0.2492, simple_loss=0.3283, pruned_loss=0.08506, over 21774.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3023, pruned_loss=0.07738, over 4256822.20 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:14:32,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1954056.0, ans=0.0
2023-06-25 09:14:41,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954056.0, ans=0.1
2023-06-25 09:15:07,986 INFO [train.py:996] (3/4) Epoch 11, batch 20750, loss[loss=0.2594, simple_loss=0.3529, pruned_loss=0.08289, over 21522.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3062, pruned_loss=0.07737, over 4257014.31 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:15:16,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1954176.0, ans=0.125
2023-06-25 09:15:38,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1954236.0, ans=0.07
2023-06-25 09:16:15,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0
2023-06-25 09:16:48,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.092e+02 1.287e+03 1.980e+03 4.706e+03, threshold=2.574e+03, percent-clipped=34.0
2023-06-25 09:16:48,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1954416.0, ans=0.125
2023-06-25 09:16:55,067 INFO [train.py:996] (3/4) Epoch 11, batch 20800, loss[loss=0.215, simple_loss=0.279, pruned_loss=0.07553, over 21188.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3098, pruned_loss=0.07856, over 4264102.81 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 32.0
2023-06-25 09:17:19,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1954536.0, ans=10.0
2023-06-25 09:17:29,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1954536.0, ans=0.035
2023-06-25 09:17:38,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1954596.0, ans=0.125
2023-06-25 09:18:40,331 INFO [train.py:996] (3/4) Epoch 11, batch 20850, loss[loss=0.2007, simple_loss=0.2737, pruned_loss=0.0638, over 21401.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3018, pruned_loss=0.0762, over 4260980.69 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:18:49,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954776.0, ans=0.1
2023-06-25 09:19:40,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1954896.0, ans=0.1
2023-06-25 09:19:50,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1954956.0, ans=0.125
2023-06-25 09:20:06,607 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 09:20:18,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1955016.0, ans=0.2
2023-06-25 09:20:20,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.513e+02 8.708e+02 1.139e+03 1.646e+03 3.626e+03, threshold=2.277e+03, percent-clipped=8.0
2023-06-25 09:20:25,801 INFO [train.py:996] (3/4) Epoch 11, batch 20900, loss[loss=0.2667, simple_loss=0.3358, pruned_loss=0.09876, over 21875.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3031, pruned_loss=0.07805, over 4265268.10 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:21:21,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1955196.0, ans=0.125
2023-06-25 09:22:08,605 INFO [train.py:996] (3/4) Epoch 11, batch 20950, loss[loss=0.2012, simple_loss=0.2826, pruned_loss=0.05989, over 21288.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3003, pruned_loss=0.07539, over 4265372.47 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:22:24,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1955436.0, ans=0.125
2023-06-25 09:22:41,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1955436.0, ans=0.125
2023-06-25 09:23:05,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0
2023-06-25 09:23:40,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.503e+02 1.270e+03 1.885e+03 4.065e+03, threshold=2.540e+03, percent-clipped=13.0
2023-06-25 09:23:41,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0
2023-06-25 09:23:45,525 INFO [train.py:996] (3/4) Epoch 11, batch 21000, loss[loss=0.2294, simple_loss=0.2988, pruned_loss=0.08004, over 21805.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3004, pruned_loss=0.07606, over 4253790.58 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:23:45,526 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-25 09:24:03,607 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2627, simple_loss=0.3591, pruned_loss=0.08313, over 1796401.00 frames.
2023-06-25 09:24:03,608 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-25 09:24:06,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0
2023-06-25 09:25:06,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1955856.0, ans=0.125
2023-06-25 09:25:32,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1955916.0, ans=0.0
2023-06-25 09:25:34,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0
2023-06-25 09:25:46,553 INFO [train.py:996] (3/4) Epoch 11, batch 21050, loss[loss=0.1868, simple_loss=0.25, pruned_loss=0.06185, over 21275.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2978, pruned_loss=0.07581, over 4261611.86 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:25:50,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1955976.0, ans=0.1
2023-06-25 09:26:13,987 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 09:26:17,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1956036.0, ans=0.1
2023-06-25 09:26:20,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1956036.0, ans=0.0
2023-06-25 09:27:04,820 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 09:27:27,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 6.581e+02 8.702e+02 1.278e+03 3.016e+03, threshold=1.740e+03, percent-clipped=3.0
2023-06-25 09:27:30,704 INFO [train.py:996] (3/4) Epoch 11, batch 21100, loss[loss=0.1907, simple_loss=0.2586, pruned_loss=0.06139, over 21477.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2942, pruned_loss=0.07515, over 4258716.85 frames. ], batch size: 132, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:27:39,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1956276.0, ans=0.125
2023-06-25 09:27:41,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1956276.0, ans=0.0
2023-06-25 09:27:41,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1956276.0, ans=0.07
2023-06-25 09:27:59,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1956336.0, ans=0.125
2023-06-25 09:28:40,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1956456.0, ans=0.0
2023-06-25 09:28:50,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1956456.0, ans=0.0
2023-06-25 09:29:15,543 INFO [train.py:996] (3/4) Epoch 11, batch 21150, loss[loss=0.2219, simple_loss=0.2759, pruned_loss=0.08394, over 21397.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.29, pruned_loss=0.07494, over 4254357.07 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:30:21,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0
2023-06-25 09:30:43,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1956816.0, ans=0.125
2023-06-25 09:30:53,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0
2023-06-25 09:30:55,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.371e+02 1.068e+03 1.667e+03 5.764e+03, threshold=2.137e+03, percent-clipped=24.0
2023-06-25 09:30:59,115 INFO [train.py:996] (3/4) Epoch 11, batch 21200, loss[loss=0.1786, simple_loss=0.2386, pruned_loss=0.05925, over 15535.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2876, pruned_loss=0.07422, over 4242020.72 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:31:12,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1956876.0, ans=0.125
2023-06-25 09:31:13,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0
2023-06-25 09:32:01,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1957056.0, ans=0.1
2023-06-25 09:32:36,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0
2023-06-25 09:32:38,401 INFO [train.py:996] (3/4) Epoch 11, batch 21250, loss[loss=0.2107, simple_loss=0.2912, pruned_loss=0.06505, over 21653.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2856, pruned_loss=0.07404, over 4248174.09 frames. ], batch size: 298, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:33:31,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2023-06-25 09:34:16,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 8.531e+02 1.344e+03 2.187e+03 4.666e+03, threshold=2.689e+03, percent-clipped=25.0
2023-06-25 09:34:18,281 INFO [train.py:996] (3/4) Epoch 11, batch 21300, loss[loss=0.2224, simple_loss=0.3012, pruned_loss=0.07178, over 21924.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2911, pruned_loss=0.0764, over 4251216.96 frames. ], batch size: 333, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:35:18,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0
2023-06-25 09:35:31,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1957656.0, ans=0.125
2023-06-25 09:35:42,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1957656.0, ans=0.0
2023-06-25 09:36:04,264 INFO [train.py:996] (3/4) Epoch 11, batch 21350, loss[loss=0.1922, simple_loss=0.2878, pruned_loss=0.04836, over 21773.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2957, pruned_loss=0.07664, over 4265662.76 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:36:07,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1957776.0, ans=0.125
2023-06-25 09:37:07,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. limit=10.0
2023-06-25 09:37:21,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1957956.0, ans=0.125
2023-06-25 09:37:43,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1958016.0, ans=0.2
2023-06-25 09:37:55,866 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.142e+02 1.027e+03 1.660e+03 3.891e+03, threshold=2.053e+03, percent-clipped=5.0
2023-06-25 09:37:57,539 INFO [train.py:996] (3/4) Epoch 11, batch 21400, loss[loss=0.2779, simple_loss=0.3417, pruned_loss=0.1071, over 21355.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.299, pruned_loss=0.07596, over 4270131.34 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:38:03,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0
2023-06-25 09:38:09,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0
2023-06-25 09:38:22,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1958136.0, ans=0.0
2023-06-25 09:38:37,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1958136.0, ans=0.0
2023-06-25 09:38:39,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1958196.0, ans=0.1
2023-06-25 09:39:31,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1958316.0, ans=0.125
2023-06-25 09:39:41,634 INFO [train.py:996] (3/4) Epoch 11, batch 21450, loss[loss=0.2618, simple_loss=0.3284, pruned_loss=0.09764, over 21799.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3032, pruned_loss=0.07816, over 4278764.57 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:40:10,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1958436.0, ans=0.125
2023-06-25 09:41:08,752 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 09:41:18,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.40 vs. limit=10.0
2023-06-25 09:41:25,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 7.277e+02 9.911e+02 1.372e+03 2.622e+03, threshold=1.982e+03, percent-clipped=4.0
2023-06-25 09:41:26,993 INFO [train.py:996] (3/4) Epoch 11, batch 21500, loss[loss=0.219, simple_loss=0.2812, pruned_loss=0.07839, over 21565.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3011, pruned_loss=0.07928, over 4282797.61 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:42:14,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1958796.0, ans=0.125
2023-06-25 09:42:35,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0
2023-06-25 09:43:11,044 INFO [train.py:996] (3/4) Epoch 11, batch 21550, loss[loss=0.1669, simple_loss=0.2372, pruned_loss=0.0483, over 21616.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2941, pruned_loss=0.07663, over 4267833.58 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 8.0
2023-06-25 09:44:12,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1959156.0, ans=0.125
2023-06-25 09:44:51,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.990e+02 1.429e+03 2.000e+03 5.379e+03, threshold=2.857e+03, percent-clipped=25.0
2023-06-25 09:44:53,165 INFO [train.py:996] (3/4) Epoch 11, batch 21600, loss[loss=0.2118, simple_loss=0.2807, pruned_loss=0.07142, over 21201.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2894, pruned_loss=0.07474, over 4273401.59 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:45:45,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1959396.0, ans=0.2
2023-06-25 09:45:48,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1959396.0, ans=0.125
2023-06-25 09:46:40,917 INFO [train.py:996] (3/4) Epoch 11, batch 21650, loss[loss=0.3032, simple_loss=0.3915, pruned_loss=0.1074, over 21535.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2939, pruned_loss=0.07337, over 4266968.60 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:46:51,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1959576.0, ans=0.125
2023-06-25 09:47:30,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1959696.0, ans=0.025
2023-06-25 09:47:41,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1959696.0, ans=0.125
2023-06-25 09:47:43,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1959696.0, ans=0.125
2023-06-25 09:47:55,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1959756.0, ans=0.09899494936611666
2023-06-25 09:48:25,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 8.539e+02 1.351e+03 1.899e+03 3.491e+03, threshold=2.702e+03, percent-clipped=7.0
2023-06-25 09:48:27,769 INFO [train.py:996] (3/4) Epoch 11, batch 21700, loss[loss=0.1963, simple_loss=0.306, pruned_loss=0.04325, over 19827.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2951, pruned_loss=0.07153, over 4267110.23 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:48:45,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1959876.0, ans=0.0
2023-06-25 09:50:12,969 INFO [train.py:996] (3/4) Epoch 11, batch 21750, loss[loss=0.215, simple_loss=0.2665, pruned_loss=0.08176, over 21504.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2923, pruned_loss=0.07213, over 4257367.80 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:50:24,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1960176.0, ans=0.125
2023-06-25 09:51:01,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0
2023-06-25 09:51:05,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1960296.0, ans=0.0
2023-06-25 09:51:07,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1960296.0, ans=0.05
2023-06-25 09:51:21,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1960356.0, ans=0.2
2023-06-25 09:51:26,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1960356.0, ans=0.2
2023-06-25 09:51:58,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 8.216e+02 1.100e+03 1.452e+03 3.027e+03, threshold=2.200e+03, percent-clipped=1.0
2023-06-25 09:51:58,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1960476.0, ans=0.125
2023-06-25 09:51:59,870 INFO [train.py:996] (3/4) Epoch 11, batch 21800, loss[loss=0.2759, simple_loss=0.3661, pruned_loss=0.09288, over 21845.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2898, pruned_loss=0.07302, over 4259257.35 frames. ], batch size: 317, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:52:44,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0
2023-06-25 09:52:45,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1960596.0, ans=10.0
2023-06-25 09:53:04,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0
2023-06-25 09:53:45,084 INFO [train.py:996] (3/4) Epoch 11, batch 21850, loss[loss=0.2399, simple_loss=0.3044, pruned_loss=0.08768, over 21826.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2941, pruned_loss=0.07326, over 4269262.09 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:54:32,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0
2023-06-25 09:54:40,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1960896.0, ans=0.1
2023-06-25 09:55:23,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1961016.0, ans=0.0
2023-06-25 09:55:27,834 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.360e+02 1.053e+03 1.458e+03 3.571e+03, threshold=2.107e+03, percent-clipped=7.0
2023-06-25 09:55:35,126 INFO [train.py:996] (3/4) Epoch 11, batch 21900, loss[loss=0.2637, simple_loss=0.3267, pruned_loss=0.1003, over 21800.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.296, pruned_loss=0.07446, over 4271767.64 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0
2023-06-25 09:55:35,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961076.0, ans=0.125
2023-06-25 09:56:10,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1961136.0, ans=0.0
2023-06-25 09:56:22,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1961196.0, ans=0.2
2023-06-25 09:57:20,405 INFO [train.py:996] (3/4) Epoch 11, batch 21950, loss[loss=0.1919, simple_loss=0.2443, pruned_loss=0.06972, over 20332.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2906, pruned_loss=0.07372, over 4269498.87 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 09:57:27,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1961376.0, ans=0.0
2023-06-25 09:57:51,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1961436.0, ans=0.0
2023-06-25 09:57:55,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0
2023-06-25 09:58:57,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.305e+02 6.629e+02 8.784e+02 1.230e+03 3.737e+03, threshold=1.757e+03, percent-clipped=5.0
2023-06-25 09:58:59,377 INFO [train.py:996] (3/4) Epoch 11, batch 22000, loss[loss=0.248, simple_loss=0.3516, pruned_loss=0.07223, over 21228.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2845, pruned_loss=0.07046, over 4274077.04 frames. ], batch size: 549, lr: 2.62e-03, grad_scale: 32.0
2023-06-25 09:59:10,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1961676.0, ans=0.125
2023-06-25 09:59:15,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961676.0, ans=0.1
2023-06-25 09:59:17,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0
2023-06-25 09:59:29,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961736.0, ans=0.1
2023-06-25 09:59:35,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0
2023-06-25 09:59:41,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1961736.0, ans=0.125
2023-06-25 10:00:50,096 INFO [train.py:996] (3/4) Epoch 11, batch 22050, loss[loss=0.2322, simple_loss=0.3116, pruned_loss=0.07633, over 21446.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2901, pruned_loss=0.07195, over 4270117.27 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:01:32,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1962096.0, ans=0.125
2023-06-25 10:02:37,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.423e+02 9.011e+02 1.269e+03 1.922e+03 5.194e+03, threshold=2.539e+03, percent-clipped=30.0
2023-06-25 10:02:37,425 INFO [train.py:996] (3/4) Epoch 11, batch 22100, loss[loss=0.2526, simple_loss=0.3224, pruned_loss=0.09143, over 21785.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2995, pruned_loss=0.07561, over 4256636.07 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:03:16,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1962396.0, ans=0.125
2023-06-25 10:03:20,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1962396.0, ans=0.125
2023-06-25 10:04:23,265 INFO [train.py:996] (3/4) Epoch 11, batch 22150, loss[loss=0.2404, simple_loss=0.3026, pruned_loss=0.08909, over 21562.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3034, pruned_loss=0.07767, over 4267529.70 frames. ], batch size: 195, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:04:39,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962576.0, ans=0.1
2023-06-25 10:04:48,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962636.0, ans=0.125
2023-06-25 10:05:32,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0
2023-06-25 10:05:59,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1962816.0, ans=0.125
2023-06-25 10:06:10,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.244e+02 8.641e+02 1.312e+03 2.175e+03 4.145e+03, threshold=2.624e+03, percent-clipped=16.0
2023-06-25 10:06:10,682 INFO [train.py:996] (3/4) Epoch 11, batch 22200, loss[loss=0.2317, simple_loss=0.3194, pruned_loss=0.07203, over 21366.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3047, pruned_loss=0.07857, over 4275707.31 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:06:36,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1962936.0, ans=0.125
2023-06-25 10:06:43,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1962936.0, ans=0.125
2023-06-25 10:07:10,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0
2023-06-25 10:07:22,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1963056.0, ans=0.1
2023-06-25 10:07:29,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0
2023-06-25 10:07:43,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0
2023-06-25 10:07:47,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0
2023-06-25 10:07:56,903 INFO [train.py:996] (3/4) Epoch 11, batch 22250, loss[loss=0.288, simple_loss=0.367, pruned_loss=0.1045, over 21755.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3111, pruned_loss=0.08022, over 4282156.63 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:08:12,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1963176.0, ans=0.125
2023-06-25 10:08:47,769 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-25 10:09:11,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1963356.0, ans=0.125
2023-06-25 10:09:14,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963356.0, ans=0.0
2023-06-25 10:09:30,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1963416.0, ans=0.025
2023-06-25 10:09:39,204 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0
2023-06-25 10:09:44,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.191e+02 1.032e+03 1.470e+03 3.757e+03, threshold=2.063e+03, percent-clipped=7.0
2023-06-25 10:09:44,587 INFO [train.py:996] (3/4) Epoch 11, batch 22300, loss[loss=0.2233, simple_loss=0.2824, pruned_loss=0.08206, over 21467.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3139, pruned_loss=0.08244, over 4283848.69 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:09:45,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0
2023-06-25 10:10:25,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1963596.0, ans=0.1
2023-06-25 10:10:44,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1963656.0, ans=0.5
2023-06-25 10:10:56,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0
2023-06-25 10:11:21,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963716.0, ans=0.0
2023-06-25 10:11:34,582 INFO [train.py:996] (3/4) Epoch 11, batch 22350, loss[loss=0.2058, simple_loss=0.2786, pruned_loss=0.06653, over 21260.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3115, pruned_loss=0.08274, over 4290432.98 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:11:40,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1963776.0, ans=0.1
2023-06-25 10:11:41,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1963776.0, ans=0.125
2023-06-25 10:11:43,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1963776.0, ans=0.125
2023-06-25 10:12:30,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1963896.0, ans=0.125
2023-06-25 10:12:32,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1963896.0, ans=0.0
2023-06-25 10:13:15,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. limit=10.0
2023-06-25 10:13:21,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 7.078e+02 9.389e+02 1.336e+03 2.790e+03, threshold=1.878e+03, percent-clipped=4.0
2023-06-25 10:13:21,920 INFO [train.py:996] (3/4) Epoch 11, batch 22400, loss[loss=0.195, simple_loss=0.2677, pruned_loss=0.0611, over 21551.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3087, pruned_loss=0.08008, over 4277904.62 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 32.0
2023-06-25 10:13:25,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1964076.0, ans=0.125
2023-06-25 10:14:29,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1964256.0, ans=0.125
2023-06-25 10:14:34,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0
2023-06-25 10:14:37,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1964256.0, ans=0.0
2023-06-25 10:14:47,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0
2023-06-25 10:14:57,498 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 10:15:02,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1964316.0, ans=0.125
2023-06-25 10:15:05,395 INFO [train.py:996] (3/4) Epoch 11, batch 22450, loss[loss=0.2094, simple_loss=0.2764, pruned_loss=0.07126, over 21371.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3023, pruned_loss=0.07851, over 4280388.87 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:15:23,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1964436.0, ans=0.0
2023-06-25 10:16:00,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1964496.0, ans=0.1
2023-06-25 10:16:16,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1964556.0, ans=0.0
2023-06-25 10:16:53,972 INFO [train.py:996] (3/4) Epoch 11, batch 22500, loss[loss=0.2539, simple_loss=0.3447, pruned_loss=0.0816, over 21486.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.299, pruned_loss=0.078, over 4284463.63 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:16:55,659 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 7.459e+02 1.048e+03 1.318e+03 4.554e+03, threshold=2.097e+03, percent-clipped=12.0
2023-06-25 10:17:02,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1964676.0, ans=0.125
2023-06-25 10:17:24,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1964736.0, ans=0.07
2023-06-25 10:17:26,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1964736.0, ans=0.95
2023-06-25 10:17:52,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0
2023-06-25 10:18:10,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1964856.0, ans=0.1
2023-06-25 10:18:41,428 INFO [train.py:996] (3/4) Epoch 11, batch 22550, loss[loss=0.2127, simple_loss=0.2838, pruned_loss=0.07074, over 21810.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3013, pruned_loss=0.07794, over 4290023.92 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:18:44,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1964976.0, ans=0.0
2023-06-25 10:19:23,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1965036.0, ans=0.07
2023-06-25 10:20:10,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1965156.0, ans=0.2
2023-06-25 10:20:36,192 INFO [train.py:996] (3/4) Epoch 11, batch 22600, loss[loss=0.2375, simple_loss=0.3199, pruned_loss=0.07755, over 21758.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3052, pruned_loss=0.07842, over 4289172.82 frames. ], batch size: 351, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:20:38,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1965276.0, ans=0.125
2023-06-25 10:20:39,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.096e+02 1.052e+03 1.426e+03 2.192e+03 4.902e+03, threshold=2.852e+03, percent-clipped=27.0
2023-06-25 10:20:59,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1965336.0, ans=0.07
2023-06-25 10:21:00,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1965336.0, ans=0.0
2023-06-25 10:21:05,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1965336.0, ans=0.125
2023-06-25 10:21:19,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1965396.0, ans=0.125
2023-06-25 10:21:24,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1965396.0, ans=0.125
2023-06-25 10:22:17,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1965516.0, ans=0.125
2023-06-25 10:22:20,049 INFO [train.py:996] (3/4) Epoch 11, batch 22650, loss[loss=0.213, simple_loss=0.2715, pruned_loss=0.07727, over 21444.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3015, pruned_loss=0.07843, over 4284680.16 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:22:26,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1965576.0, ans=0.0
2023-06-25 10:22:45,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1965636.0, ans=0.2
2023-06-25 10:22:57,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1965696.0, ans=0.125
2023-06-25 10:23:15,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1965696.0, ans=0.125
2023-06-25 10:23:18,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1965756.0, ans=0.025
2023-06-25 10:23:22,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0
2023-06-25 10:23:50,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1965816.0, ans=0.125
2023-06-25 10:24:02,889 INFO [train.py:996] (3/4) Epoch 11, batch 22700, loss[loss=0.1825, simple_loss=0.253, pruned_loss=0.05601, over 21811.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2952, pruned_loss=0.07823, over 4281700.33 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:24:04,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1965876.0, ans=0.0
2023-06-25 10:24:06,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.606e+02 1.053e+03 1.643e+03 3.332e+03, threshold=2.107e+03, percent-clipped=4.0
2023-06-25 10:24:11,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1965876.0, ans=0.125
2023-06-25 10:24:38,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1965936.0, ans=0.2
2023-06-25 10:25:26,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1966116.0, ans=0.125
2023-06-25 10:25:31,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1966116.0, ans=0.125
2023-06-25 10:25:50,168 INFO [train.py:996] (3/4) Epoch 11, batch 22750, loss[loss=0.2233, simple_loss=0.2963, pruned_loss=0.0751, over 21936.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2988, pruned_loss=0.07913, over 4269648.21 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:25:55,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1966176.0, ans=0.1
2023-06-25 10:26:20,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1966236.0, ans=0.95
2023-06-25 10:26:36,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1966296.0, ans=0.0
2023-06-25 10:27:04,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2023-06-25 10:27:09,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1966356.0, ans=0.125
2023-06-25 10:27:40,872 INFO [train.py:996] (3/4) Epoch 11, batch 22800, loss[loss=0.2021, simple_loss=0.2546, pruned_loss=0.07479, over 20757.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3024, pruned_loss=0.08066, over 4268224.34 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:27:44,251 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 8.724e+02 1.380e+03 2.378e+03 6.132e+03, threshold=2.761e+03, percent-clipped=34.0
2023-06-25 10:27:45,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5
2023-06-25 10:28:30,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1966596.0, ans=22.5
2023-06-25 10:28:56,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1966716.0, ans=0.0
2023-06-25 10:29:24,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1966776.0, ans=0.125
2023-06-25 10:29:25,742 INFO [train.py:996] (3/4) Epoch 11, batch 22850, loss[loss=0.2418, simple_loss=0.3048, pruned_loss=0.08939, over 21531.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2998, pruned_loss=0.08046, over 4276681.79 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:30:12,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0
2023-06-25 10:30:55,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1967016.0, ans=0.125
2023-06-25 10:31:12,051 INFO [train.py:996] (3/4) Epoch 11, batch 22900, loss[loss=0.2314, simple_loss=0.3464, pruned_loss=0.05816, over 21651.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.301, pruned_loss=0.07968, over 4270480.89 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:31:15,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.842e+02 1.024e+03 1.500e+03 4.089e+03, threshold=2.047e+03, percent-clipped=2.0
2023-06-25 10:31:20,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0
2023-06-25 10:31:24,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.01 vs. limit=10.0
2023-06-25 10:31:24,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1967076.0, ans=0.125
2023-06-25 10:31:26,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1967076.0, ans=0.125
2023-06-25 10:31:30,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1967136.0, ans=0.0
2023-06-25 10:31:53,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1967196.0, ans=0.0
2023-06-25 10:32:00,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1967196.0, ans=0.125
2023-06-25 10:32:36,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0
2023-06-25 10:32:59,795 INFO [train.py:996] (3/4) Epoch 11, batch 22950, loss[loss=0.2591, simple_loss=0.3789, pruned_loss=0.06958, over 21743.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3129, pruned_loss=0.07878, over 4264427.04 frames. ], batch size: 332, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:33:03,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1967376.0, ans=0.125
2023-06-25 10:33:14,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1967436.0, ans=0.125
2023-06-25 10:33:34,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1967436.0, ans=0.125
2023-06-25 10:33:57,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-25 10:34:44,062 INFO [train.py:996] (3/4) Epoch 11, batch 23000, loss[loss=0.2278, simple_loss=0.3012, pruned_loss=0.07722, over 21900.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3109, pruned_loss=0.07702, over 4269786.03 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:34:47,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.236e+02 8.204e+02 1.340e+03 2.035e+03 4.542e+03, threshold=2.680e+03, percent-clipped=23.0
2023-06-25 10:34:50,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1967676.0, ans=0.1
2023-06-25 10:36:31,923 INFO [train.py:996] (3/4) Epoch 11, batch 23050, loss[loss=0.247, simple_loss=0.3184, pruned_loss=0.08776, over 21821.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3136, pruned_loss=0.07959, over 4270866.17 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:37:29,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1968096.0, ans=0.09899494936611666
2023-06-25 10:37:42,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1968096.0, ans=0.0
2023-06-25 10:37:54,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1968156.0, ans=0.2
2023-06-25 10:38:18,117 INFO [train.py:996] (3/4) Epoch 11, batch 23100, loss[loss=0.1989, simple_loss=0.2661, pruned_loss=0.06585, over 21945.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3097, pruned_loss=0.07927, over 4274514.99 frames. ], batch size: 113, lr: 2.62e-03, grad_scale: 16.0
2023-06-25 10:38:29,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.702e+02 7.516e+02 1.022e+03 1.433e+03 4.307e+03, threshold=2.044e+03, percent-clipped=3.0
2023-06-25 10:38:33,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1968276.0, ans=0.2
2023-06-25 10:38:54,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1968336.0, ans=0.125
2023-06-25 10:40:00,056 INFO [train.py:996] (3/4) Epoch 11, batch 23150, loss[loss=0.241, simple_loss=0.3125, pruned_loss=0.08473, over 21842.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3035, pruned_loss=0.07816, over 4279702.59 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 8.0
2023-06-25 10:40:28,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1968636.0, ans=0.125
2023-06-25 10:40:29,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0
2023-06-25 10:40:41,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0
2023-06-25 10:41:19,748 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-25 10:41:37,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1968876.0, ans=0.125
2023-06-25 10:41:38,199 INFO [train.py:996] (3/4) Epoch 11, batch 23200, loss[loss=0.1884, simple_loss=0.2608, pruned_loss=0.05798, over 21895.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3017, pruned_loss=0.07841, over 4274231.65 frames. ], batch size: 283, lr: 2.62e-03, grad_scale: 16.0
], batch size: 283, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:41:43,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.976e+02 7.986e+02 1.089e+03 1.684e+03 3.717e+03, threshold=2.178e+03, percent-clipped=18.0 2023-06-25 10:41:57,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1968876.0, ans=0.0 2023-06-25 10:42:18,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1968996.0, ans=0.0 2023-06-25 10:42:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1969056.0, ans=0.125 2023-06-25 10:43:30,215 INFO [train.py:996] (3/4) Epoch 11, batch 23250, loss[loss=0.261, simple_loss=0.3274, pruned_loss=0.09726, over 21842.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3016, pruned_loss=0.07948, over 4283467.52 frames. ], batch size: 351, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:43:41,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 10:43:50,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1969236.0, ans=0.125 2023-06-25 10:44:20,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1969296.0, ans=15.0 2023-06-25 10:45:17,579 INFO [train.py:996] (3/4) Epoch 11, batch 23300, loss[loss=0.238, simple_loss=0.3341, pruned_loss=0.0709, over 21454.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3096, pruned_loss=0.08154, over 4291838.18 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:45:22,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.331e+02 7.944e+02 1.056e+03 1.535e+03 4.546e+03, threshold=2.112e+03, percent-clipped=10.0 2023-06-25 10:45:48,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1969536.0, ans=0.0 2023-06-25 10:46:34,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-25 10:46:49,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1969716.0, ans=0.2 2023-06-25 10:46:51,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-25 10:47:03,362 INFO [train.py:996] (3/4) Epoch 11, batch 23350, loss[loss=0.2179, simple_loss=0.2939, pruned_loss=0.07094, over 21349.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3124, pruned_loss=0.08007, over 4294396.37 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:47:03,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1969776.0, ans=0.05 2023-06-25 10:47:25,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969776.0, ans=0.1 2023-06-25 10:47:40,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=15.0 2023-06-25 10:47:43,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1969836.0, ans=0.125 2023-06-25 10:47:49,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1969836.0, ans=0.0 2023-06-25 10:48:00,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1969896.0, ans=0.125 2023-06-25 10:48:27,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-25 10:48:32,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1970016.0, ans=0.07 2023-06-25 10:48:39,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-25 10:48:54,014 INFO [train.py:996] (3/4) Epoch 11, batch 23400, loss[loss=0.2036, simple_loss=0.2584, pruned_loss=0.07444, over 20017.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3066, pruned_loss=0.07701, over 4290642.32 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:49:05,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1970076.0, ans=0.125 2023-06-25 10:49:07,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 8.471e+02 1.302e+03 1.874e+03 3.604e+03, threshold=2.604e+03, percent-clipped=20.0 2023-06-25 10:49:39,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970196.0, ans=0.1 2023-06-25 10:49:46,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-25 10:50:05,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1970256.0, ans=0.0 2023-06-25 10:50:25,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1970316.0, ans=0.125 2023-06-25 10:50:48,343 INFO [train.py:996] (3/4) Epoch 11, batch 23450, loss[loss=0.2228, simple_loss=0.2911, pruned_loss=0.07728, over 21589.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.31, pruned_loss=0.07934, over 4288163.44 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:51:00,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1970376.0, ans=0.2 2023-06-25 10:51:42,451 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:51:59,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1970556.0, ans=0.0 2023-06-25 10:52:06,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1970616.0, ans=0.0 2023-06-25 10:52:34,805 INFO [train.py:996] (3/4) Epoch 11, batch 23500, loss[loss=0.2084, simple_loss=0.2779, pruned_loss=0.0695, over 21173.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3112, pruned_loss=0.08132, over 4293478.64 frames. 
], batch size: 608, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:52:41,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 8.381e+02 1.197e+03 1.768e+03 4.081e+03, threshold=2.394e+03, percent-clipped=6.0 2023-06-25 10:52:52,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1970676.0, ans=0.0 2023-06-25 10:52:57,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1970736.0, ans=0.2 2023-06-25 10:53:11,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1970796.0, ans=0.125 2023-06-25 10:53:13,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1970796.0, ans=0.0 2023-06-25 10:53:32,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1970796.0, ans=0.1 2023-06-25 10:53:37,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-25 10:54:19,574 INFO [train.py:996] (3/4) Epoch 11, batch 23550, loss[loss=0.1935, simple_loss=0.2594, pruned_loss=0.0638, over 21739.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3062, pruned_loss=0.08133, over 4277544.15 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:55:16,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 10:55:17,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971096.0, ans=0.1 2023-06-25 10:55:29,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1971156.0, ans=0.2 2023-06-25 10:56:04,918 INFO [train.py:996] (3/4) Epoch 11, batch 23600, loss[loss=0.2253, simple_loss=0.2993, pruned_loss=0.07564, over 21619.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3063, pruned_loss=0.08018, over 4277073.47 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:56:17,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.780e+02 1.013e+03 1.475e+03 2.570e+03, threshold=2.026e+03, percent-clipped=2.0 2023-06-25 10:57:17,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1971456.0, ans=0.125 2023-06-25 10:57:24,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 10:57:56,464 INFO [train.py:996] (3/4) Epoch 11, batch 23650, loss[loss=0.2304, simple_loss=0.3054, pruned_loss=0.07773, over 21476.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3057, pruned_loss=0.07824, over 4284566.95 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:58:12,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.88 vs. 
limit=15.0 2023-06-25 10:58:16,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1971636.0, ans=0.2 2023-06-25 10:59:44,376 INFO [train.py:996] (3/4) Epoch 11, batch 23700, loss[loss=0.2058, simple_loss=0.2833, pruned_loss=0.06415, over 21691.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3069, pruned_loss=0.07832, over 4282048.04 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:59:56,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.722e+02 1.155e+03 1.933e+03 4.444e+03, threshold=2.311e+03, percent-clipped=20.0 2023-06-25 11:00:45,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1971996.0, ans=0.05 2023-06-25 11:00:56,104 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:01:40,528 INFO [train.py:996] (3/4) Epoch 11, batch 23750, loss[loss=0.1988, simple_loss=0.2885, pruned_loss=0.05453, over 21269.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3103, pruned_loss=0.07896, over 4279599.58 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:01:44,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972176.0, ans=0.125 2023-06-25 11:01:49,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1972176.0, ans=0.125 2023-06-25 11:02:48,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-25 11:02:57,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972356.0, ans=0.1 2023-06-25 11:03:09,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=15.0 2023-06-25 11:03:19,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-25 11:03:22,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-25 11:03:28,669 INFO [train.py:996] (3/4) Epoch 11, batch 23800, loss[loss=0.3743, simple_loss=0.4363, pruned_loss=0.1562, over 21430.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3099, pruned_loss=0.07744, over 4280506.49 frames. 
], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:03:34,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1972476.0, ans=0.125 2023-06-25 11:03:35,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.508e+02 9.725e+02 1.368e+03 2.347e+03 4.369e+03, threshold=2.737e+03, percent-clipped=25.0 2023-06-25 11:03:48,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1972476.0, ans=0.125 2023-06-25 11:04:54,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1972656.0, ans=0.0 2023-06-25 11:05:16,752 INFO [train.py:996] (3/4) Epoch 11, batch 23850, loss[loss=0.2807, simple_loss=0.3543, pruned_loss=0.1036, over 21500.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3167, pruned_loss=0.07913, over 4284204.84 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:05:38,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1972776.0, ans=0.1 2023-06-25 11:05:38,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1972776.0, ans=0.125 2023-06-25 11:05:43,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1972776.0, ans=0.0 2023-06-25 11:05:48,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 11:05:59,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1972836.0, ans=0.1 2023-06-25 11:06:33,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1972956.0, ans=0.0 2023-06-25 11:06:36,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1972956.0, ans=0.125 2023-06-25 11:06:38,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-25 11:07:14,231 INFO [train.py:996] (3/4) Epoch 11, batch 23900, loss[loss=0.1959, simple_loss=0.2715, pruned_loss=0.06012, over 21448.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3223, pruned_loss=0.08087, over 4279914.77 frames. 
], batch size: 212, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:07:19,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1973076.0, ans=0.2 2023-06-25 11:07:20,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.632e+02 1.020e+03 1.662e+03 2.575e+03 5.101e+03, threshold=3.324e+03, percent-clipped=22.0 2023-06-25 11:07:21,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1973076.0, ans=0.125 2023-06-25 11:07:24,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973076.0, ans=0.1 2023-06-25 11:07:50,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1973196.0, ans=0.0 2023-06-25 11:08:30,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:08:54,232 INFO [train.py:996] (3/4) Epoch 11, batch 23950, loss[loss=0.2442, simple_loss=0.3128, pruned_loss=0.08782, over 21178.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3174, pruned_loss=0.08051, over 4269254.48 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:09:18,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1973376.0, ans=0.0 2023-06-25 11:09:20,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-25 11:09:27,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 11:10:06,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973556.0, ans=0.1 2023-06-25 11:10:21,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1973616.0, ans=0.0 2023-06-25 11:10:23,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1973616.0, ans=0.0 2023-06-25 11:10:31,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1973616.0, ans=0.125 2023-06-25 11:10:47,910 INFO [train.py:996] (3/4) Epoch 11, batch 24000, loss[loss=0.2843, simple_loss=0.3649, pruned_loss=0.1018, over 21832.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3185, pruned_loss=0.08324, over 4272219.59 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:10:47,911 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 11:11:07,124 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.263, simple_loss=0.3578, pruned_loss=0.08405, over 1796401.00 frames. 
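The [optim.py:471] entries above report, over a window of recent batches, the quartiles of the total gradient norm, the clipping threshold, and the percentage of batches whose gradients were clipped. In every entry the threshold is exactly Clipping_scale times the median quartile (for example 2.0 x 1.662e+03 = 3.324e+03 in the 11:07:20 entry), i.e. gradients are clipped at twice the running median norm. A minimal sketch of that bookkeeping, assuming a fixed-size window of per-batch norms; the class and method names are illustrative, not the actual optim.py API:

from collections import deque

import torch


class GradNormTracker:
    # Sketch of the statistics behind the "[optim.py:471] Clipping_scale=...,
    # grad-norm quartiles ..., threshold=..., percent-clipped=..." lines.
    # The real optimizer also rescales gradients that exceed the threshold;
    # this sketch only tracks and reports the numbers.

    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch gradient norms
        self.num_clipped = 0
        self.num_seen = 0

    def threshold(self) -> float:
        # Clip at clipping_scale times the running median gradient norm.
        med = torch.median(torch.tensor(list(self.norms))).item()
        return self.clipping_scale * med

    def record(self, model: torch.nn.Module) -> None:
        # Total gradient norm over all parameters for the current batch.
        norm = torch.norm(
            torch.stack([p.grad.norm() for p in model.parameters()
                         if p.grad is not None])
        ).item()
        self.norms.append(norm)
        self.num_seen += 1
        if norm > self.threshold():
            self.num_clipped += 1

    def summary(self) -> str:
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
        pct = 100.0 * self.num_clipped / max(1, self.num_seen)
        return (f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                f"{quartiles}, threshold={self.threshold():.3e}, "
                f"percent-clipped={pct:.1f}")

Read this way, a high percent-clipped value (e.g. 31.0 at batch 24600) simply means that an unusually large fraction of recent batches had a gradient norm above twice the running median.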
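The ubiquitous [scaling.py:182] ScheduledFloat entries print the current value (ans) of a named hyperparameter, such as a balancer probability or a skip rate, as a function of batch_count. A plausible reconstruction, assuming the schedule is a piecewise-linear interpolation between (batch_count, value) breakpoints, clamped outside them; the breakpoint numbers below are illustrative, not taken from this run:

import bisect


def scheduled_float(batch_count: float,
                    points: list[tuple[float, float]]) -> float:
    # Piecewise-linear schedule over the training batch index; `points`
    # is a sorted list of (batch_count, value) breakpoints, and the value
    # is clamped to the end values outside the covered range.  This is a
    # sketch of what the "ans=" field in the ScheduledFloat lines reports.
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, batch_count)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)


# Illustrative: a probability that decays from 0.3 to 0.125 over the first
# 20k batches; by batch_count=1973736 it sits flat at its final value,
# consistent with the many "ans=0.125" entries this late in training.
assert scheduled_float(1973736.0, [(0.0, 0.3), (20000.0, 0.125)]) == 0.125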
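Since the per-batch summaries follow a fixed format, the loss curve can be scraped straight out of a log like this one. A small sketch; the train.log path in the usage comment is illustrative:

import re

# Matches the train.py summaries above, e.g.
# "Epoch 11, batch 24000, loss[...], tot_loss[loss=0.2425, ...]".
# Validation lines ("Epoch 11, validation: ...") have no batch index
# and are skipped.
SUMMARY_RE = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<tot_loss>[\d.]+)"
)


def tot_loss_curve(log_text: str) -> list[tuple[int, int, float]]:
    # Extract (epoch, batch, tot_loss) triples from a training log.
    return [
        (int(m["epoch"]), int(m["batch"]), float(m["tot_loss"]))
        for m in SUMMARY_RE.finditer(log_text)
    ]


# e.g. tot_loss_curve(open("train.log").read())[:2]
# -> [(11, 22900, 0.2302), (11, 22950, 0.2352)]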
2023-06-25 11:11:07,124 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 11:11:14,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 7.509e+02 1.143e+03 1.580e+03 3.381e+03, threshold=2.286e+03, percent-clipped=1.0 2023-06-25 11:11:42,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1973736.0, ans=0.0 2023-06-25 11:11:47,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1973796.0, ans=0.0 2023-06-25 11:11:48,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-25 11:11:56,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1973796.0, ans=0.125 2023-06-25 11:11:56,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1973796.0, ans=0.07 2023-06-25 11:12:54,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1973976.0, ans=0.125 2023-06-25 11:12:55,326 INFO [train.py:996] (3/4) Epoch 11, batch 24050, loss[loss=0.2549, simple_loss=0.3454, pruned_loss=0.08226, over 21488.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.32, pruned_loss=0.08318, over 4282081.71 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:12:58,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1973976.0, ans=0.125 2023-06-25 11:13:29,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974036.0, ans=0.1 2023-06-25 11:13:32,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1974036.0, ans=0.025 2023-06-25 11:14:02,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1974156.0, ans=0.125 2023-06-25 11:14:18,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1974156.0, ans=0.0 2023-06-25 11:14:19,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1974156.0, ans=0.125 2023-06-25 11:14:34,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1974216.0, ans=0.125 2023-06-25 11:14:36,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1974216.0, ans=0.125 2023-06-25 11:14:43,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974276.0, ans=0.1 2023-06-25 11:14:43,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1974276.0, ans=0.0 2023-06-25 11:14:44,586 INFO [train.py:996] (3/4) Epoch 11, batch 24100, loss[loss=0.2644, simple_loss=0.3421, pruned_loss=0.09339, over 21601.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3183, pruned_loss=0.08127, over 4279380.51 frames. 
], batch size: 389, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:14:49,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1974276.0, ans=0.125 2023-06-25 11:14:50,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.104e+02 8.872e+02 1.198e+03 1.771e+03 4.014e+03, threshold=2.396e+03, percent-clipped=16.0 2023-06-25 11:16:29,682 INFO [train.py:996] (3/4) Epoch 11, batch 24150, loss[loss=0.2347, simple_loss=0.3006, pruned_loss=0.08439, over 21850.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3197, pruned_loss=0.08306, over 4281709.25 frames. ], batch size: 298, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:16:48,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1974576.0, ans=0.0 2023-06-25 11:16:49,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1974576.0, ans=0.0 2023-06-25 11:17:47,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. limit=10.0 2023-06-25 11:18:20,276 INFO [train.py:996] (3/4) Epoch 11, batch 24200, loss[loss=0.2306, simple_loss=0.3095, pruned_loss=0.07588, over 21506.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.322, pruned_loss=0.08453, over 4279376.00 frames. ], batch size: 195, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:18:34,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.944e+02 9.608e+02 1.226e+03 1.956e+03 3.417e+03, threshold=2.452e+03, percent-clipped=15.0 2023-06-25 11:19:21,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1974996.0, ans=0.125 2023-06-25 11:19:23,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-25 11:19:27,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1975056.0, ans=0.125 2023-06-25 11:19:30,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-25 11:19:50,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1975116.0, ans=0.0 2023-06-25 11:20:09,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1975116.0, ans=0.0 2023-06-25 11:20:15,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1975176.0, ans=0.125 2023-06-25 11:20:16,263 INFO [train.py:996] (3/4) Epoch 11, batch 24250, loss[loss=0.2578, simple_loss=0.3302, pruned_loss=0.09269, over 21478.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3189, pruned_loss=0.07937, over 4272366.71 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:20:59,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. 
limit=12.0 2023-06-25 11:21:01,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1975296.0, ans=0.125 2023-06-25 11:21:21,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-25 11:21:23,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1975356.0, ans=0.1 2023-06-25 11:22:04,802 INFO [train.py:996] (3/4) Epoch 11, batch 24300, loss[loss=0.1945, simple_loss=0.2783, pruned_loss=0.05533, over 21743.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3119, pruned_loss=0.07355, over 4275436.05 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:22:12,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.478e+02 1.137e+03 1.748e+03 3.902e+03, threshold=2.274e+03, percent-clipped=10.0 2023-06-25 11:23:03,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1975596.0, ans=0.125 2023-06-25 11:23:07,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0 2023-06-25 11:23:10,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1975656.0, ans=0.125 2023-06-25 11:23:11,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-25 11:23:31,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1975716.0, ans=0.0 2023-06-25 11:23:49,702 INFO [train.py:996] (3/4) Epoch 11, batch 24350, loss[loss=0.2632, simple_loss=0.3278, pruned_loss=0.09933, over 21806.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3072, pruned_loss=0.07217, over 4278764.00 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:24:19,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1975836.0, ans=0.125 2023-06-25 11:24:56,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1975956.0, ans=0.125 2023-06-25 11:25:42,074 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:25:43,128 INFO [train.py:996] (3/4) Epoch 11, batch 24400, loss[loss=0.2299, simple_loss=0.3136, pruned_loss=0.0731, over 21685.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3104, pruned_loss=0.0757, over 4279128.11 frames. 
], batch size: 247, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:26:00,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+02 8.688e+02 1.209e+03 1.955e+03 3.228e+03, threshold=2.419e+03, percent-clipped=16.0 2023-06-25 11:26:01,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1976076.0, ans=0.2 2023-06-25 11:26:08,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1976136.0, ans=0.125 2023-06-25 11:26:28,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1976196.0, ans=0.125 2023-06-25 11:26:35,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1976196.0, ans=0.125 2023-06-25 11:27:19,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.66 vs. limit=10.0 2023-06-25 11:27:36,880 INFO [train.py:996] (3/4) Epoch 11, batch 24450, loss[loss=0.2306, simple_loss=0.3253, pruned_loss=0.06792, over 21614.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.314, pruned_loss=0.07806, over 4281673.65 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:28:27,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1976496.0, ans=0.125 2023-06-25 11:29:25,316 INFO [train.py:996] (3/4) Epoch 11, batch 24500, loss[loss=0.2097, simple_loss=0.2959, pruned_loss=0.06173, over 21890.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3148, pruned_loss=0.07837, over 4284347.23 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:29:27,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976676.0, ans=0.1 2023-06-25 11:29:34,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.294e+02 9.026e+02 1.332e+03 4.707e+03, threshold=1.805e+03, percent-clipped=7.0 2023-06-25 11:29:56,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1976736.0, ans=0.0 2023-06-25 11:31:00,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976916.0, ans=0.1 2023-06-25 11:31:11,553 INFO [train.py:996] (3/4) Epoch 11, batch 24550, loss[loss=0.231, simple_loss=0.3095, pruned_loss=0.07622, over 20764.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3183, pruned_loss=0.07992, over 4280330.52 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:31:33,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1977036.0, ans=0.0 2023-06-25 11:31:39,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1977036.0, ans=0.015 2023-06-25 11:32:00,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1977096.0, ans=0.125 2023-06-25 11:32:27,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. 
limit=15.0 2023-06-25 11:33:02,808 INFO [train.py:996] (3/4) Epoch 11, batch 24600, loss[loss=0.2221, simple_loss=0.2893, pruned_loss=0.07744, over 21836.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3159, pruned_loss=0.08098, over 4279451.99 frames. ], batch size: 317, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:33:05,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1977276.0, ans=0.1 2023-06-25 11:33:13,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.421e+02 1.303e+03 2.147e+03 3.735e+03, threshold=2.606e+03, percent-clipped=31.0 2023-06-25 11:33:24,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1977336.0, ans=0.125 2023-06-25 11:33:44,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1977396.0, ans=0.125 2023-06-25 11:34:26,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-25 11:34:51,931 INFO [train.py:996] (3/4) Epoch 11, batch 24650, loss[loss=0.1838, simple_loss=0.2533, pruned_loss=0.05721, over 21585.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3092, pruned_loss=0.08087, over 4277877.43 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:34:54,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 11:35:58,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1977756.0, ans=0.125 2023-06-25 11:36:36,907 INFO [train.py:996] (3/4) Epoch 11, batch 24700, loss[loss=0.2295, simple_loss=0.2949, pruned_loss=0.08203, over 21442.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3066, pruned_loss=0.07959, over 4277665.59 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:36:46,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.759e+02 8.060e+02 1.267e+03 1.761e+03 3.816e+03, threshold=2.533e+03, percent-clipped=4.0 2023-06-25 11:37:30,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1978056.0, ans=0.0 2023-06-25 11:37:58,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1978116.0, ans=0.125 2023-06-25 11:37:59,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1978116.0, ans=0.125 2023-06-25 11:38:15,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-25 11:38:17,803 INFO [train.py:996] (3/4) Epoch 11, batch 24750, loss[loss=0.2014, simple_loss=0.262, pruned_loss=0.0704, over 21321.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2979, pruned_loss=0.07654, over 4279444.85 frames. 
], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:38:28,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1978176.0, ans=0.125 2023-06-25 11:38:43,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-25 11:39:09,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 11:39:30,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1978356.0, ans=0.0 2023-06-25 11:39:44,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1978416.0, ans=0.2 2023-06-25 11:39:57,346 INFO [train.py:996] (3/4) Epoch 11, batch 24800, loss[loss=0.2849, simple_loss=0.3225, pruned_loss=0.1237, over 21545.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2931, pruned_loss=0.07642, over 4288881.43 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 11:40:11,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1978476.0, ans=0.2 2023-06-25 11:40:11,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1978476.0, ans=0.125 2023-06-25 11:40:14,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 6.530e+02 8.946e+02 1.365e+03 3.586e+03, threshold=1.789e+03, percent-clipped=4.0 2023-06-25 11:40:44,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1978596.0, ans=0.0 2023-06-25 11:41:04,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1978656.0, ans=0.0 2023-06-25 11:41:17,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1978656.0, ans=0.125 2023-06-25 11:41:48,315 INFO [train.py:996] (3/4) Epoch 11, batch 24850, loss[loss=0.1883, simple_loss=0.254, pruned_loss=0.06132, over 20104.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2938, pruned_loss=0.07761, over 4294220.67 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:41:52,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978776.0, ans=0.1 2023-06-25 11:41:55,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1978776.0, ans=0.2 2023-06-25 11:42:44,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1978956.0, ans=0.025 2023-06-25 11:43:05,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1978956.0, ans=0.125 2023-06-25 11:43:14,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-25 11:43:34,973 INFO [train.py:996] (3/4) Epoch 11, batch 24900, loss[loss=0.2061, simple_loss=0.2597, pruned_loss=0.07627, over 20346.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.2967, pruned_loss=0.07896, over 4286807.62 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:43:38,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1979076.0, ans=0.125 2023-06-25 11:43:47,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.897e+02 9.704e+02 1.419e+03 1.998e+03 4.449e+03, threshold=2.839e+03, percent-clipped=31.0 2023-06-25 11:44:10,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1979136.0, ans=0.2 2023-06-25 11:44:37,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-25 11:45:02,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1979316.0, ans=10.0 2023-06-25 11:45:08,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979316.0, ans=0.1 2023-06-25 11:45:08,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1979316.0, ans=0.125 2023-06-25 11:45:12,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1979316.0, ans=0.125 2023-06-25 11:45:15,024 INFO [train.py:996] (3/4) Epoch 11, batch 24950, loss[loss=0.278, simple_loss=0.3431, pruned_loss=0.1065, over 21116.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3062, pruned_loss=0.08366, over 4289219.85 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:45:23,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1979376.0, ans=0.125 2023-06-25 11:45:43,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1979436.0, ans=0.05 2023-06-25 11:45:49,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.53 vs. limit=22.5 2023-06-25 11:45:53,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1979436.0, ans=0.1 2023-06-25 11:46:17,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1979496.0, ans=0.125 2023-06-25 11:47:03,406 INFO [train.py:996] (3/4) Epoch 11, batch 25000, loss[loss=0.2116, simple_loss=0.2807, pruned_loss=0.07128, over 21739.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3117, pruned_loss=0.08497, over 4283370.72 frames. 
], batch size: 333, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:47:23,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 7.420e+02 9.724e+02 1.691e+03 3.300e+03, threshold=1.945e+03, percent-clipped=1.0 2023-06-25 11:47:58,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1979796.0, ans=0.125 2023-06-25 11:48:18,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1979856.0, ans=0.0 2023-06-25 11:48:29,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1979856.0, ans=0.1 2023-06-25 11:48:41,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1979916.0, ans=0.0 2023-06-25 11:48:48,895 INFO [train.py:996] (3/4) Epoch 11, batch 25050, loss[loss=0.2095, simple_loss=0.2691, pruned_loss=0.07493, over 21974.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3071, pruned_loss=0.08277, over 4276480.11 frames. ], batch size: 103, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:50:28,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1980216.0, ans=0.0 2023-06-25 11:50:34,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1980216.0, ans=0.125 2023-06-25 11:50:36,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1980276.0, ans=0.125 2023-06-25 11:50:37,631 INFO [train.py:996] (3/4) Epoch 11, batch 25100, loss[loss=0.2341, simple_loss=0.3199, pruned_loss=0.07415, over 21822.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3018, pruned_loss=0.08138, over 4274431.33 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:50:58,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.064e+02 8.337e+02 1.105e+03 1.657e+03 3.761e+03, threshold=2.211e+03, percent-clipped=17.0 2023-06-25 11:51:02,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=15.0 2023-06-25 11:51:38,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1980396.0, ans=0.1 2023-06-25 11:52:05,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1980516.0, ans=0.125 2023-06-25 11:52:22,312 INFO [train.py:996] (3/4) Epoch 11, batch 25150, loss[loss=0.2612, simple_loss=0.3363, pruned_loss=0.093, over 21698.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3034, pruned_loss=0.07984, over 4254201.84 frames. 
], batch size: 389, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:52:45,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1980636.0, ans=0.125 2023-06-25 11:52:53,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1980636.0, ans=15.0 2023-06-25 11:53:47,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980756.0, ans=0.1 2023-06-25 11:53:56,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-25 11:54:08,591 INFO [train.py:996] (3/4) Epoch 11, batch 25200, loss[loss=0.1892, simple_loss=0.2647, pruned_loss=0.05685, over 21874.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3006, pruned_loss=0.07646, over 4258820.11 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:54:21,784 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.466e+02 1.183e+03 1.682e+03 4.504e+03, threshold=2.365e+03, percent-clipped=14.0 2023-06-25 11:54:28,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1980936.0, ans=0.125 2023-06-25 11:54:56,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1980996.0, ans=0.125 2023-06-25 11:55:33,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1981056.0, ans=0.1 2023-06-25 11:55:56,151 INFO [train.py:996] (3/4) Epoch 11, batch 25250, loss[loss=0.2502, simple_loss=0.3101, pruned_loss=0.09511, over 21739.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2982, pruned_loss=0.07495, over 4265805.88 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:56:28,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1981236.0, ans=0.125 2023-06-25 11:57:24,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1981356.0, ans=0.2 2023-06-25 11:57:42,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0 2023-06-25 11:57:44,422 INFO [train.py:996] (3/4) Epoch 11, batch 25300, loss[loss=0.1931, simple_loss=0.2771, pruned_loss=0.05455, over 21663.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2959, pruned_loss=0.07412, over 4249381.47 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:57:57,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.052e+02 7.897e+02 1.317e+03 1.738e+03 3.362e+03, threshold=2.634e+03, percent-clipped=11.0 2023-06-25 11:58:18,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. 
limit=10.0 2023-06-25 11:58:24,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1981536.0, ans=0.125 2023-06-25 11:59:22,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1981716.0, ans=0.125 2023-06-25 11:59:32,060 INFO [train.py:996] (3/4) Epoch 11, batch 25350, loss[loss=0.1828, simple_loss=0.2669, pruned_loss=0.0494, over 21714.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2973, pruned_loss=0.07291, over 4238620.94 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:00:07,329 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:01:00,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-25 12:01:03,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1982016.0, ans=0.125 2023-06-25 12:01:17,468 INFO [train.py:996] (3/4) Epoch 11, batch 25400, loss[loss=0.2074, simple_loss=0.2684, pruned_loss=0.07317, over 21239.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2941, pruned_loss=0.07286, over 4250086.85 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:01:22,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1982076.0, ans=0.1 2023-06-25 12:01:37,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 9.282e+02 1.307e+03 1.888e+03 3.568e+03, threshold=2.613e+03, percent-clipped=8.0 2023-06-25 12:02:03,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-25 12:03:02,664 INFO [train.py:996] (3/4) Epoch 11, batch 25450, loss[loss=0.209, simple_loss=0.2906, pruned_loss=0.06373, over 21322.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2944, pruned_loss=0.07386, over 4260659.54 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:03:06,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1982376.0, ans=0.0 2023-06-25 12:03:08,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982376.0, ans=0.125 2023-06-25 12:03:21,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-25 12:04:44,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1982616.0, ans=0.1 2023-06-25 12:04:49,932 INFO [train.py:996] (3/4) Epoch 11, batch 25500, loss[loss=0.1738, simple_loss=0.2562, pruned_loss=0.04571, over 21338.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2958, pruned_loss=0.0716, over 4261870.91 frames. 
], batch size: 159, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:05:10,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.694e+02 1.169e+03 1.712e+03 3.614e+03, threshold=2.338e+03, percent-clipped=5.0 2023-06-25 12:05:57,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1982796.0, ans=0.0 2023-06-25 12:06:34,928 INFO [train.py:996] (3/4) Epoch 11, batch 25550, loss[loss=0.2318, simple_loss=0.3109, pruned_loss=0.07638, over 16583.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3023, pruned_loss=0.07159, over 4240837.93 frames. ], batch size: 60, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:07:45,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1983096.0, ans=0.2 2023-06-25 12:07:56,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1983156.0, ans=10.0 2023-06-25 12:08:12,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1983216.0, ans=0.0 2023-06-25 12:08:38,082 INFO [train.py:996] (3/4) Epoch 11, batch 25600, loss[loss=0.3475, simple_loss=0.3971, pruned_loss=0.149, over 21355.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.307, pruned_loss=0.07339, over 4248061.73 frames. ], batch size: 507, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:08:46,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1983276.0, ans=0.125 2023-06-25 12:08:52,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 7.738e+02 1.030e+03 1.718e+03 3.511e+03, threshold=2.059e+03, percent-clipped=11.0 2023-06-25 12:09:20,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1983396.0, ans=0.0 2023-06-25 12:09:31,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-25 12:09:59,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-25 12:10:24,087 INFO [train.py:996] (3/4) Epoch 11, batch 25650, loss[loss=0.2539, simple_loss=0.3061, pruned_loss=0.1009, over 21236.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3091, pruned_loss=0.07588, over 4252734.62 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:10:58,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1983636.0, ans=0.0 2023-06-25 12:12:11,571 INFO [train.py:996] (3/4) Epoch 11, batch 25700, loss[loss=0.2422, simple_loss=0.3038, pruned_loss=0.09028, over 21727.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3053, pruned_loss=0.07664, over 4256598.67 frames. 
], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:12:38,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.851e+02 8.245e+02 1.134e+03 1.562e+03 3.915e+03, threshold=2.269e+03, percent-clipped=11.0 2023-06-25 12:12:42,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1983936.0, ans=0.125 2023-06-25 12:13:07,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1983996.0, ans=0.125 2023-06-25 12:13:15,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1984056.0, ans=0.125 2023-06-25 12:13:42,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1984116.0, ans=0.0 2023-06-25 12:14:01,422 INFO [train.py:996] (3/4) Epoch 11, batch 25750, loss[loss=0.2701, simple_loss=0.3325, pruned_loss=0.1038, over 21340.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3092, pruned_loss=0.07851, over 4260467.91 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:14:57,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1984356.0, ans=0.125 2023-06-25 12:15:15,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1984356.0, ans=0.0 2023-06-25 12:15:56,833 INFO [train.py:996] (3/4) Epoch 11, batch 25800, loss[loss=0.2371, simple_loss=0.3108, pruned_loss=0.08169, over 21478.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3216, pruned_loss=0.08334, over 4264439.59 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:15:57,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1984476.0, ans=0.0 2023-06-25 12:16:12,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.965e+02 1.490e+03 2.590e+03 4.866e+03, threshold=2.981e+03, percent-clipped=29.0 2023-06-25 12:17:45,333 INFO [train.py:996] (3/4) Epoch 11, batch 25850, loss[loss=0.2312, simple_loss=0.3191, pruned_loss=0.07168, over 20071.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.323, pruned_loss=0.08335, over 4268726.72 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:17:57,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1984776.0, ans=0.125 2023-06-25 12:17:59,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-25 12:18:40,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-25 12:19:33,832 INFO [train.py:996] (3/4) Epoch 11, batch 25900, loss[loss=0.2211, simple_loss=0.2922, pruned_loss=0.07494, over 21207.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3252, pruned_loss=0.0847, over 4278013.24 frames. 
], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:19:45,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1985076.0, ans=0.125 2023-06-25 12:19:54,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.473e+02 1.223e+03 1.634e+03 2.981e+03, threshold=2.447e+03, percent-clipped=0.0 2023-06-25 12:19:56,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1985136.0, ans=0.04949747468305833 2023-06-25 12:20:00,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1985136.0, ans=0.125 2023-06-25 12:20:12,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1985136.0, ans=0.125 2023-06-25 12:20:58,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1985256.0, ans=0.125 2023-06-25 12:21:21,876 INFO [train.py:996] (3/4) Epoch 11, batch 25950, loss[loss=0.2742, simple_loss=0.348, pruned_loss=0.1002, over 21580.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3304, pruned_loss=0.08679, over 4274987.67 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:21:36,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1985376.0, ans=0.125 2023-06-25 12:22:15,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1985496.0, ans=0.125 2023-06-25 12:23:14,826 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:23:18,165 INFO [train.py:996] (3/4) Epoch 11, batch 26000, loss[loss=0.2533, simple_loss=0.3342, pruned_loss=0.08617, over 21716.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3301, pruned_loss=0.08572, over 4268292.62 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:23:29,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1985676.0, ans=0.0 2023-06-25 12:23:40,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.867e+02 1.001e+03 1.506e+03 3.925e+03, threshold=2.003e+03, percent-clipped=6.0 2023-06-25 12:23:53,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1985736.0, ans=0.0 2023-06-25 12:24:02,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1985736.0, ans=0.1 2023-06-25 12:24:25,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1985856.0, ans=0.2 2023-06-25 12:24:34,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1985856.0, ans=0.0 2023-06-25 12:25:02,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1985976.0, ans=0.125 2023-06-25 12:25:03,152 INFO [train.py:996] (3/4) Epoch 11, batch 26050, loss[loss=0.239, simple_loss=0.3085, pruned_loss=0.08468, over 21872.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3306, pruned_loss=0.0869, over 4272520.34 frames. 
], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:25:09,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-25 12:25:24,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1985976.0, ans=0.125 2023-06-25 12:25:27,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1986036.0, ans=0.0 2023-06-25 12:25:29,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1986036.0, ans=15.0 2023-06-25 12:25:30,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1986036.0, ans=0.0 2023-06-25 12:25:40,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1986036.0, ans=0.0 2023-06-25 12:26:47,666 INFO [train.py:996] (3/4) Epoch 11, batch 26100, loss[loss=0.2407, simple_loss=0.2998, pruned_loss=0.09084, over 21942.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3255, pruned_loss=0.08703, over 4279126.57 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:27:09,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 7.535e+02 1.106e+03 1.701e+03 2.759e+03, threshold=2.213e+03, percent-clipped=15.0 2023-06-25 12:27:10,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1986336.0, ans=0.0 2023-06-25 12:27:33,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 12:27:36,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1986396.0, ans=0.125 2023-06-25 12:27:42,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1986396.0, ans=0.125 2023-06-25 12:28:26,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1986516.0, ans=0.2 2023-06-25 12:28:30,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-25 12:28:39,860 INFO [train.py:996] (3/4) Epoch 11, batch 26150, loss[loss=0.2486, simple_loss=0.3222, pruned_loss=0.08751, over 21593.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3211, pruned_loss=0.08612, over 4285897.73 frames. ], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:29:48,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1986756.0, ans=0.125 2023-06-25 12:29:50,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1986756.0, ans=0.2 2023-06-25 12:30:09,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1986816.0, ans=0.125 2023-06-25 12:30:26,203 INFO [train.py:996] (3/4) Epoch 11, batch 26200, loss[loss=0.2133, simple_loss=0.3047, pruned_loss=0.0609, over 21181.00 frames. 
], tot_loss[loss=0.2444, simple_loss=0.321, pruned_loss=0.08395, over 4275772.92 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:30:46,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-25 12:30:47,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1986876.0, ans=0.05 2023-06-25 12:30:53,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 7.785e+02 1.042e+03 1.454e+03 3.867e+03, threshold=2.084e+03, percent-clipped=10.0 2023-06-25 12:31:04,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-25 12:31:23,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1986996.0, ans=0.125 2023-06-25 12:31:27,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-25 12:31:43,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1987056.0, ans=0.125 2023-06-25 12:32:09,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1987176.0, ans=0.0 2023-06-25 12:32:10,629 INFO [train.py:996] (3/4) Epoch 11, batch 26250, loss[loss=0.2381, simple_loss=0.3081, pruned_loss=0.08406, over 21337.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3246, pruned_loss=0.0833, over 4271910.51 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:32:43,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-25 12:32:47,959 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:34:04,630 INFO [train.py:996] (3/4) Epoch 11, batch 26300, loss[loss=0.2228, simple_loss=0.2973, pruned_loss=0.07414, over 21882.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3216, pruned_loss=0.08346, over 4281526.66 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:34:22,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.85 vs. limit=5.0 2023-06-25 12:34:26,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.433e+02 7.825e+02 1.057e+03 1.626e+03 4.026e+03, threshold=2.114e+03, percent-clipped=11.0 2023-06-25 12:34:52,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1987596.0, ans=0.125 2023-06-25 12:34:55,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. 
limit=15.0 2023-06-25 12:35:05,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1987656.0, ans=0.025 2023-06-25 12:35:23,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1987656.0, ans=0.125 2023-06-25 12:35:41,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1987716.0, ans=0.0 2023-06-25 12:35:49,277 INFO [train.py:996] (3/4) Epoch 11, batch 26350, loss[loss=0.2594, simple_loss=0.3339, pruned_loss=0.09249, over 21563.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3194, pruned_loss=0.08371, over 4288530.71 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:36:27,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1987896.0, ans=0.125 2023-06-25 12:37:28,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1988016.0, ans=0.125 2023-06-25 12:37:31,592 INFO [train.py:996] (3/4) Epoch 11, batch 26400, loss[loss=0.2034, simple_loss=0.2586, pruned_loss=0.07406, over 21235.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3146, pruned_loss=0.08415, over 4284932.80 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:37:32,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1988076.0, ans=0.0 2023-06-25 12:37:48,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1988136.0, ans=0.2 2023-06-25 12:37:50,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 8.065e+02 9.903e+02 1.362e+03 2.931e+03, threshold=1.981e+03, percent-clipped=5.0 2023-06-25 12:37:50,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988136.0, ans=0.1 2023-06-25 12:38:45,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1988256.0, ans=0.2 2023-06-25 12:39:16,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1988316.0, ans=0.125 2023-06-25 12:39:22,501 INFO [train.py:996] (3/4) Epoch 11, batch 26450, loss[loss=0.2176, simple_loss=0.2871, pruned_loss=0.07405, over 21137.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3133, pruned_loss=0.08337, over 4273911.62 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:39:42,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1988436.0, ans=0.125 2023-06-25 12:39:46,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1988436.0, ans=0.1 2023-06-25 12:40:15,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1988496.0, ans=0.0 2023-06-25 12:40:36,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1988556.0, ans=0.2 2023-06-25 12:41:00,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.99 vs. 
limit=15.0 2023-06-25 12:41:04,364 INFO [train.py:996] (3/4) Epoch 11, batch 26500, loss[loss=0.2135, simple_loss=0.2973, pruned_loss=0.06482, over 21728.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3163, pruned_loss=0.08111, over 4279589.64 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:41:34,314 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.135e+02 9.452e+02 1.417e+03 2.268e+03 5.584e+03, threshold=2.834e+03, percent-clipped=34.0 2023-06-25 12:41:49,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-25 12:41:53,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-06-25 12:42:16,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1988856.0, ans=0.125 2023-06-25 12:42:24,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988856.0, ans=0.1 2023-06-25 12:43:02,831 INFO [train.py:996] (3/4) Epoch 11, batch 26550, loss[loss=0.1951, simple_loss=0.2805, pruned_loss=0.05487, over 21629.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3128, pruned_loss=0.07809, over 4275652.71 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:43:20,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-25 12:43:30,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989036.0, ans=0.1 2023-06-25 12:43:56,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1989096.0, ans=0.125 2023-06-25 12:44:11,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1989156.0, ans=0.125 2023-06-25 12:44:43,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1989216.0, ans=0.125 2023-06-25 12:44:55,005 INFO [train.py:996] (3/4) Epoch 11, batch 26600, loss[loss=0.2281, simple_loss=0.2933, pruned_loss=0.08141, over 21318.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3142, pruned_loss=0.0766, over 4280256.79 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:44:57,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=12.0 2023-06-25 12:45:07,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1989276.0, ans=0.0 2023-06-25 12:45:17,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1989336.0, ans=0.04949747468305833 2023-06-25 12:45:18,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.580e+02 8.378e+02 1.280e+03 1.887e+03 4.610e+03, threshold=2.560e+03, percent-clipped=7.0 2023-06-25 12:46:41,104 INFO [train.py:996] (3/4) Epoch 11, batch 26650, loss[loss=0.1563, simple_loss=0.2371, pruned_loss=0.03779, over 21513.00 frames. 
], tot_loss[loss=0.2288, simple_loss=0.307, pruned_loss=0.07535, over 4268495.26 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:46:46,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1989576.0, ans=0.125 2023-06-25 12:46:48,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1989576.0, ans=0.0 2023-06-25 12:46:58,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1989576.0, ans=0.2 2023-06-25 12:47:14,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1989636.0, ans=0.04949747468305833 2023-06-25 12:47:32,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1989696.0, ans=0.125 2023-06-25 12:48:26,206 INFO [train.py:996] (3/4) Epoch 11, batch 26700, loss[loss=0.2553, simple_loss=0.3204, pruned_loss=0.09512, over 21772.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3009, pruned_loss=0.07333, over 4257249.49 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:48:26,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1989876.0, ans=10.0 2023-06-25 12:48:45,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=12.0 2023-06-25 12:48:49,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.698e+02 1.295e+03 2.536e+03, threshold=1.740e+03, percent-clipped=0.0 2023-06-25 12:48:57,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1989936.0, ans=0.0 2023-06-25 12:49:23,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1990056.0, ans=0.125 2023-06-25 12:50:13,228 INFO [train.py:996] (3/4) Epoch 11, batch 26750, loss[loss=0.3012, simple_loss=0.3748, pruned_loss=0.1138, over 21445.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3007, pruned_loss=0.07244, over 4262971.83 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:50:31,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1990236.0, ans=0.1 2023-06-25 12:51:54,737 INFO [train.py:996] (3/4) Epoch 11, batch 26800, loss[loss=0.2489, simple_loss=0.3259, pruned_loss=0.08601, over 21376.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3072, pruned_loss=0.07672, over 4264665.00 frames. ], batch size: 549, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:52:15,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.736e+02 1.158e+03 1.774e+03 3.470e+03, threshold=2.315e+03, percent-clipped=25.0 2023-06-25 12:52:28,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. 
limit=22.5 2023-06-25 12:52:44,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1990596.0, ans=0.125 2023-06-25 12:53:10,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1990656.0, ans=0.04949747468305833 2023-06-25 12:53:14,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-25 12:53:33,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1990716.0, ans=0.1 2023-06-25 12:53:42,507 INFO [train.py:996] (3/4) Epoch 11, batch 26850, loss[loss=0.2128, simple_loss=0.275, pruned_loss=0.07531, over 21565.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3093, pruned_loss=0.07973, over 4265794.13 frames. ], batch size: 391, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:54:03,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-25 12:54:29,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1990896.0, ans=0.125 2023-06-25 12:55:26,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1991076.0, ans=0.125 2023-06-25 12:55:27,795 INFO [train.py:996] (3/4) Epoch 11, batch 26900, loss[loss=0.2093, simple_loss=0.2751, pruned_loss=0.07179, over 21664.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3002, pruned_loss=0.07854, over 4272997.00 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:55:31,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1991076.0, ans=0.125 2023-06-25 12:55:31,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991076.0, ans=0.1 2023-06-25 12:55:47,030 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 7.282e+02 8.869e+02 1.344e+03 2.683e+03, threshold=1.774e+03, percent-clipped=1.0 2023-06-25 12:56:18,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1991196.0, ans=0.0 2023-06-25 12:57:13,431 INFO [train.py:996] (3/4) Epoch 11, batch 26950, loss[loss=0.2531, simple_loss=0.3422, pruned_loss=0.08199, over 21765.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3013, pruned_loss=0.0793, over 4269965.88 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:57:46,925 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:57:56,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1991496.0, ans=0.125 2023-06-25 12:58:30,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1991556.0, ans=0.0 2023-06-25 12:58:59,438 INFO [train.py:996] (3/4) Epoch 11, batch 27000, loss[loss=0.1964, simple_loss=0.2703, pruned_loss=0.06121, over 21234.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3009, pruned_loss=0.07644, over 4266592.42 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 12:58:59,439 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 12:59:15,469 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2060, 3.1501, 1.7307, 1.6743], device='cuda:3') 2023-06-25 12:59:16,968 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.235, simple_loss=0.334, pruned_loss=0.06803, over 1796401.00 frames. 2023-06-25 12:59:16,969 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 12:59:38,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1991736.0, ans=0.125 2023-06-25 12:59:39,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1991736.0, ans=0.125 2023-06-25 12:59:48,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1991736.0, ans=0.125 2023-06-25 12:59:55,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 9.015e+02 1.282e+03 1.827e+03 4.662e+03, threshold=2.565e+03, percent-clipped=27.0 2023-06-25 13:00:56,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1991916.0, ans=0.0 2023-06-25 13:01:06,431 INFO [train.py:996] (3/4) Epoch 11, batch 27050, loss[loss=0.2203, simple_loss=0.3023, pruned_loss=0.06914, over 21889.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3022, pruned_loss=0.073, over 4270089.44 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:01:06,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1991976.0, ans=0.125 2023-06-25 13:01:23,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1992036.0, ans=0.125 2023-06-25 13:01:23,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992036.0, ans=0.1 2023-06-25 13:02:16,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1992156.0, ans=0.0 2023-06-25 13:02:21,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1992156.0, ans=0.2 2023-06-25 13:02:25,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-25 13:02:29,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1992216.0, ans=0.07 2023-06-25 13:02:31,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-25 13:02:51,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1992276.0, ans=0.125 2023-06-25 13:02:53,336 INFO [train.py:996] (3/4) Epoch 11, batch 27100, loss[loss=0.254, simple_loss=0.3238, pruned_loss=0.09209, over 21502.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3041, pruned_loss=0.07468, over 4282072.87 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:03:27,668 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.097e+02 9.568e+02 1.359e+03 2.016e+03 3.804e+03, threshold=2.717e+03, percent-clipped=7.0 2023-06-25 13:04:34,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1992516.0, ans=0.125 2023-06-25 13:04:41,017 INFO [train.py:996] (3/4) Epoch 11, batch 27150, loss[loss=0.3674, simple_loss=0.446, pruned_loss=0.1444, over 21522.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3154, pruned_loss=0.07815, over 4278190.02 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:05:08,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-25 13:06:15,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1992816.0, ans=0.125 2023-06-25 13:06:25,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1992876.0, ans=0.125 2023-06-25 13:06:27,132 INFO [train.py:996] (3/4) Epoch 11, batch 27200, loss[loss=0.2863, simple_loss=0.3634, pruned_loss=0.1046, over 21896.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3264, pruned_loss=0.08211, over 4278532.56 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:07:01,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.854e+02 1.107e+03 1.912e+03 4.473e+03, threshold=2.214e+03, percent-clipped=8.0 2023-06-25 13:07:21,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1992996.0, ans=0.125 2023-06-25 13:08:27,012 INFO [train.py:996] (3/4) Epoch 11, batch 27250, loss[loss=0.2677, simple_loss=0.3338, pruned_loss=0.1008, over 21757.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3295, pruned_loss=0.08626, over 4275461.70 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:08:32,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1993176.0, ans=0.125 2023-06-25 13:10:18,673 INFO [train.py:996] (3/4) Epoch 11, batch 27300, loss[loss=0.2468, simple_loss=0.3278, pruned_loss=0.08289, over 21611.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3306, pruned_loss=0.08624, over 4273405.73 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:10:19,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-25 13:10:28,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-25 13:10:29,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-25 13:10:46,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.500e+02 8.064e+02 1.048e+03 1.560e+03 3.072e+03, threshold=2.097e+03, percent-clipped=8.0 2023-06-25 13:11:28,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. 
limit=10.0 2023-06-25 13:11:29,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-25 13:11:40,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1993656.0, ans=0.1 2023-06-25 13:11:48,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1993716.0, ans=0.05 2023-06-25 13:12:04,739 INFO [train.py:996] (3/4) Epoch 11, batch 27350, loss[loss=0.278, simple_loss=0.3591, pruned_loss=0.09847, over 21380.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3311, pruned_loss=0.08623, over 4276828.37 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:12:09,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-25 13:12:11,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1993776.0, ans=0.125 2023-06-25 13:12:44,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1993896.0, ans=0.0 2023-06-25 13:12:45,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-25 13:13:44,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1994016.0, ans=0.0 2023-06-25 13:13:50,489 INFO [train.py:996] (3/4) Epoch 11, batch 27400, loss[loss=0.2513, simple_loss=0.3135, pruned_loss=0.09453, over 21828.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.327, pruned_loss=0.08621, over 4280120.64 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:13:53,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1994076.0, ans=0.09899494936611666 2023-06-25 13:13:55,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994076.0, ans=0.1 2023-06-25 13:14:17,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.617e+02 1.033e+03 1.386e+03 3.217e+03, threshold=2.066e+03, percent-clipped=9.0 2023-06-25 13:14:20,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 13:14:28,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994196.0, ans=0.1 2023-06-25 13:14:54,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1994196.0, ans=0.0 2023-06-25 13:15:27,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1994316.0, ans=0.2 2023-06-25 13:15:36,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1994376.0, ans=0.125 2023-06-25 13:15:37,970 INFO [train.py:996] (3/4) Epoch 11, batch 27450, loss[loss=0.2441, simple_loss=0.335, pruned_loss=0.07655, over 21628.00 frames. 
], tot_loss[loss=0.2435, simple_loss=0.3196, pruned_loss=0.08371, over 4277165.20 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:15:57,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=15.0 2023-06-25 13:16:11,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1994436.0, ans=0.2 2023-06-25 13:16:49,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1994556.0, ans=0.125 2023-06-25 13:17:23,730 INFO [train.py:996] (3/4) Epoch 11, batch 27500, loss[loss=0.263, simple_loss=0.3269, pruned_loss=0.09958, over 21715.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3172, pruned_loss=0.08376, over 4277213.57 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:17:24,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1994676.0, ans=0.2 2023-06-25 13:17:40,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1994676.0, ans=0.05 2023-06-25 13:17:51,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.259e+02 1.005e+03 1.389e+03 2.816e+03, threshold=2.010e+03, percent-clipped=4.0 2023-06-25 13:18:08,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1994796.0, ans=0.125 2023-06-25 13:18:16,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1994796.0, ans=0.125 2023-06-25 13:18:28,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1994856.0, ans=0.0 2023-06-25 13:18:33,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994856.0, ans=0.1 2023-06-25 13:18:38,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1994856.0, ans=0.125 2023-06-25 13:18:43,961 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-25 13:19:07,755 INFO [train.py:996] (3/4) Epoch 11, batch 27550, loss[loss=0.2077, simple_loss=0.2897, pruned_loss=0.06282, over 21544.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3127, pruned_loss=0.08086, over 4280848.40 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:20:17,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1995156.0, ans=0.2 2023-06-25 13:20:42,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1995216.0, ans=0.1 2023-06-25 13:20:54,516 INFO [train.py:996] (3/4) Epoch 11, batch 27600, loss[loss=0.1809, simple_loss=0.2414, pruned_loss=0.06023, over 20732.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3059, pruned_loss=0.0798, over 4277495.56 frames. 
], batch size: 609, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:20:54,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1995276.0, ans=0.125 2023-06-25 13:21:06,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1995276.0, ans=0.1 2023-06-25 13:21:17,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.292e+02 9.329e+02 1.469e+03 1.993e+03 3.791e+03, threshold=2.938e+03, percent-clipped=25.0 2023-06-25 13:22:05,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1995456.0, ans=0.125 2023-06-25 13:22:07,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1995456.0, ans=0.125 2023-06-25 13:22:07,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=15.0 2023-06-25 13:22:27,929 INFO [train.py:996] (3/4) Epoch 11, batch 27650, loss[loss=0.2354, simple_loss=0.2986, pruned_loss=0.08608, over 21463.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3001, pruned_loss=0.07867, over 4271350.54 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:22:40,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1995576.0, ans=0.0 2023-06-25 13:24:05,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1995816.0, ans=0.0 2023-06-25 13:24:19,710 INFO [train.py:996] (3/4) Epoch 11, batch 27700, loss[loss=0.269, simple_loss=0.3347, pruned_loss=0.1017, over 19963.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3008, pruned_loss=0.07702, over 4270407.46 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:24:43,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.277e+02 1.271e+03 1.738e+03 3.564e+03, threshold=2.542e+03, percent-clipped=2.0 2023-06-25 13:25:10,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1995996.0, ans=0.125 2023-06-25 13:25:41,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1996056.0, ans=0.125 2023-06-25 13:25:41,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1996056.0, ans=0.125 2023-06-25 13:26:01,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1996116.0, ans=0.0 2023-06-25 13:26:05,566 INFO [train.py:996] (3/4) Epoch 11, batch 27750, loss[loss=0.1994, simple_loss=0.2926, pruned_loss=0.05308, over 21845.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.303, pruned_loss=0.07592, over 4273956.24 frames. 
], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:26:56,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1996296.0, ans=0.125 2023-06-25 13:27:19,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1996356.0, ans=0.125 2023-06-25 13:27:20,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1996356.0, ans=0.125 2023-06-25 13:27:32,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=22.5 2023-06-25 13:27:43,136 INFO [train.py:996] (3/4) Epoch 11, batch 27800, loss[loss=0.2498, simple_loss=0.3106, pruned_loss=0.09453, over 21409.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3013, pruned_loss=0.07626, over 4276899.91 frames. ], batch size: 177, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:28:10,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+02 7.249e+02 9.541e+02 1.506e+03 2.955e+03, threshold=1.908e+03, percent-clipped=10.0 2023-06-25 13:28:11,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-25 13:28:45,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1996596.0, ans=0.1 2023-06-25 13:28:54,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-25 13:29:27,195 INFO [train.py:996] (3/4) Epoch 11, batch 27850, loss[loss=0.2407, simple_loss=0.3307, pruned_loss=0.07535, over 21626.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3006, pruned_loss=0.07733, over 4283118.65 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:29:38,622 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:29:56,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-25 13:30:16,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1996836.0, ans=0.125 2023-06-25 13:30:17,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-25 13:30:18,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1996896.0, ans=0.0 2023-06-25 13:30:51,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1996956.0, ans=0.015 2023-06-25 13:31:09,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1997016.0, ans=0.1 2023-06-25 13:31:17,903 INFO [train.py:996] (3/4) Epoch 11, batch 27900, loss[loss=0.2649, simple_loss=0.3581, pruned_loss=0.08585, over 21733.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3073, pruned_loss=0.07693, over 4287682.20 frames. 
], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:31:33,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1997076.0, ans=0.0 2023-06-25 13:31:45,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1997136.0, ans=0.125 2023-06-25 13:31:53,156 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 7.594e+02 1.073e+03 1.549e+03 3.110e+03, threshold=2.145e+03, percent-clipped=9.0 2023-06-25 13:32:29,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-25 13:33:06,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1997316.0, ans=0.0 2023-06-25 13:33:12,650 INFO [train.py:996] (3/4) Epoch 11, batch 27950, loss[loss=0.2583, simple_loss=0.3466, pruned_loss=0.08502, over 21695.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3089, pruned_loss=0.07422, over 4282678.72 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:34:39,489 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:34:52,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1997616.0, ans=0.125 2023-06-25 13:34:57,421 INFO [train.py:996] (3/4) Epoch 11, batch 28000, loss[loss=0.2394, simple_loss=0.3132, pruned_loss=0.08275, over 21884.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3082, pruned_loss=0.07276, over 4287895.39 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:35:07,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1997676.0, ans=0.125 2023-06-25 13:35:14,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1997676.0, ans=0.125 2023-06-25 13:35:19,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1997736.0, ans=0.125 2023-06-25 13:35:24,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-25 13:35:25,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.615e+02 8.690e+02 1.335e+03 1.864e+03 4.176e+03, threshold=2.670e+03, percent-clipped=16.0 2023-06-25 13:35:33,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1997736.0, ans=0.0 2023-06-25 13:36:49,662 INFO [train.py:996] (3/4) Epoch 11, batch 28050, loss[loss=0.1749, simple_loss=0.2395, pruned_loss=0.05514, over 21369.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3065, pruned_loss=0.07449, over 4291830.56 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:37:34,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1998096.0, ans=0.125 2023-06-25 13:38:03,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1998156.0, ans=0.125 2023-06-25 13:38:14,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1998216.0, ans=0.125 2023-06-25 13:38:37,689 INFO [train.py:996] (3/4) Epoch 11, batch 28100, loss[loss=0.2216, simple_loss=0.2882, pruned_loss=0.07745, over 19918.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3034, pruned_loss=0.07479, over 4285814.63 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:38:56,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.14 vs. limit=15.0 2023-06-25 13:39:01,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 8.283e+02 1.257e+03 1.912e+03 3.792e+03, threshold=2.513e+03, percent-clipped=5.0 2023-06-25 13:39:18,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1998396.0, ans=0.0 2023-06-25 13:39:42,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1998456.0, ans=0.1 2023-06-25 13:39:45,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1998456.0, ans=0.125 2023-06-25 13:40:22,782 INFO [train.py:996] (3/4) Epoch 11, batch 28150, loss[loss=0.2076, simple_loss=0.2752, pruned_loss=0.06997, over 21864.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2969, pruned_loss=0.0746, over 4282836.27 frames. ], batch size: 373, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:41:04,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1998696.0, ans=0.125 2023-06-25 13:41:22,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1998756.0, ans=0.125 2023-06-25 13:41:22,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1998756.0, ans=0.1 2023-06-25 13:42:10,812 INFO [train.py:996] (3/4) Epoch 11, batch 28200, loss[loss=0.281, simple_loss=0.3442, pruned_loss=0.1088, over 21566.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2955, pruned_loss=0.07625, over 4282826.11 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:42:42,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.056e+02 7.864e+02 1.049e+03 1.647e+03 3.891e+03, threshold=2.099e+03, percent-clipped=11.0 2023-06-25 13:43:22,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1999056.0, ans=0.0 2023-06-25 13:43:57,336 INFO [train.py:996] (3/4) Epoch 11, batch 28250, loss[loss=0.2308, simple_loss=0.2925, pruned_loss=0.08452, over 21852.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3007, pruned_loss=0.07942, over 4279590.82 frames. 
], batch size: 107, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:44:25,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-25 13:44:54,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1999296.0, ans=0.125 2023-06-25 13:45:08,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-25 13:45:45,814 INFO [train.py:996] (3/4) Epoch 11, batch 28300, loss[loss=0.1605, simple_loss=0.2347, pruned_loss=0.04312, over 21247.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2982, pruned_loss=0.07717, over 4268658.60 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:45:58,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5 2023-06-25 13:46:04,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1999476.0, ans=0.125 2023-06-25 13:46:24,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 7.571e+02 1.027e+03 1.599e+03 2.949e+03, threshold=2.054e+03, percent-clipped=6.0 2023-06-25 13:47:28,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-25 13:47:38,009 INFO [train.py:996] (3/4) Epoch 11, batch 28350, loss[loss=0.1966, simple_loss=0.2801, pruned_loss=0.05653, over 21407.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2944, pruned_loss=0.07222, over 4264095.79 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:49:00,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-25 13:49:21,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2000016.0, ans=0.2 2023-06-25 13:49:24,420 INFO [train.py:996] (3/4) Epoch 11, batch 28400, loss[loss=0.2668, simple_loss=0.3305, pruned_loss=0.1016, over 21683.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2914, pruned_loss=0.07282, over 4266491.93 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:49:57,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.013e+02 9.043e+02 1.474e+03 1.977e+03 3.910e+03, threshold=2.949e+03, percent-clipped=21.0 2023-06-25 13:50:14,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2000196.0, ans=0.125 2023-06-25 13:51:05,410 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:51:09,800 INFO [train.py:996] (3/4) Epoch 11, batch 28450, loss[loss=0.2581, simple_loss=0.3271, pruned_loss=0.09455, over 21687.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2969, pruned_loss=0.07629, over 4274378.84 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:52:21,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. 
limit=15.0 2023-06-25 13:52:22,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2000556.0, ans=0.125 2023-06-25 13:52:41,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2000616.0, ans=0.125 2023-06-25 13:52:48,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-25 13:52:51,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2000616.0, ans=0.0 2023-06-25 13:52:54,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2000676.0, ans=0.0 2023-06-25 13:52:55,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.00 vs. limit=10.0 2023-06-25 13:53:03,106 INFO [train.py:996] (3/4) Epoch 11, batch 28500, loss[loss=0.2429, simple_loss=0.3059, pruned_loss=0.08991, over 21577.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2997, pruned_loss=0.07828, over 4276177.88 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:53:18,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2000736.0, ans=0.0 2023-06-25 13:53:22,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2000736.0, ans=0.1 2023-06-25 13:53:46,206 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 7.652e+02 9.900e+02 1.430e+03 3.378e+03, threshold=1.980e+03, percent-clipped=2.0 2023-06-25 13:53:58,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2000796.0, ans=0.2 2023-06-25 13:54:30,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2000916.0, ans=0.05 2023-06-25 13:54:48,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2000916.0, ans=10.0 2023-06-25 13:54:51,316 INFO [train.py:996] (3/4) Epoch 11, batch 28550, loss[loss=0.2455, simple_loss=0.3497, pruned_loss=0.07065, over 21397.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3096, pruned_loss=0.08155, over 4281534.82 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 4.0 2023-06-25 13:55:20,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2001036.0, ans=0.05 2023-06-25 13:55:57,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2001156.0, ans=0.125 2023-06-25 13:56:07,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2001156.0, ans=0.125 2023-06-25 13:56:44,030 INFO [train.py:996] (3/4) Epoch 11, batch 28600, loss[loss=0.2037, simple_loss=0.284, pruned_loss=0.06175, over 21746.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3146, pruned_loss=0.08215, over 4277986.03 frames. 
], batch size: 247, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:57:14,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2001336.0, ans=0.125 2023-06-25 13:57:18,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.619e+02 9.941e+02 1.518e+03 3.528e+03, threshold=1.988e+03, percent-clipped=12.0 2023-06-25 13:57:56,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2001456.0, ans=0.125 2023-06-25 13:57:57,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2001456.0, ans=0.125 2023-06-25 13:58:18,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2001516.0, ans=0.125 2023-06-25 13:58:28,351 INFO [train.py:996] (3/4) Epoch 11, batch 28650, loss[loss=0.2084, simple_loss=0.2668, pruned_loss=0.07499, over 21751.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3092, pruned_loss=0.08114, over 4269496.39 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:59:19,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2001696.0, ans=0.5 2023-06-25 13:59:35,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2001756.0, ans=0.025 2023-06-25 13:59:36,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2001756.0, ans=0.04949747468305833 2023-06-25 14:00:20,281 INFO [train.py:996] (3/4) Epoch 11, batch 28700, loss[loss=0.2686, simple_loss=0.3319, pruned_loss=0.1027, over 21356.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3085, pruned_loss=0.08247, over 4266711.03 frames. ], batch size: 549, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:00:45,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2001936.0, ans=0.125 2023-06-25 14:00:55,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 7.291e+02 9.817e+02 1.860e+03 4.444e+03, threshold=1.963e+03, percent-clipped=16.0 2023-06-25 14:00:56,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-25 14:01:19,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2001996.0, ans=0.1 2023-06-25 14:01:58,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2002116.0, ans=0.2 2023-06-25 14:02:03,138 INFO [train.py:996] (3/4) Epoch 11, batch 28750, loss[loss=0.2689, simple_loss=0.3739, pruned_loss=0.08192, over 19845.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.311, pruned_loss=0.08404, over 4269892.12 frames. 
], batch size: 703, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:03:08,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2002356.0, ans=0.125 2023-06-25 14:03:42,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2002416.0, ans=0.0 2023-06-25 14:03:48,786 INFO [train.py:996] (3/4) Epoch 11, batch 28800, loss[loss=0.2411, simple_loss=0.3181, pruned_loss=0.08206, over 21767.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3144, pruned_loss=0.08483, over 4275300.55 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:03:52,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2002476.0, ans=0.1 2023-06-25 14:04:29,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.612e+02 1.083e+03 1.492e+03 3.378e+03, threshold=2.166e+03, percent-clipped=11.0 2023-06-25 14:05:06,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2002656.0, ans=0.2 2023-06-25 14:05:07,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2002656.0, ans=0.125 2023-06-25 14:05:18,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2002716.0, ans=0.125 2023-06-25 14:05:25,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 14:05:29,395 INFO [train.py:996] (3/4) Epoch 11, batch 28850, loss[loss=0.2658, simple_loss=0.329, pruned_loss=0.1013, over 21835.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3161, pruned_loss=0.08644, over 4280513.73 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:05:34,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2002776.0, ans=0.1 2023-06-25 14:06:10,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2002836.0, ans=0.0 2023-06-25 14:06:22,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2002896.0, ans=0.125 2023-06-25 14:06:23,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2002896.0, ans=0.125 2023-06-25 14:06:44,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-25 14:06:59,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2003016.0, ans=0.0 2023-06-25 14:07:22,625 INFO [train.py:996] (3/4) Epoch 11, batch 28900, loss[loss=0.2271, simple_loss=0.293, pruned_loss=0.08054, over 21331.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3186, pruned_loss=0.08818, over 4283943.80 frames. 
], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:07:30,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2003076.0, ans=0.125 2023-06-25 14:07:47,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-25 14:08:00,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.503e+02 7.472e+02 9.917e+02 1.436e+03 2.913e+03, threshold=1.983e+03, percent-clipped=5.0 2023-06-25 14:08:33,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2003256.0, ans=0.125 2023-06-25 14:08:34,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-25 14:09:03,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2003316.0, ans=0.1 2023-06-25 14:09:17,137 INFO [train.py:996] (3/4) Epoch 11, batch 28950, loss[loss=0.2462, simple_loss=0.3399, pruned_loss=0.07626, over 21866.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3194, pruned_loss=0.08689, over 4274580.03 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:09:24,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2003376.0, ans=0.0 2023-06-25 14:09:36,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2003436.0, ans=0.125 2023-06-25 14:09:48,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2003436.0, ans=0.025 2023-06-25 14:10:02,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2003496.0, ans=0.0 2023-06-25 14:10:17,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2003496.0, ans=0.125 2023-06-25 14:10:19,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2003556.0, ans=0.1 2023-06-25 14:10:24,194 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:11:05,991 INFO [train.py:996] (3/4) Epoch 11, batch 29000, loss[loss=0.2624, simple_loss=0.3268, pruned_loss=0.09903, over 21499.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3224, pruned_loss=0.08585, over 4273088.03 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:11:38,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.62 vs. limit=15.0 2023-06-25 14:11:47,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. 
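limit=15.0

The recurring [optim.py:471] lines summarize recent gradient norms as five quantiles (min, 25%, median, 75%, max); in every entry in this log the reported threshold equals Clipping_scale times the median, and percent-clipped is the share of recent batches whose norm exceeded it. A rough sketch of such a median-based clipping rule, with the window size and bookkeeping as assumptions:

    import torch

    # Clip this batch's gradient norm against clipping_scale x median of a
    # sliding window of recent norms, mirroring the quantities logged by
    # [optim.py:471]. The window length is an assumption.
    def clip_by_median(params, recent_norms, clipping_scale=2.0, window=200):
        grads = [p.grad.detach().flatten() for p in params if p.grad is not None]
        norm = torch.cat(grads).norm().item()
        recent_norms.append(norm)
        del recent_norms[:-window]                    # keep the last `window`
        q = torch.quantile(torch.tensor(recent_norms),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2].item()      # 2.0 x median
        if norm > threshold:                          # rescale grads in place
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(threshold / norm)
        return q.tolist(), threshold

For instance, the entry just below reports quartiles 5.370e+02 8.594e+02 1.350e+03 2.116e+03 4.440e+03 with threshold=2.700e+03, exactly 2.0 times the 1.350e+03 median.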
2023-06-25 14:11:48,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.594e+02 1.350e+03 2.116e+03 4.440e+03, threshold=2.700e+03, percent-clipped=27.0
2023-06-25 14:12:00,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0
2023-06-25 14:12:01,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2003796.0, ans=0.125
2023-06-25 14:12:45,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2003916.0, ans=0.0
2023-06-25 14:12:52,683 INFO [train.py:996] (3/4) Epoch 11, batch 29050, loss[loss=0.241, simple_loss=0.312, pruned_loss=0.08503, over 21882.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3202, pruned_loss=0.0864, over 4281400.27 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:13:11,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2003976.0, ans=0.1
2023-06-25 14:13:20,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2004036.0, ans=0.125
2023-06-25 14:13:29,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2004036.0, ans=0.125
2023-06-25 14:14:34,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2004216.0, ans=0.125
2023-06-25 14:14:37,977 INFO [train.py:996] (3/4) Epoch 11, batch 29100, loss[loss=0.219, simple_loss=0.2843, pruned_loss=0.07686, over 21569.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3116, pruned_loss=0.08324, over 4276583.95 frames. ], batch size: 391, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:15:16,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2004336.0, ans=0.125
2023-06-25 14:15:19,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 7.507e+02 9.912e+02 1.574e+03 3.418e+03, threshold=1.982e+03, percent-clipped=5.0
2023-06-25 14:15:57,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2004456.0, ans=0.125
2023-06-25 14:16:05,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2004516.0, ans=0.1
2023-06-25 14:16:15,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5
2023-06-25 14:16:19,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2004516.0, ans=0.125
2023-06-25 14:16:24,452 INFO [train.py:996] (3/4) Epoch 11, batch 29150, loss[loss=0.2161, simple_loss=0.3079, pruned_loss=0.06211, over 20818.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.31, pruned_loss=0.08141, over 4276393.29 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:17:02,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs.
limit=15.0 2023-06-25 14:17:30,954 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-25 14:17:33,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2004756.0, ans=0.125 2023-06-25 14:17:36,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2004756.0, ans=0.2 2023-06-25 14:18:06,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2004816.0, ans=0.125 2023-06-25 14:18:08,683 INFO [train.py:996] (3/4) Epoch 11, batch 29200, loss[loss=0.1797, simple_loss=0.2508, pruned_loss=0.05425, over 20731.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3054, pruned_loss=0.08058, over 4275265.59 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:18:21,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 14:18:40,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2004936.0, ans=0.0 2023-06-25 14:18:49,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.207e+02 1.113e+03 1.658e+03 3.096e+03, threshold=2.226e+03, percent-clipped=9.0 2023-06-25 14:19:20,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2005056.0, ans=0.125 2023-06-25 14:20:00,623 INFO [train.py:996] (3/4) Epoch 11, batch 29250, loss[loss=0.1766, simple_loss=0.2544, pruned_loss=0.04939, over 15853.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3032, pruned_loss=0.07804, over 4267278.48 frames. ], batch size: 60, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:20:17,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2005176.0, ans=0.1 2023-06-25 14:20:21,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2005176.0, ans=0.2 2023-06-25 14:20:40,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-25 14:21:14,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2005356.0, ans=0.05 2023-06-25 14:21:15,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-25 14:21:44,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2005416.0, ans=0.125 2023-06-25 14:21:47,251 INFO [train.py:996] (3/4) Epoch 11, batch 29300, loss[loss=0.1882, simple_loss=0.2609, pruned_loss=0.0578, over 21681.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3058, pruned_loss=0.07781, over 4269175.47 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:21:48,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. 
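limit=8.0

The [scaling.py:962] Whitening lines compare a statistic of the feature covariance against a scheduled limit; when the metric exceeds the limit, a penalty nudges the features back toward a flatter covariance spectrum. One metric with the logged behaviour, equal to 1.0 when the per-group covariance is proportional to the identity and growing as the eigenvalue spectrum spreads, is the mean squared eigenvalue divided by the squared mean eigenvalue. The sketch below is our reading of the log, not necessarily the exact scaling.py formula:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
        # x: (num_frames, num_channels), num_channels divisible by num_groups
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = x.transpose(1, 2) @ x / n                 # per-group covariance
        mean_eig = cov.diagonal(dim1=1, dim2=2).mean(dim=1)        # trace / d
        mean_sq_eig = (cov ** 2).sum(dim=(1, 2)) / (c // num_groups)
        return (mean_sq_eig / mean_eig ** 2).mean().item()

    x = torch.randn(1000, 192)     # near-white features give a metric near 1
    print(whitening_metric(x, num_groups=1))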
2023-06-25 14:22:25,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.346e+02 1.272e+03 1.765e+03 3.710e+03, threshold=2.544e+03, percent-clipped=11.0
2023-06-25 14:23:26,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2005716.0, ans=0.0
2023-06-25 14:23:31,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2005776.0, ans=0.125
2023-06-25 14:23:38,504 INFO [train.py:996] (3/4) Epoch 11, batch 29350, loss[loss=0.2286, simple_loss=0.3057, pruned_loss=0.07576, over 21631.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3018, pruned_loss=0.07759, over 4276202.55 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:24:01,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2005836.0, ans=0.125
2023-06-25 14:24:01,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2005836.0, ans=0.0
2023-06-25 14:24:11,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2005836.0, ans=0.125
2023-06-25 14:24:40,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2005956.0, ans=0.05
2023-06-25 14:24:50,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2005956.0, ans=0.125
2023-06-25 14:25:19,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2006016.0, ans=0.07
2023-06-25 14:25:26,800 INFO [train.py:996] (3/4) Epoch 11, batch 29400, loss[loss=0.192, simple_loss=0.2714, pruned_loss=0.05631, over 21707.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3013, pruned_loss=0.07489, over 4278384.81 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:26:04,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.606e+02 1.280e+03 1.886e+03 3.409e+03, threshold=2.560e+03, percent-clipped=11.0
2023-06-25 14:26:33,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2006256.0, ans=0.0
2023-06-25 14:26:50,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0
2023-06-25 14:26:57,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2006316.0, ans=0.0
2023-06-25 14:27:14,992 INFO [train.py:996] (3/4) Epoch 11, batch 29450, loss[loss=0.2522, simple_loss=0.3329, pruned_loss=0.08577, over 21586.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2972, pruned_loss=0.07347, over 4273707.40 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0
2023-06-25 14:28:00,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2006496.0, ans=0.0
2023-06-25 14:28:09,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.39 vs.
limit=10.0 2023-06-25 14:28:10,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2006496.0, ans=0.125 2023-06-25 14:28:15,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2006556.0, ans=0.0 2023-06-25 14:28:23,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2006556.0, ans=0.95 2023-06-25 14:28:23,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006556.0, ans=0.1 2023-06-25 14:28:47,919 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:29:00,479 INFO [train.py:996] (3/4) Epoch 11, batch 29500, loss[loss=0.2469, simple_loss=0.3128, pruned_loss=0.09044, over 21562.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3036, pruned_loss=0.07721, over 4276621.57 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:29:16,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-25 14:29:27,975 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:29:29,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006736.0, ans=0.125 2023-06-25 14:29:31,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2006736.0, ans=0.2 2023-06-25 14:29:44,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.728e+02 1.187e+03 1.757e+03 3.879e+03, threshold=2.373e+03, percent-clipped=3.0 2023-06-25 14:30:02,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-25 14:30:42,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2006916.0, ans=0.125 2023-06-25 14:30:48,737 INFO [train.py:996] (3/4) Epoch 11, batch 29550, loss[loss=0.2903, simple_loss=0.3421, pruned_loss=0.1192, over 21729.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3058, pruned_loss=0.07966, over 4280931.06 frames. ], batch size: 473, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:31:10,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2007036.0, ans=0.07 2023-06-25 14:31:38,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-25 14:32:42,226 INFO [train.py:996] (3/4) Epoch 11, batch 29600, loss[loss=0.2059, simple_loss=0.2946, pruned_loss=0.05855, over 21351.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3126, pruned_loss=0.0822, over 4282928.80 frames. 
], batch size: 131, lr: 2.59e-03, grad_scale: 32.0 2023-06-25 14:32:42,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2007276.0, ans=0.2 2023-06-25 14:32:46,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2007276.0, ans=0.125 2023-06-25 14:32:51,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2007276.0, ans=0.0 2023-06-25 14:33:09,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2007336.0, ans=0.0 2023-06-25 14:33:17,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2007336.0, ans=0.2 2023-06-25 14:33:22,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.467e+02 8.361e+02 1.294e+03 2.319e+03 6.850e+03, threshold=2.587e+03, percent-clipped=23.0 2023-06-25 14:34:26,203 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-25 14:34:27,140 INFO [train.py:996] (3/4) Epoch 11, batch 29650, loss[loss=0.1889, simple_loss=0.2591, pruned_loss=0.05933, over 21302.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3123, pruned_loss=0.07965, over 4279380.50 frames. ], batch size: 159, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:34:50,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2007636.0, ans=0.0 2023-06-25 14:34:53,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.27 vs. limit=6.0 2023-06-25 14:35:23,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2007696.0, ans=0.125 2023-06-25 14:35:52,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2007816.0, ans=0.125 2023-06-25 14:36:16,282 INFO [train.py:996] (3/4) Epoch 11, batch 29700, loss[loss=0.3236, simple_loss=0.4154, pruned_loss=0.1159, over 21550.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3136, pruned_loss=0.0795, over 4282574.07 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:36:24,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2007876.0, ans=0.0 2023-06-25 14:37:02,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.011e+02 9.154e+02 1.304e+03 2.529e+03 6.535e+03, threshold=2.607e+03, percent-clipped=22.0 2023-06-25 14:37:08,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2007996.0, ans=0.125 2023-06-25 14:37:17,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2007996.0, ans=0.0 2023-06-25 14:37:20,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2007996.0, ans=0.1 2023-06-25 14:38:01,611 INFO [train.py:996] (3/4) Epoch 11, batch 29750, loss[loss=0.2127, simple_loss=0.3179, pruned_loss=0.05377, over 19853.00 frames. 
], tot_loss[loss=0.2381, simple_loss=0.3177, pruned_loss=0.07922, over 4285175.53 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:38:56,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2008296.0, ans=0.125 2023-06-25 14:39:32,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2008416.0, ans=0.1 2023-06-25 14:39:39,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2008416.0, ans=0.07 2023-06-25 14:39:45,385 INFO [train.py:996] (3/4) Epoch 11, batch 29800, loss[loss=0.2199, simple_loss=0.314, pruned_loss=0.06295, over 17342.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3185, pruned_loss=0.08038, over 4284342.98 frames. ], batch size: 60, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:40:30,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.324e+02 8.347e+02 1.266e+03 1.868e+03 3.431e+03, threshold=2.532e+03, percent-clipped=7.0 2023-06-25 14:40:33,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2008596.0, ans=0.125 2023-06-25 14:41:28,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2008776.0, ans=0.0 2023-06-25 14:41:30,259 INFO [train.py:996] (3/4) Epoch 11, batch 29850, loss[loss=0.2059, simple_loss=0.2863, pruned_loss=0.0627, over 21868.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3135, pruned_loss=0.0784, over 4284192.03 frames. ], batch size: 333, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:41:54,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2008836.0, ans=0.0 2023-06-25 14:42:40,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=2008956.0, ans=0.2 2023-06-25 14:42:42,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2008956.0, ans=0.1 2023-06-25 14:43:01,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2009016.0, ans=0.2 2023-06-25 14:43:07,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2009016.0, ans=0.0 2023-06-25 14:43:16,103 INFO [train.py:996] (3/4) Epoch 11, batch 29900, loss[loss=0.2559, simple_loss=0.322, pruned_loss=0.0949, over 21651.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3115, pruned_loss=0.07942, over 4292385.26 frames. 
], batch size: 263, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:43:41,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2009136.0, ans=0.04949747468305833 2023-06-25 14:44:02,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.086e+02 7.684e+02 1.155e+03 1.725e+03 4.466e+03, threshold=2.311e+03, percent-clipped=10.0 2023-06-25 14:44:20,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2009196.0, ans=0.125 2023-06-25 14:44:22,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2009256.0, ans=0.0 2023-06-25 14:45:08,294 INFO [train.py:996] (3/4) Epoch 11, batch 29950, loss[loss=0.29, simple_loss=0.3551, pruned_loss=0.1124, over 21552.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.316, pruned_loss=0.08322, over 4292003.24 frames. ], batch size: 415, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:45:40,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2009436.0, ans=0.125 2023-06-25 14:45:52,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-25 14:45:56,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=2009496.0, ans=0.02 2023-06-25 14:46:32,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-25 14:46:55,261 INFO [train.py:996] (3/4) Epoch 11, batch 30000, loss[loss=0.1991, simple_loss=0.2865, pruned_loss=0.05588, over 21320.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3177, pruned_loss=0.08354, over 4294185.46 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:46:55,261 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 14:47:12,453 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7789, 1.6904, 1.4054, 1.8798, 0.9306, 1.7918, 1.7245, 1.5825], device='cuda:3') 2023-06-25 14:47:14,791 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2475, simple_loss=0.3451, pruned_loss=0.07497, over 1796401.00 frames. 2023-06-25 14:47:14,792 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 14:47:16,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.84 vs. limit=10.0 2023-06-25 14:47:17,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-25 14:47:48,155 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:48:03,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 8.420e+02 1.340e+03 1.867e+03 3.638e+03, threshold=2.681e+03, percent-clipped=9.0 2023-06-25 14:48:35,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. 
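limit=12.0

The [zipformer.py:1728] line above, printed while the validation loss is computed, reports one entropy per attention head of the named layer (eight values here): the entropy of each head's attention distribution over keys, averaged across queries. Low values flag heads that concentrate on very few positions. A sketch under those assumptions (the exact averaging in zipformer.py may differ):

    import torch

    def attn_weights_entropy(attn: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
        # attn: (num_heads, num_queries, num_keys), each row sums to 1
        ent = -(attn * (attn + eps).log()).sum(dim=-1)
        return ent.mean(dim=-1)                        # one entropy per head

    attn = torch.softmax(torch.randn(8, 50, 50), dim=-1)
    print(attn_weights_entropy(attn))  # eight per-head values, cf. the log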
2023-06-25 14:48:50,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2009916.0, ans=0.0
2023-06-25 14:49:06,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.73 vs. limit=22.5
2023-06-25 14:49:16,214 INFO [train.py:996] (3/4) Epoch 11, batch 30050, loss[loss=0.2508, simple_loss=0.386, pruned_loss=0.05783, over 20752.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3207, pruned_loss=0.08051, over 4288562.67 frames. ], batch size: 607, lr: 2.59e-03, grad_scale: 8.0
2023-06-25 14:49:18,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2009976.0, ans=0.0
2023-06-25 14:50:21,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2010156.0, ans=0.125
2023-06-25 14:50:23,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2010156.0, ans=0.0
2023-06-25 14:50:54,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2010216.0, ans=0.0
2023-06-25 14:51:01,283 INFO [train.py:996] (3/4) Epoch 11, batch 30100, loss[loss=0.2129, simple_loss=0.281, pruned_loss=0.07235, over 21603.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.319, pruned_loss=0.07975, over 4285035.87 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 8.0
2023-06-25 14:51:10,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0
2023-06-25 14:51:30,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0
2023-06-25 14:51:41,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2010396.0, ans=0.0
2023-06-25 14:51:44,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.182e+02 9.788e+02 1.555e+03 2.396e+03 5.388e+03, threshold=3.111e+03, percent-clipped=17.0
2023-06-25 14:52:49,213 INFO [train.py:996] (3/4) Epoch 11, batch 30150, loss[loss=0.2512, simple_loss=0.3052, pruned_loss=0.09859, over 20224.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3155, pruned_loss=0.08197, over 4284027.68 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0
2023-06-25 14:52:49,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2010576.0, ans=0.125
2023-06-25 14:52:54,401 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-25 14:53:10,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0
2023-06-25 14:53:17,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=22.5
2023-06-25 14:54:46,704 INFO [train.py:996] (3/4) Epoch 11, batch 30200, loss[loss=0.2398, simple_loss=0.3528, pruned_loss=0.0634, over 21226.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.317, pruned_loss=0.07971, over 4272792.72 frames.
], batch size: 549, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:54:55,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2010876.0, ans=0.125 2023-06-25 14:55:34,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 8.137e+02 1.157e+03 1.769e+03 3.974e+03, threshold=2.314e+03, percent-clipped=2.0 2023-06-25 14:56:06,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2011056.0, ans=0.125 2023-06-25 14:56:30,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 14:56:34,975 INFO [train.py:996] (3/4) Epoch 11, batch 30250, loss[loss=0.2427, simple_loss=0.3438, pruned_loss=0.07075, over 21645.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.323, pruned_loss=0.08109, over 4269058.29 frames. ], batch size: 230, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:56:42,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-25 14:56:51,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2011176.0, ans=0.125 2023-06-25 14:57:00,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2011176.0, ans=0.2 2023-06-25 14:57:45,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2011356.0, ans=0.025 2023-06-25 14:58:20,272 INFO [train.py:996] (3/4) Epoch 11, batch 30300, loss[loss=0.223, simple_loss=0.2844, pruned_loss=0.08082, over 21795.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3189, pruned_loss=0.08093, over 4269765.72 frames. ], batch size: 352, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:59:09,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2011596.0, ans=0.125 2023-06-25 14:59:14,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 1.032e+03 1.380e+03 1.875e+03 4.556e+03, threshold=2.761e+03, percent-clipped=17.0 2023-06-25 14:59:25,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-25 14:59:37,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2011656.0, ans=0.04949747468305833 2023-06-25 15:00:21,496 INFO [train.py:996] (3/4) Epoch 11, batch 30350, loss[loss=0.3281, simple_loss=0.4133, pruned_loss=0.1214, over 21443.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3192, pruned_loss=0.08215, over 4267683.38 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:00:47,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2011836.0, ans=0.2 2023-06-25 15:00:50,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2011836.0, ans=0.1 2023-06-25 15:01:43,048 INFO [train.py:996] (3/4) Epoch 11, batch 30400, loss[loss=0.2066, simple_loss=0.2626, pruned_loss=0.07528, over 20391.00 frames. 
], tot_loss[loss=0.238, simple_loss=0.3146, pruned_loss=0.08072, over 4259870.72 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 15:01:43,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2012076.0, ans=0.2 2023-06-25 15:01:57,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2012076.0, ans=0.125 2023-06-25 15:02:24,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 1.116e+03 1.633e+03 2.614e+03 1.022e+04, threshold=3.266e+03, percent-clipped=19.0 2023-06-25 15:02:24,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2012196.0, ans=0.0 2023-06-25 15:02:54,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2012256.0, ans=0.125 2023-06-25 15:03:00,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2012316.0, ans=0.1 2023-06-25 15:03:10,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2012376.0, ans=0.125 2023-06-25 15:03:11,936 INFO [train.py:996] (3/4) Epoch 11, batch 30450, loss[loss=0.2675, simple_loss=0.3927, pruned_loss=0.07115, over 19880.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3157, pruned_loss=0.07933, over 4200714.24 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:04:06,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2012556.0, ans=0.05 2023-06-25 15:04:22,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-25 15:04:22,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2012616.0, ans=0.0 2023-06-25 15:06:14,780 INFO [train.py:996] (3/4) Epoch 12, batch 0, loss[loss=0.2376, simple_loss=0.301, pruned_loss=0.08713, over 21807.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.301, pruned_loss=0.08713, over 21807.00 frames. ], batch size: 102, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:06:14,781 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 15:06:38,471 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.246, simple_loss=0.3509, pruned_loss=0.07057, over 1796401.00 frames. 
2023-06-25 15:06:38,472 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 15:06:42,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2012646.0, ans=0.125 2023-06-25 15:06:45,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2012646.0, ans=0.1 2023-06-25 15:07:19,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2012766.0, ans=0.0 2023-06-25 15:07:28,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2012766.0, ans=0.125 2023-06-25 15:07:29,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.737e+02 2.108e+03 3.291e+03 4.750e+03 1.246e+04, threshold=6.583e+03, percent-clipped=51.0 2023-06-25 15:07:58,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-25 15:08:23,994 INFO [train.py:996] (3/4) Epoch 12, batch 50, loss[loss=0.2772, simple_loss=0.3805, pruned_loss=0.087, over 21641.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.322, pruned_loss=0.08094, over 956001.58 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:08:26,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-25 15:08:34,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-25 15:09:07,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2013066.0, ans=0.0 2023-06-25 15:09:57,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013186.0, ans=0.1 2023-06-25 15:10:07,017 INFO [train.py:996] (3/4) Epoch 12, batch 100, loss[loss=0.2348, simple_loss=0.3271, pruned_loss=0.07124, over 21867.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3332, pruned_loss=0.07965, over 1687533.77 frames. ], batch size: 316, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:10:39,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2013306.0, ans=0.0 2023-06-25 15:11:03,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.950e+02 8.744e+02 1.296e+03 2.082e+03 4.002e+03, threshold=2.593e+03, percent-clipped=0.0 2023-06-25 15:11:22,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2013486.0, ans=0.125 2023-06-25 15:11:29,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2013486.0, ans=0.0 2023-06-25 15:11:43,825 INFO [train.py:996] (3/4) Epoch 12, batch 150, loss[loss=0.2348, simple_loss=0.3234, pruned_loss=0.07311, over 21551.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3353, pruned_loss=0.08116, over 2263330.87 frames. 
], batch size: 230, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:12:01,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2013546.0, ans=0.2 2023-06-25 15:13:32,353 INFO [train.py:996] (3/4) Epoch 12, batch 200, loss[loss=0.2067, simple_loss=0.2794, pruned_loss=0.067, over 21118.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.333, pruned_loss=0.08216, over 2696540.41 frames. ], batch size: 143, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:14:14,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2013906.0, ans=0.125 2023-06-25 15:14:31,664 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 8.933e+02 1.301e+03 1.792e+03 3.949e+03, threshold=2.602e+03, percent-clipped=5.0 2023-06-25 15:14:45,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-25 15:15:02,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2014086.0, ans=0.0 2023-06-25 15:15:20,314 INFO [train.py:996] (3/4) Epoch 12, batch 250, loss[loss=0.2206, simple_loss=0.2852, pruned_loss=0.07795, over 21796.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3282, pruned_loss=0.08234, over 3052070.46 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:15:44,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2014206.0, ans=0.125 2023-06-25 15:15:56,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2014206.0, ans=0.0 2023-06-25 15:16:57,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2014386.0, ans=0.0 2023-06-25 15:17:00,562 INFO [train.py:996] (3/4) Epoch 12, batch 300, loss[loss=0.2428, simple_loss=0.3041, pruned_loss=0.09076, over 21888.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3212, pruned_loss=0.08158, over 3335046.09 frames. ], batch size: 316, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:17:23,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2014506.0, ans=0.125 2023-06-25 15:18:01,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 8.285e+02 1.102e+03 1.636e+03 4.756e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-25 15:18:22,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2014626.0, ans=0.2 2023-06-25 15:18:49,625 INFO [train.py:996] (3/4) Epoch 12, batch 350, loss[loss=0.2464, simple_loss=0.3193, pruned_loss=0.08672, over 21372.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3147, pruned_loss=0.08123, over 3547117.62 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:19:28,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2014806.0, ans=0.1 2023-06-25 15:20:00,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2014926.0, ans=0.125 2023-06-25 15:20:37,685 INFO [train.py:996] (3/4) Epoch 12, batch 400, loss[loss=0.2047, simple_loss=0.2633, pruned_loss=0.07301, over 21277.00 frames. 
], tot_loss[loss=0.2336, simple_loss=0.308, pruned_loss=0.07957, over 3714863.58 frames. ], batch size: 551, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:20:57,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=8.0 2023-06-25 15:21:20,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2015166.0, ans=0.0 2023-06-25 15:21:37,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.810e+02 9.305e+02 1.280e+03 1.857e+03 4.239e+03, threshold=2.560e+03, percent-clipped=17.0 2023-06-25 15:22:24,699 INFO [train.py:996] (3/4) Epoch 12, batch 450, loss[loss=0.1929, simple_loss=0.2931, pruned_loss=0.04636, over 21676.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3039, pruned_loss=0.07732, over 3841197.10 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:22:40,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2015346.0, ans=0.1 2023-06-25 15:22:51,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-25 15:23:36,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2015526.0, ans=0.0 2023-06-25 15:23:52,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-25 15:24:15,406 INFO [train.py:996] (3/4) Epoch 12, batch 500, loss[loss=0.2064, simple_loss=0.2763, pruned_loss=0.06824, over 21766.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3039, pruned_loss=0.07527, over 3934146.00 frames. ], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:24:46,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2015706.0, ans=0.125 2023-06-25 15:25:00,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 15:25:14,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.486e+02 9.539e+02 1.426e+03 2.120e+03 6.298e+03, threshold=2.852e+03, percent-clipped=19.0 2023-06-25 15:25:15,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2015826.0, ans=0.125 2023-06-25 15:25:17,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-25 15:26:02,510 INFO [train.py:996] (3/4) Epoch 12, batch 550, loss[loss=0.2263, simple_loss=0.3016, pruned_loss=0.07544, over 21887.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3095, pruned_loss=0.07564, over 4006296.02 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:27:17,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2016126.0, ans=0.125 2023-06-25 15:27:49,288 INFO [train.py:996] (3/4) Epoch 12, batch 600, loss[loss=0.3017, simple_loss=0.3971, pruned_loss=0.1031, over 21535.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3144, pruned_loss=0.0767, over 4072776.16 frames. 
], batch size: 508, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:27:56,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2016246.0, ans=0.2 2023-06-25 15:28:30,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2016366.0, ans=0.04949747468305833 2023-06-25 15:28:45,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:28:49,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.802e+02 1.034e+03 1.598e+03 2.241e+03 5.970e+03, threshold=3.196e+03, percent-clipped=11.0 2023-06-25 15:29:38,759 INFO [train.py:996] (3/4) Epoch 12, batch 650, loss[loss=0.2639, simple_loss=0.3263, pruned_loss=0.1008, over 21770.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3159, pruned_loss=0.07769, over 4109781.52 frames. ], batch size: 508, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:29:42,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-25 15:29:47,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2016546.0, ans=0.125 2023-06-25 15:30:38,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2016666.0, ans=0.125 2023-06-25 15:30:45,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-25 15:31:26,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2016846.0, ans=0.035 2023-06-25 15:31:28,086 INFO [train.py:996] (3/4) Epoch 12, batch 700, loss[loss=0.4043, simple_loss=0.4661, pruned_loss=0.1713, over 21569.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3149, pruned_loss=0.07862, over 4153070.57 frames. ], batch size: 508, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:31:36,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2016846.0, ans=0.1 2023-06-25 15:31:41,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2016846.0, ans=0.125 2023-06-25 15:31:44,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-25 15:31:55,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2016906.0, ans=15.0 2023-06-25 15:32:06,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. 
limit=15.0 2023-06-25 15:32:25,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2016966.0, ans=0.0 2023-06-25 15:32:28,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 8.196e+02 1.234e+03 2.056e+03 5.759e+03, threshold=2.467e+03, percent-clipped=11.0 2023-06-25 15:32:34,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2017026.0, ans=0.125 2023-06-25 15:32:38,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2017026.0, ans=0.125 2023-06-25 15:32:44,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2017026.0, ans=0.0 2023-06-25 15:33:10,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2017086.0, ans=0.2 2023-06-25 15:33:16,913 INFO [train.py:996] (3/4) Epoch 12, batch 750, loss[loss=0.2345, simple_loss=0.364, pruned_loss=0.05255, over 19801.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3139, pruned_loss=0.07975, over 4180279.90 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:34:19,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2017326.0, ans=0.125 2023-06-25 15:34:36,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 15:35:08,412 INFO [train.py:996] (3/4) Epoch 12, batch 800, loss[loss=0.2288, simple_loss=0.317, pruned_loss=0.07028, over 21706.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3118, pruned_loss=0.07991, over 4194111.98 frames. ], batch size: 298, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:36:07,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-25 15:36:10,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.338e+02 9.265e+02 1.322e+03 1.960e+03 3.991e+03, threshold=2.645e+03, percent-clipped=16.0 2023-06-25 15:36:12,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-25 15:36:36,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-25 15:36:58,009 INFO [train.py:996] (3/4) Epoch 12, batch 850, loss[loss=0.2021, simple_loss=0.293, pruned_loss=0.05555, over 21657.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3083, pruned_loss=0.07888, over 4214042.97 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:37:25,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.16 vs. limit=10.0 2023-06-25 15:37:51,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2017866.0, ans=0.2 2023-06-25 15:38:02,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. 
2023-06-25 15:38:26,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2017986.0, ans=0.2 2023-06-25 15:38:46,345 INFO [train.py:996] (3/4) Epoch 12, batch 900, loss[loss=0.1816, simple_loss=0.2547, pruned_loss=0.05432, over 21848.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3052, pruned_loss=0.07792, over 4236895.65 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:38:50,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018046.0, ans=0.1 2023-06-25 15:39:03,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2018046.0, ans=0.125 2023-06-25 15:39:06,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2018106.0, ans=0.125 2023-06-25 15:39:49,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.898e+02 1.048e+03 1.588e+03 2.681e+03 4.714e+03, threshold=3.177e+03, percent-clipped=25.0 2023-06-25 15:39:50,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2018226.0, ans=0.125 2023-06-25 15:40:09,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2018226.0, ans=0.2 2023-06-25 15:40:23,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2018286.0, ans=0.125 2023-06-25 15:40:34,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2018286.0, ans=0.125 2023-06-25 15:40:37,407 INFO [train.py:996] (3/4) Epoch 12, batch 950, loss[loss=0.2279, simple_loss=0.2844, pruned_loss=0.08569, over 21316.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.304, pruned_loss=0.07764, over 4253220.12 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:40:42,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2018346.0, ans=0.125 2023-06-25 15:41:26,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2018466.0, ans=0.1 2023-06-25 15:41:38,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2018526.0, ans=0.125 2023-06-25 15:42:25,944 INFO [train.py:996] (3/4) Epoch 12, batch 1000, loss[loss=0.1902, simple_loss=0.2607, pruned_loss=0.05981, over 21169.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.304, pruned_loss=0.07687, over 4262728.07 frames. ], batch size: 143, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:42:58,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2018706.0, ans=0.0 2023-06-25 15:43:00,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2018706.0, ans=0.125 2023-06-25 15:43:07,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs.
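limit=15.0

Most of the scaling.py traffic consists of ScheduledFloat entries: hyperparameters such as dropout_p, the balancer prob values and the various skip_rate values are not constants but functions of batch_count, and the ans=... field is simply the schedule evaluated at the current batch count (by this point in training most have settled at their final values, e.g. ans=0.125 or ans=0.1). A plausible minimal implementation, assuming piecewise-linear interpolation over (batch_count, value) breakpoints; the breakpoints below are invented for illustration:

    # Sketch of a ScheduledFloat-style value; the breakpoints are made up.
    class PiecewiseLinear:
        def __init__(self, *points):            # points: (batch_count, value)
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:           # interpolate inside [x0, x1]
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    # e.g. a dropout rate annealed from 0.3 to 0.1 over the first 20k batches
    dropout_p = PiecewiseLinear((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(2018046.0))                 # -> 0.1, long past the ramp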
2023-06-25 15:43:12,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2018766.0, ans=0.0 2023-06-25 15:43:29,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.331e+02 1.352e+03 1.940e+03 3.326e+03, threshold=2.703e+03, percent-clipped=1.0 2023-06-25 15:43:53,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2018826.0, ans=0.025 2023-06-25 15:43:53,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2018826.0, ans=0.125 2023-06-25 15:44:21,936 INFO [train.py:996] (3/4) Epoch 12, batch 1050, loss[loss=0.2383, simple_loss=0.3096, pruned_loss=0.0835, over 21773.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3037, pruned_loss=0.07701, over 4266498.19 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:44:25,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2018946.0, ans=0.125 2023-06-25 15:45:16,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2019066.0, ans=0.125 2023-06-25 15:46:07,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-25 15:46:13,564 INFO [train.py:996] (3/4) Epoch 12, batch 1100, loss[loss=0.2622, simple_loss=0.3389, pruned_loss=0.09277, over 21297.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3052, pruned_loss=0.0767, over 4270706.30 frames. ], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:46:31,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019246.0, ans=0.1 2023-06-25 15:46:50,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2019306.0, ans=0.0 2023-06-25 15:46:55,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019366.0, ans=0.1 2023-06-25 15:47:12,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.833e+02 8.143e+02 1.241e+03 1.792e+03 5.093e+03, threshold=2.482e+03, percent-clipped=8.0 2023-06-25 15:47:51,890 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:47:59,966 INFO [train.py:996] (3/4) Epoch 12, batch 1150, loss[loss=0.3092, simple_loss=0.3707, pruned_loss=0.1239, over 21589.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3053, pruned_loss=0.07561, over 4268207.72 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:48:00,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs.
limit=22.5 2023-06-25 15:48:07,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2019546.0, ans=0.07 2023-06-25 15:48:12,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2019546.0, ans=0.0 2023-06-25 15:49:09,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019726.0, ans=0.1 2023-06-25 15:49:31,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-25 15:49:56,063 INFO [train.py:996] (3/4) Epoch 12, batch 1200, loss[loss=0.2581, simple_loss=0.3367, pruned_loss=0.08974, over 21749.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3057, pruned_loss=0.07549, over 4268542.39 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:49:58,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2019846.0, ans=0.05 2023-06-25 15:49:58,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2019846.0, ans=0.125 2023-06-25 15:50:11,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2019906.0, ans=0.0 2023-06-25 15:50:38,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-25 15:51:00,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.676e+02 8.295e+02 1.207e+03 1.708e+03 3.534e+03, threshold=2.414e+03, percent-clipped=4.0 2023-06-25 15:51:47,225 INFO [train.py:996] (3/4) Epoch 12, batch 1250, loss[loss=0.2415, simple_loss=0.3178, pruned_loss=0.08259, over 21386.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.309, pruned_loss=0.07755, over 4275022.91 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:51:52,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-25 15:52:06,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2020206.0, ans=0.125 2023-06-25 15:52:47,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2020266.0, ans=0.0 2023-06-25 15:52:48,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-25 15:53:41,550 INFO [train.py:996] (3/4) Epoch 12, batch 1300, loss[loss=0.2548, simple_loss=0.3241, pruned_loss=0.09275, over 21846.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3107, pruned_loss=0.07829, over 4285176.82 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:54:20,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2020566.0, ans=0.125 2023-06-25 15:54:36,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. 
limit=15.0 2023-06-25 15:54:46,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.616e+02 1.419e+03 1.872e+03 2.553e+03 5.619e+03, threshold=3.744e+03, percent-clipped=29.0 2023-06-25 15:55:07,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020686.0, ans=0.1 2023-06-25 15:55:30,060 INFO [train.py:996] (3/4) Epoch 12, batch 1350, loss[loss=0.1968, simple_loss=0.2769, pruned_loss=0.05828, over 21206.00 frames. ], tot_loss[loss=0.235, simple_loss=0.312, pruned_loss=0.079, over 4288716.41 frames. ], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:55:35,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2020746.0, ans=0.125 2023-06-25 15:56:08,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2020806.0, ans=0.2 2023-06-25 15:56:51,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2020926.0, ans=0.0 2023-06-25 15:57:02,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2020986.0, ans=0.125 2023-06-25 15:57:18,675 INFO [train.py:996] (3/4) Epoch 12, batch 1400, loss[loss=0.2258, simple_loss=0.3095, pruned_loss=0.07101, over 21282.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.31, pruned_loss=0.07855, over 4292789.58 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:58:29,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2021226.0, ans=0.125 2023-06-25 15:58:31,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 6.812e+02 9.835e+02 1.527e+03 2.832e+03, threshold=1.967e+03, percent-clipped=0.0 2023-06-25 15:58:51,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2021286.0, ans=0.125 2023-06-25 15:59:07,913 INFO [train.py:996] (3/4) Epoch 12, batch 1450, loss[loss=0.2225, simple_loss=0.3402, pruned_loss=0.05238, over 19815.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3101, pruned_loss=0.07831, over 4293209.50 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:00:56,907 INFO [train.py:996] (3/4) Epoch 12, batch 1500, loss[loss=0.2499, simple_loss=0.3245, pruned_loss=0.08768, over 21738.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3109, pruned_loss=0.07988, over 4293848.91 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:01:06,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-25 16:01:51,364 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:02:04,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.651e+02 8.922e+02 1.288e+03 1.827e+03 4.851e+03, threshold=2.577e+03, percent-clipped=21.0 2023-06-25 16:02:05,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. 
limit=6.0 2023-06-25 16:02:24,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2021886.0, ans=0.125 2023-06-25 16:02:43,230 INFO [train.py:996] (3/4) Epoch 12, batch 1550, loss[loss=0.2454, simple_loss=0.3161, pruned_loss=0.08737, over 21824.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3091, pruned_loss=0.07854, over 4293268.77 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:02:49,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2021946.0, ans=0.125 2023-06-25 16:02:56,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2021946.0, ans=0.125 2023-06-25 16:03:41,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2022066.0, ans=0.125 2023-06-25 16:03:48,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2022066.0, ans=0.0 2023-06-25 16:04:06,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2022126.0, ans=0.125 2023-06-25 16:04:30,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2022186.0, ans=0.125 2023-06-25 16:04:35,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=2022246.0, ans=15.0 2023-06-25 16:04:36,321 INFO [train.py:996] (3/4) Epoch 12, batch 1600, loss[loss=0.2222, simple_loss=0.2944, pruned_loss=0.07497, over 21669.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3058, pruned_loss=0.07666, over 4284911.63 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:04:37,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-25 16:05:55,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2022426.0, ans=0.0 2023-06-25 16:06:00,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.608e+02 1.362e+03 1.874e+03 5.231e+03, threshold=2.724e+03, percent-clipped=11.0 2023-06-25 16:06:38,383 INFO [train.py:996] (3/4) Epoch 12, batch 1650, loss[loss=0.2223, simple_loss=0.3098, pruned_loss=0.06743, over 21937.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3043, pruned_loss=0.07599, over 4289603.78 frames. ], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:07:02,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2022606.0, ans=0.125 2023-06-25 16:07:10,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2022606.0, ans=0.125 2023-06-25 16:07:26,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2022666.0, ans=0.035 2023-06-25 16:08:31,810 INFO [train.py:996] (3/4) Epoch 12, batch 1700, loss[loss=0.2449, simple_loss=0.3158, pruned_loss=0.08706, over 21448.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3051, pruned_loss=0.0772, over 4283942.38 frames. 
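], batch size: 194, lr: 2.47e-03, grad_scale: 8.0

The grad_scale field drifts up and down across these entries (32.0 at batch 1200, 16.0 at batch 1600, 8.0 here), which is the signature of dynamic loss scaling for fp16 training: the scale is halved whenever an overflow is detected and grown back after a run of clean steps. A hedged sketch of that loop using PyTorch's stock GradScaler (the model, batch and criterion names are placeholders; the training script's own bookkeeping may differ in detail):

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0,
                                       growth_factor=2.0,   # double on growth
                                       backoff_factor=0.5,  # halve on inf/nan
                                       growth_interval=2000)

    def train_step(model, optimizer, batch, criterion):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(batch["inputs"]), batch["targets"])
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optimizer)          # unscales grads; skips step on overflow
        scaler.update()                 # adjust the scale for the next batch
        return loss.detach(), scaler.get_scale()   # the logged grad_scale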
2023-06-25 16:08:36,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-25 16:09:13,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022906.0, ans=0.1 2023-06-25 16:09:43,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023026.0, ans=0.1 2023-06-25 16:09:48,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 9.466e+02 1.193e+03 1.792e+03 3.263e+03, threshold=2.387e+03, percent-clipped=3.0 2023-06-25 16:10:32,416 INFO [train.py:996] (3/4) Epoch 12, batch 1750, loss[loss=0.1724, simple_loss=0.2622, pruned_loss=0.04126, over 21764.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3069, pruned_loss=0.07632, over 4285591.44 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:10:38,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2023146.0, ans=0.1 2023-06-25 16:12:06,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2023386.0, ans=0.125 2023-06-25 16:12:29,635 INFO [train.py:996] (3/4) Epoch 12, batch 1800, loss[loss=0.2281, simple_loss=0.3261, pruned_loss=0.06504, over 21710.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3051, pruned_loss=0.0732, over 4274927.81 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:12:35,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2023446.0, ans=0.125 2023-06-25 16:13:02,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2023506.0, ans=0.125 2023-06-25 16:13:40,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.675e+02 9.606e+02 1.409e+03 2.077e+03 5.009e+03, threshold=2.818e+03, percent-clipped=17.0 2023-06-25 16:14:21,244 INFO [train.py:996] (3/4) Epoch 12, batch 1850, loss[loss=0.2709, simple_loss=0.3594, pruned_loss=0.09116, over 21522.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3076, pruned_loss=0.07176, over 4272953.05 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:14:55,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2023-06-25 16:15:07,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2023866.0, ans=0.0 2023-06-25 16:15:32,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-25 16:16:10,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-25 16:16:21,179 INFO [train.py:996] (3/4) Epoch 12, batch 1900, loss[loss=0.2405, simple_loss=0.3054, pruned_loss=0.08775, over 21808.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3092, pruned_loss=0.07278, over 4276516.78 frames.
], batch size: 112, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:16:29,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-25 16:16:35,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2024046.0, ans=0.0 2023-06-25 16:16:40,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2024106.0, ans=0.125 2023-06-25 16:17:30,668 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 9.175e+02 1.448e+03 2.003e+03 3.751e+03, threshold=2.896e+03, percent-clipped=10.0 2023-06-25 16:17:57,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2024286.0, ans=0.0 2023-06-25 16:18:12,806 INFO [train.py:996] (3/4) Epoch 12, batch 1950, loss[loss=0.177, simple_loss=0.2482, pruned_loss=0.05294, over 21582.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3057, pruned_loss=0.07261, over 4280650.35 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:18:27,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2024346.0, ans=0.0 2023-06-25 16:19:55,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2024586.0, ans=0.0 2023-06-25 16:20:04,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2024646.0, ans=0.0 2023-06-25 16:20:05,995 INFO [train.py:996] (3/4) Epoch 12, batch 2000, loss[loss=0.2434, simple_loss=0.3342, pruned_loss=0.07635, over 20018.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3012, pruned_loss=0.07151, over 4269761.04 frames. ], batch size: 702, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:20:47,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2024766.0, ans=22.5 2023-06-25 16:20:55,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2024766.0, ans=0.125 2023-06-25 16:21:15,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.911e+02 1.522e+03 2.173e+03 4.229e+03, threshold=3.044e+03, percent-clipped=10.0 2023-06-25 16:21:56,964 INFO [train.py:996] (3/4) Epoch 12, batch 2050, loss[loss=0.2468, simple_loss=0.3335, pruned_loss=0.08001, over 21571.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3038, pruned_loss=0.07202, over 4271323.38 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:22:04,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2024946.0, ans=0.125 2023-06-25 16:23:18,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2025126.0, ans=0.0 2023-06-25 16:23:39,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2025186.0, ans=0.0 2023-06-25 16:23:44,353 INFO [train.py:996] (3/4) Epoch 12, batch 2100, loss[loss=0.2524, simple_loss=0.3411, pruned_loss=0.08182, over 21585.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3069, pruned_loss=0.07384, over 4266814.54 frames. 
], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:23:59,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-25 16:24:04,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=2025306.0, ans=0.2 2023-06-25 16:24:16,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2025306.0, ans=0.0 2023-06-25 16:24:57,633 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.106e+02 8.786e+02 1.413e+03 2.170e+03 3.783e+03, threshold=2.827e+03, percent-clipped=9.0 2023-06-25 16:25:06,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2025426.0, ans=0.125 2023-06-25 16:25:38,349 INFO [train.py:996] (3/4) Epoch 12, batch 2150, loss[loss=0.2047, simple_loss=0.2792, pruned_loss=0.0651, over 21674.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3066, pruned_loss=0.07569, over 4269506.82 frames. ], batch size: 298, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:25:59,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2025606.0, ans=0.0 2023-06-25 16:25:59,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2025606.0, ans=0.0 2023-06-25 16:27:24,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2025786.0, ans=0.125 2023-06-25 16:27:26,621 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:27:31,492 INFO [train.py:996] (3/4) Epoch 12, batch 2200, loss[loss=0.1832, simple_loss=0.2525, pruned_loss=0.05699, over 15889.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3084, pruned_loss=0.07597, over 4264900.83 frames. ], batch size: 62, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:27:52,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2025906.0, ans=0.125 2023-06-25 16:28:46,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.633e+02 9.347e+02 1.379e+03 2.152e+03 4.543e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 16:29:23,287 INFO [train.py:996] (3/4) Epoch 12, batch 2250, loss[loss=0.201, simple_loss=0.275, pruned_loss=0.06355, over 21655.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3059, pruned_loss=0.07525, over 4262135.38 frames. 
], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:30:10,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2026266.0, ans=0.1 2023-06-25 16:30:32,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2026326.0, ans=0.125 2023-06-25 16:31:01,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2026386.0, ans=0.125 2023-06-25 16:31:01,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2026386.0, ans=0.125 2023-06-25 16:31:01,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2026386.0, ans=0.0 2023-06-25 16:31:14,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-25 16:31:14,989 INFO [train.py:996] (3/4) Epoch 12, batch 2300, loss[loss=0.1946, simple_loss=0.2553, pruned_loss=0.06699, over 21410.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3003, pruned_loss=0.07448, over 4255837.74 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:31:47,860 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:32:11,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-25 16:32:31,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.645e+02 1.249e+03 1.812e+03 4.519e+03, threshold=2.497e+03, percent-clipped=11.0 2023-06-25 16:32:53,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2026686.0, ans=0.125 2023-06-25 16:33:07,258 INFO [train.py:996] (3/4) Epoch 12, batch 2350, loss[loss=0.2355, simple_loss=0.3015, pruned_loss=0.08477, over 21737.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2978, pruned_loss=0.07538, over 4260058.01 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:33:31,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2026806.0, ans=0.0 2023-06-25 16:34:39,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2026926.0, ans=0.2 2023-06-25 16:34:53,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2026986.0, ans=0.2 2023-06-25 16:34:59,624 INFO [train.py:996] (3/4) Epoch 12, batch 2400, loss[loss=0.2613, simple_loss=0.3276, pruned_loss=0.09748, over 21298.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3023, pruned_loss=0.07796, over 4267406.49 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:35:27,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2027106.0, ans=0.125 2023-06-25 16:35:43,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-06-25 16:36:26,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2027226.0, ans=0.0 2023-06-25 16:36:28,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.583e+02 1.300e+03 1.930e+03 5.128e+03, threshold=2.600e+03, percent-clipped=13.0 2023-06-25 16:36:47,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2027286.0, ans=0.125 2023-06-25 16:37:04,197 INFO [train.py:996] (3/4) Epoch 12, batch 2450, loss[loss=0.2413, simple_loss=0.3156, pruned_loss=0.08347, over 21882.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3081, pruned_loss=0.07987, over 4269637.57 frames. ], batch size: 317, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:37:20,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2027406.0, ans=10.0 2023-06-25 16:38:22,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2027526.0, ans=0.0 2023-06-25 16:38:39,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2027586.0, ans=0.2 2023-06-25 16:38:41,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2027586.0, ans=0.0 2023-06-25 16:38:41,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2027586.0, ans=10.0 2023-06-25 16:38:54,286 INFO [train.py:996] (3/4) Epoch 12, batch 2500, loss[loss=0.23, simple_loss=0.2853, pruned_loss=0.08732, over 21257.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.306, pruned_loss=0.07929, over 4276701.70 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:40:11,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.584e+02 1.091e+03 1.592e+03 2.289e+03 5.240e+03, threshold=3.184e+03, percent-clipped=19.0 2023-06-25 16:40:44,908 INFO [train.py:996] (3/4) Epoch 12, batch 2550, loss[loss=0.2362, simple_loss=0.3066, pruned_loss=0.0829, over 21570.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3044, pruned_loss=0.0785, over 4275519.59 frames. ], batch size: 391, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:41:14,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2028006.0, ans=0.0 2023-06-25 16:41:35,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2028066.0, ans=0.125 2023-06-25 16:42:25,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2028246.0, ans=0.125 2023-06-25 16:42:26,387 INFO [train.py:996] (3/4) Epoch 12, batch 2600, loss[loss=0.2625, simple_loss=0.3843, pruned_loss=0.07035, over 19780.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3065, pruned_loss=0.07954, over 4265444.79 frames. 
], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:42:41,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2028246.0, ans=10.0 2023-06-25 16:43:03,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2028306.0, ans=0.0 2023-06-25 16:43:19,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2028366.0, ans=0.0 2023-06-25 16:43:32,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2028366.0, ans=0.0 2023-06-25 16:43:48,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2028426.0, ans=0.0 2023-06-25 16:43:49,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.618e+02 9.833e+02 1.370e+03 2.330e+03 4.697e+03, threshold=2.739e+03, percent-clipped=12.0 2023-06-25 16:44:25,795 INFO [train.py:996] (3/4) Epoch 12, batch 2650, loss[loss=0.2223, simple_loss=0.3093, pruned_loss=0.06767, over 21852.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3073, pruned_loss=0.08079, over 4278887.03 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:44:26,088 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:44:26,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-25 16:45:26,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 16:45:59,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-25 16:46:18,778 INFO [train.py:996] (3/4) Epoch 12, batch 2700, loss[loss=0.2098, simple_loss=0.2884, pruned_loss=0.06558, over 21848.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3069, pruned_loss=0.0797, over 4279577.90 frames. ], batch size: 333, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:47:03,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2028966.0, ans=0.2 2023-06-25 16:47:35,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.499e+02 8.174e+02 1.316e+03 1.880e+03 3.948e+03, threshold=2.631e+03, percent-clipped=11.0 2023-06-25 16:47:45,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-25 16:48:09,310 INFO [train.py:996] (3/4) Epoch 12, batch 2750, loss[loss=0.2217, simple_loss=0.2838, pruned_loss=0.0798, over 21525.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3067, pruned_loss=0.0798, over 4271026.11 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:48:57,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-25 16:50:00,320 INFO [train.py:996] (3/4) Epoch 12, batch 2800, loss[loss=0.2408, simple_loss=0.3352, pruned_loss=0.07322, over 21789.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3123, pruned_loss=0.08125, over 4268788.86 frames. 
], batch size: 298, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:50:12,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=2029446.0, ans=0.05 2023-06-25 16:50:14,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2029446.0, ans=0.125 2023-06-25 16:50:16,006 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:50:48,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-25 16:51:26,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.005e+03 1.362e+03 2.253e+03 4.999e+03, threshold=2.724e+03, percent-clipped=18.0 2023-06-25 16:51:53,143 INFO [train.py:996] (3/4) Epoch 12, batch 2850, loss[loss=0.2784, simple_loss=0.3499, pruned_loss=0.1034, over 21634.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3127, pruned_loss=0.08187, over 4274110.38 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:52:54,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2029866.0, ans=0.0 2023-06-25 16:53:24,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-25 16:53:38,278 INFO [train.py:996] (3/4) Epoch 12, batch 2900, loss[loss=0.2248, simple_loss=0.3001, pruned_loss=0.07472, over 21821.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3111, pruned_loss=0.08155, over 4271490.17 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:54:50,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2030226.0, ans=0.0 2023-06-25 16:54:52,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030226.0, ans=0.1 2023-06-25 16:54:56,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.061e+02 9.423e+02 1.341e+03 2.226e+03 4.607e+03, threshold=2.681e+03, percent-clipped=12.0 2023-06-25 16:55:24,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030286.0, ans=0.125 2023-06-25 16:55:28,735 INFO [train.py:996] (3/4) Epoch 12, batch 2950, loss[loss=0.2242, simple_loss=0.2948, pruned_loss=0.07674, over 21707.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3119, pruned_loss=0.08162, over 4282798.66 frames. 
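], batch size: 112, lr: 2.46e-03, grad_scale: 16.0

The three loss fields in each train.py entry are mutually consistent with a fixed linear mix of the two transducer losses, loss = 0.5 * simple_loss + 1.0 * pruned_loss; the same relation holds for the running tot_loss fields. These weights are read off the logged numbers themselves and presumably differ during the early warm-up phase. Checking against the batch 2900 entry above:

    # loss = 0.5 * simple_loss + 1.0 * pruned_loss, checked on batch 2900:
    # loss[loss=0.2248, simple_loss=0.3001, pruned_loss=0.07472, ...]
    simple_loss, pruned_loss = 0.3001, 0.07472
    loss = 0.5 * simple_loss + 1.0 * pruned_loss
    print(f"{loss:.4f}")   # 0.2248, matching the logged value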
2023-06-25 16:55:50,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030406.0, ans=0.125 2023-06-25 16:56:02,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2030406.0, ans=0.125 2023-06-25 16:56:43,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2030526.0, ans=0.2 2023-06-25 16:56:52,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2030526.0, ans=0.125 2023-06-25 16:56:57,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-25 16:57:14,396 INFO [train.py:996] (3/4) Epoch 12, batch 3000, loss[loss=0.2737, simple_loss=0.3427, pruned_loss=0.1024, over 21243.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3169, pruned_loss=0.08257, over 4280810.74 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:57:14,397 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 16:57:41,087 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2513, simple_loss=0.3439, pruned_loss=0.07939, over 1796401.00 frames. 2023-06-25 16:57:41,089 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 16:57:48,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2030646.0, ans=0.04949747468305833 2023-06-25 16:58:52,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.014e+02 9.167e+02 1.270e+03 1.811e+03 4.329e+03, threshold=2.541e+03, percent-clipped=6.0 2023-06-25 16:59:03,575 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:59:03,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030886.0, ans=0.125 2023-06-25 16:59:10,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2030886.0, ans=0.125 2023-06-25 16:59:25,814 INFO [train.py:996] (3/4) Epoch 12, batch 3050, loss[loss=0.2152, simple_loss=0.2939, pruned_loss=0.06827, over 21862.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3165, pruned_loss=0.0806, over 4287060.98 frames.
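], batch size: 282, lr: 2.46e-03, grad_scale: 16.0

The trainer periodically pauses for a validation pass, as in the batch 3000 entry above (validation: loss=0.2513 over 1796401.00 frames, with the same 0.5/1.0 loss mix holding: 0.5 * 0.3439 + 0.07939 = 0.2513), and then reports the high-water mark of GPU memory. That memory line maps onto torch.cuda.max_memory_allocated(), a stock PyTorch call; the loop below is an illustrative sketch rather than the script's exact code, and the criterion signature is assumed:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, criterion, device):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in valid_loader:
            loss, num_frames = criterion(model, batch, device)  # assumed API
            tot_loss += loss.item()
            tot_frames += num_frames
        model.train()
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4f}; "
              f"Maximum memory allocated so far is {mem_mb}MB")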
2023-06-25 16:59:34,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2030946.0, ans=0.0 2023-06-25 16:59:52,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2031006.0, ans=0.0 2023-06-25 17:00:37,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2031126.0, ans=0.125 2023-06-25 17:00:42,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2031126.0, ans=0.2 2023-06-25 17:01:06,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031186.0, ans=0.1 2023-06-25 17:01:18,204 INFO [train.py:996] (3/4) Epoch 12, batch 3100, loss[loss=0.2303, simple_loss=0.2995, pruned_loss=0.08051, over 21714.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3152, pruned_loss=0.07926, over 4285426.30 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:01:39,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2031306.0, ans=0.125 2023-06-25 17:01:58,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2031366.0, ans=0.0 2023-06-25 17:02:17,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:02:25,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.476e+02 8.628e+02 1.556e+03 2.270e+03 3.749e+03, threshold=3.112e+03, percent-clipped=16.0 2023-06-25 17:03:06,701 INFO [train.py:996] (3/4) Epoch 12, batch 3150, loss[loss=0.3416, simple_loss=0.3958, pruned_loss=0.1437, over 21434.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3167, pruned_loss=0.08035, over 4279741.37 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:03:07,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2031546.0, ans=0.125 2023-06-25 17:03:47,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2031606.0, ans=0.0 2023-06-25 17:03:51,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2031666.0, ans=15.0 2023-06-25 17:04:21,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-25 17:04:42,475 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:04:53,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2031786.0, ans=0.2 2023-06-25 17:05:01,250 INFO [train.py:996] (3/4) Epoch 12, batch 3200, loss[loss=0.2454, simple_loss=0.3366, pruned_loss=0.07707, over 21695.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.316, pruned_loss=0.07975, over 4276421.36 frames.
], batch size: 414, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:06:07,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2031966.0, ans=0.09899494936611666 2023-06-25 17:06:21,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.588e+02 9.417e+02 1.305e+03 2.002e+03 3.314e+03, threshold=2.610e+03, percent-clipped=4.0 2023-06-25 17:06:45,584 INFO [train.py:996] (3/4) Epoch 12, batch 3250, loss[loss=0.2214, simple_loss=0.29, pruned_loss=0.07638, over 21392.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3177, pruned_loss=0.0808, over 4274515.04 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:07:00,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2032146.0, ans=0.2 2023-06-25 17:08:01,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 17:08:38,538 INFO [train.py:996] (3/4) Epoch 12, batch 3300, loss[loss=0.2047, simple_loss=0.2896, pruned_loss=0.05997, over 21565.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3155, pruned_loss=0.08116, over 4270933.19 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:08:43,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2032446.0, ans=0.1 2023-06-25 17:08:47,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2032446.0, ans=0.125 2023-06-25 17:09:56,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.857e+02 1.384e+03 2.131e+03 4.581e+03, threshold=2.768e+03, percent-clipped=14.0 2023-06-25 17:09:56,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032626.0, ans=0.1 2023-06-25 17:10:28,057 INFO [train.py:996] (3/4) Epoch 12, batch 3350, loss[loss=0.2359, simple_loss=0.3192, pruned_loss=0.07632, over 21437.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3175, pruned_loss=0.08176, over 4268879.42 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:11:06,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2032806.0, ans=0.0 2023-06-25 17:11:06,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2032806.0, ans=0.2 2023-06-25 17:11:56,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2032926.0, ans=0.1 2023-06-25 17:12:17,377 INFO [train.py:996] (3/4) Epoch 12, batch 3400, loss[loss=0.229, simple_loss=0.3063, pruned_loss=0.07588, over 21856.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3174, pruned_loss=0.08255, over 4278323.43 frames. 
], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:12:39,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2033106.0, ans=0.2 2023-06-25 17:13:28,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=2033226.0, ans=0.02 2023-06-25 17:13:38,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2033226.0, ans=15.0 2023-06-25 17:13:48,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.099e+02 1.349e+03 1.796e+03 3.997e+03, threshold=2.698e+03, percent-clipped=5.0 2023-06-25 17:14:13,562 INFO [train.py:996] (3/4) Epoch 12, batch 3450, loss[loss=0.2664, simple_loss=0.3413, pruned_loss=0.09572, over 21845.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3116, pruned_loss=0.08193, over 4283788.82 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:14:48,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2033406.0, ans=0.1 2023-06-25 17:15:06,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2033466.0, ans=0.0 2023-06-25 17:15:43,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2033526.0, ans=0.125 2023-06-25 17:16:10,306 INFO [train.py:996] (3/4) Epoch 12, batch 3500, loss[loss=0.3909, simple_loss=0.4662, pruned_loss=0.1578, over 21461.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.319, pruned_loss=0.08513, over 4282948.72 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:16:57,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2033766.0, ans=0.2 2023-06-25 17:17:24,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2033826.0, ans=0.5 2023-06-25 17:17:32,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.249e+02 1.085e+03 1.477e+03 2.105e+03 4.608e+03, threshold=2.953e+03, percent-clipped=10.0 2023-06-25 17:17:41,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-25 17:17:53,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2033886.0, ans=0.05 2023-06-25 17:18:17,196 INFO [train.py:996] (3/4) Epoch 12, batch 3550, loss[loss=0.2582, simple_loss=0.3444, pruned_loss=0.08596, over 21676.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.324, pruned_loss=0.08677, over 4276703.94 frames. 
], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:18:28,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2033946.0, ans=0.125 2023-06-25 17:18:35,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034006.0, ans=0.0 2023-06-25 17:18:39,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2034006.0, ans=0.0 2023-06-25 17:18:56,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2034066.0, ans=0.0 2023-06-25 17:19:48,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034186.0, ans=0.1 2023-06-25 17:19:59,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2034186.0, ans=0.07 2023-06-25 17:20:12,176 INFO [train.py:996] (3/4) Epoch 12, batch 3600, loss[loss=0.2338, simple_loss=0.3005, pruned_loss=0.08358, over 21197.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3186, pruned_loss=0.08636, over 4276603.54 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:20:16,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 17:20:28,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2034306.0, ans=0.2 2023-06-25 17:20:41,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2034306.0, ans=0.125 2023-06-25 17:20:42,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2034306.0, ans=0.125 2023-06-25 17:20:57,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2034366.0, ans=0.0 2023-06-25 17:21:25,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2034426.0, ans=0.125 2023-06-25 17:21:27,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 9.639e+02 1.612e+03 2.381e+03 4.879e+03, threshold=3.225e+03, percent-clipped=14.0 2023-06-25 17:21:39,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2034486.0, ans=0.2 2023-06-25 17:21:58,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-25 17:22:02,913 INFO [train.py:996] (3/4) Epoch 12, batch 3650, loss[loss=0.2031, simple_loss=0.2844, pruned_loss=0.06093, over 21756.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3183, pruned_loss=0.08638, over 4280423.00 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:22:09,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=22.5 2023-06-25 17:22:25,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2034606.0, ans=0.125 2023-06-25 17:22:30,685 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:23:51,396 INFO [train.py:996] (3/4) Epoch 12, batch 3700, loss[loss=0.2233, simple_loss=0.3026, pruned_loss=0.07195, over 21322.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3157, pruned_loss=0.08421, over 4279515.54 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:24:05,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-25 17:24:10,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2034906.0, ans=0.0 2023-06-25 17:24:52,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-25 17:25:05,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.433e+02 1.159e+03 1.645e+03 2.871e+03, threshold=2.319e+03, percent-clipped=0.0 2023-06-25 17:25:07,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2035026.0, ans=0.125 2023-06-25 17:25:39,732 INFO [train.py:996] (3/4) Epoch 12, batch 3750, loss[loss=0.2136, simple_loss=0.3021, pruned_loss=0.06255, over 19930.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3137, pruned_loss=0.08298, over 4280472.97 frames. ], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:26:11,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-25 17:26:28,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2035266.0, ans=0.0 2023-06-25 17:26:46,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2035326.0, ans=0.125 2023-06-25 17:26:48,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2035326.0, ans=0.0 2023-06-25 17:27:31,414 INFO [train.py:996] (3/4) Epoch 12, batch 3800, loss[loss=0.3285, simple_loss=0.3776, pruned_loss=0.1397, over 21404.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3112, pruned_loss=0.08108, over 4287693.02 frames. ], batch size: 509, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:27:39,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2035446.0, ans=0.1 2023-06-25 17:29:02,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 9.532e+02 1.366e+03 2.202e+03 4.372e+03, threshold=2.732e+03, percent-clipped=24.0 2023-06-25 17:29:09,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2035686.0, ans=0.125 2023-06-25 17:29:14,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=22.5 2023-06-25 17:29:24,662 INFO [train.py:996] (3/4) Epoch 12, batch 3850, loss[loss=0.1966, simple_loss=0.262, pruned_loss=0.06562, over 21184.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3098, pruned_loss=0.08169, over 4283604.20 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:29:25,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2035746.0, ans=0.0 2023-06-25 17:29:39,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2035746.0, ans=0.0 2023-06-25 17:30:19,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=22.5 2023-06-25 17:30:42,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2035926.0, ans=0.0 2023-06-25 17:31:15,950 INFO [train.py:996] (3/4) Epoch 12, batch 3900, loss[loss=0.2104, simple_loss=0.2808, pruned_loss=0.07004, over 21815.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3053, pruned_loss=0.08079, over 4275814.80 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:31:23,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2036046.0, ans=0.0 2023-06-25 17:31:32,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2036106.0, ans=0.0 2023-06-25 17:32:10,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-25 17:32:26,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-25 17:32:40,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2036226.0, ans=0.07 2023-06-25 17:32:41,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.422e+02 7.358e+02 1.058e+03 1.629e+03 3.913e+03, threshold=2.115e+03, percent-clipped=2.0 2023-06-25 17:33:01,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2036286.0, ans=0.015 2023-06-25 17:33:04,645 INFO [train.py:996] (3/4) Epoch 12, batch 3950, loss[loss=0.1838, simple_loss=0.2749, pruned_loss=0.0464, over 21730.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3069, pruned_loss=0.07973, over 4273193.43 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:34:56,632 INFO [train.py:996] (3/4) Epoch 12, batch 4000, loss[loss=0.1923, simple_loss=0.2612, pruned_loss=0.06165, over 21643.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3, pruned_loss=0.07669, over 4267317.51 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:34:58,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.66 vs. 
limit=22.5 2023-06-25 17:35:07,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2036646.0, ans=0.125 2023-06-25 17:35:25,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2036706.0, ans=0.125 2023-06-25 17:35:54,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-25 17:36:31,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 9.183e+02 1.353e+03 2.384e+03 4.707e+03, threshold=2.707e+03, percent-clipped=29.0 2023-06-25 17:36:51,616 INFO [train.py:996] (3/4) Epoch 12, batch 4050, loss[loss=0.2094, simple_loss=0.2997, pruned_loss=0.05957, over 21770.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3, pruned_loss=0.07531, over 4274698.20 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:37:33,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2037066.0, ans=0.125 2023-06-25 17:37:49,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037066.0, ans=0.1 2023-06-25 17:37:54,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2037126.0, ans=0.04949747468305833 2023-06-25 17:38:23,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2037126.0, ans=0.0 2023-06-25 17:38:29,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2037186.0, ans=0.0 2023-06-25 17:38:37,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-25 17:38:44,033 INFO [train.py:996] (3/4) Epoch 12, batch 4100, loss[loss=0.2203, simple_loss=0.2954, pruned_loss=0.07263, over 21275.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3016, pruned_loss=0.07581, over 4284344.65 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:39:25,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2037366.0, ans=0.125 2023-06-25 17:40:04,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2037426.0, ans=0.125 2023-06-25 17:40:13,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-25 17:40:15,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.136e+02 8.322e+02 1.179e+03 1.731e+03 4.243e+03, threshold=2.358e+03, percent-clipped=9.0 2023-06-25 17:40:37,172 INFO [train.py:996] (3/4) Epoch 12, batch 4150, loss[loss=0.2437, simple_loss=0.309, pruned_loss=0.08926, over 20022.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3021, pruned_loss=0.07377, over 4266109.24 frames. 
], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:41:55,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2037726.0, ans=0.125 2023-06-25 17:42:35,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=22.5 2023-06-25 17:42:39,756 INFO [train.py:996] (3/4) Epoch 12, batch 4200, loss[loss=0.2052, simple_loss=0.2701, pruned_loss=0.07011, over 21287.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3046, pruned_loss=0.0745, over 4259697.22 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:42:53,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2037846.0, ans=0.125 2023-06-25 17:43:16,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2037906.0, ans=22.5 2023-06-25 17:43:58,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.587e+02 1.298e+03 1.935e+03 2.585e+03 6.035e+03, threshold=3.870e+03, percent-clipped=37.0 2023-06-25 17:44:26,653 INFO [train.py:996] (3/4) Epoch 12, batch 4250, loss[loss=0.2966, simple_loss=0.3727, pruned_loss=0.1102, over 21411.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3094, pruned_loss=0.07604, over 4257779.21 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:45:08,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-25 17:45:20,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 17:45:21,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2038266.0, ans=0.2 2023-06-25 17:45:30,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2038266.0, ans=0.0 2023-06-25 17:46:03,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2038386.0, ans=0.125 2023-06-25 17:46:20,934 INFO [train.py:996] (3/4) Epoch 12, batch 4300, loss[loss=0.1498, simple_loss=0.1951, pruned_loss=0.05229, over 17344.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3136, pruned_loss=0.07681, over 4249563.95 frames. ], batch size: 62, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:47:12,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2038566.0, ans=0.0 2023-06-25 17:47:49,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.415e+02 1.013e+03 1.560e+03 2.387e+03 5.571e+03, threshold=3.121e+03, percent-clipped=6.0 2023-06-25 17:48:02,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2038686.0, ans=0.0 2023-06-25 17:48:21,416 INFO [train.py:996] (3/4) Epoch 12, batch 4350, loss[loss=0.1949, simple_loss=0.2567, pruned_loss=0.06651, over 21563.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3134, pruned_loss=0.07681, over 4252423.15 frames. 
], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:48:28,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2038746.0, ans=0.0 2023-06-25 17:48:44,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2038806.0, ans=0.125 2023-06-25 17:50:07,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2038986.0, ans=0.0 2023-06-25 17:50:21,649 INFO [train.py:996] (3/4) Epoch 12, batch 4400, loss[loss=0.2469, simple_loss=0.3367, pruned_loss=0.0786, over 21574.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3098, pruned_loss=0.07659, over 4258999.03 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:51:00,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-25 17:51:18,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2039166.0, ans=0.0 2023-06-25 17:51:49,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.673e+02 1.616e+03 2.348e+03 4.188e+03, threshold=3.231e+03, percent-clipped=7.0 2023-06-25 17:52:17,308 INFO [train.py:996] (3/4) Epoch 12, batch 4450, loss[loss=0.2935, simple_loss=0.3985, pruned_loss=0.09428, over 21261.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.317, pruned_loss=0.0781, over 4260128.82 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:52:40,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2039406.0, ans=0.125 2023-06-25 17:52:49,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-25 17:52:53,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2039406.0, ans=0.125 2023-06-25 17:53:28,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2039526.0, ans=0.125 2023-06-25 17:53:47,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039526.0, ans=0.1 2023-06-25 17:54:08,229 INFO [train.py:996] (3/4) Epoch 12, batch 4500, loss[loss=0.2298, simple_loss=0.3206, pruned_loss=0.06955, over 21751.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3186, pruned_loss=0.07996, over 4264095.37 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:55:02,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-25 17:55:41,737 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.020e+02 9.450e+02 1.326e+03 2.168e+03 4.753e+03, threshold=2.653e+03, percent-clipped=7.0 2023-06-25 17:56:06,445 INFO [train.py:996] (3/4) Epoch 12, batch 4550, loss[loss=0.265, simple_loss=0.3439, pruned_loss=0.09308, over 21932.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3207, pruned_loss=0.07949, over 4263163.62 frames. 
], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:56:24,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2040006.0, ans=0.125 2023-06-25 17:56:50,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-25 17:57:21,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-25 17:57:52,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2040186.0, ans=0.125 2023-06-25 17:57:54,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2040186.0, ans=0.125 2023-06-25 17:57:59,514 INFO [train.py:996] (3/4) Epoch 12, batch 4600, loss[loss=0.2485, simple_loss=0.3233, pruned_loss=0.08681, over 21330.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3217, pruned_loss=0.08026, over 4266668.90 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:58:41,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2040366.0, ans=0.0 2023-06-25 17:59:14,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2040426.0, ans=0.0 2023-06-25 17:59:34,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.831e+02 1.036e+03 1.391e+03 1.906e+03 3.939e+03, threshold=2.783e+03, percent-clipped=3.0 2023-06-25 17:59:53,319 INFO [train.py:996] (3/4) Epoch 12, batch 4650, loss[loss=0.1858, simple_loss=0.2654, pruned_loss=0.05309, over 21782.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3167, pruned_loss=0.07918, over 4277328.63 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:59:58,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2040546.0, ans=0.015 2023-06-25 18:01:11,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-25 18:01:47,017 INFO [train.py:996] (3/4) Epoch 12, batch 4700, loss[loss=0.2534, simple_loss=0.3704, pruned_loss=0.06819, over 20804.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.31, pruned_loss=0.07744, over 4266867.90 frames. ], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:01:58,540 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:02:48,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2040966.0, ans=0.125 2023-06-25 18:03:19,760 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.793e+02 9.702e+02 1.366e+03 2.356e+03 4.995e+03, threshold=2.732e+03, percent-clipped=18.0 2023-06-25 18:03:39,167 INFO [train.py:996] (3/4) Epoch 12, batch 4750, loss[loss=0.2379, simple_loss=0.3064, pruned_loss=0.08469, over 21818.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3053, pruned_loss=0.07707, over 4262569.03 frames. 
], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:04:35,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2041266.0, ans=0.2 2023-06-25 18:04:57,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2041326.0, ans=0.125 2023-06-25 18:05:07,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2041326.0, ans=0.0 2023-06-25 18:05:24,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2041386.0, ans=0.0 2023-06-25 18:05:29,013 INFO [train.py:996] (3/4) Epoch 12, batch 4800, loss[loss=0.2482, simple_loss=0.3219, pruned_loss=0.08726, over 21871.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3065, pruned_loss=0.07831, over 4272314.30 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 18:06:01,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-25 18:06:27,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041566.0, ans=0.1 2023-06-25 18:06:38,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2041626.0, ans=0.125 2023-06-25 18:06:40,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2041626.0, ans=0.0 2023-06-25 18:06:56,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.862e+02 9.240e+02 1.237e+03 1.888e+03 3.806e+03, threshold=2.475e+03, percent-clipped=7.0 2023-06-25 18:07:05,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2041686.0, ans=0.125 2023-06-25 18:07:13,719 INFO [train.py:996] (3/4) Epoch 12, batch 4850, loss[loss=0.2549, simple_loss=0.3773, pruned_loss=0.06627, over 20799.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3059, pruned_loss=0.07734, over 4271238.81 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:07:35,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-06-25 18:08:09,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2041866.0, ans=0.2 2023-06-25 18:08:18,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2041866.0, ans=0.125 2023-06-25 18:08:23,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-25 18:08:31,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2041926.0, ans=0.125 2023-06-25 18:09:06,940 INFO [train.py:996] (3/4) Epoch 12, batch 4900, loss[loss=0.2528, simple_loss=0.3528, pruned_loss=0.07636, over 19876.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3071, pruned_loss=0.07697, over 4271494.66 frames. 
], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:09:34,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2042106.0, ans=0.125 2023-06-25 18:09:39,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-25 18:10:33,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2042226.0, ans=0.125 2023-06-25 18:10:36,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 9.394e+02 1.371e+03 2.232e+03 4.474e+03, threshold=2.741e+03, percent-clipped=21.0 2023-06-25 18:10:48,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2042286.0, ans=0.07 2023-06-25 18:10:55,730 INFO [train.py:996] (3/4) Epoch 12, batch 4950, loss[loss=0.1944, simple_loss=0.2627, pruned_loss=0.06304, over 21792.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.31, pruned_loss=0.07614, over 4273595.06 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:11:45,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2042466.0, ans=0.125 2023-06-25 18:12:40,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-25 18:12:41,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2042586.0, ans=0.125 2023-06-25 18:12:45,005 INFO [train.py:996] (3/4) Epoch 12, batch 5000, loss[loss=0.2676, simple_loss=0.3834, pruned_loss=0.07592, over 20751.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3099, pruned_loss=0.07361, over 4275131.43 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:13:01,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-25 18:13:04,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2042646.0, ans=0.125 2023-06-25 18:13:37,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-25 18:14:04,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2042826.0, ans=0.125 2023-06-25 18:14:19,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 9.354e+02 1.620e+03 2.183e+03 4.157e+03, threshold=3.240e+03, percent-clipped=13.0 2023-06-25 18:14:34,559 INFO [train.py:996] (3/4) Epoch 12, batch 5050, loss[loss=0.2681, simple_loss=0.3195, pruned_loss=0.1084, over 21783.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3107, pruned_loss=0.07571, over 4278138.70 frames. 
], batch size: 508, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:14:58,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2043006.0, ans=0.2 2023-06-25 18:14:58,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2043006.0, ans=0.0 2023-06-25 18:16:24,374 INFO [train.py:996] (3/4) Epoch 12, batch 5100, loss[loss=0.2084, simple_loss=0.2925, pruned_loss=0.06217, over 21373.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3082, pruned_loss=0.07628, over 4287394.94 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:16:40,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2043246.0, ans=0.125 2023-06-25 18:17:16,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2043366.0, ans=0.2 2023-06-25 18:17:32,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2043366.0, ans=0.2 2023-06-25 18:17:55,856 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:17:58,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2043486.0, ans=0.125 2023-06-25 18:18:00,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.858e+02 8.212e+02 1.178e+03 1.481e+03 2.753e+03, threshold=2.355e+03, percent-clipped=0.0 2023-06-25 18:18:09,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2043486.0, ans=0.125 2023-06-25 18:18:15,849 INFO [train.py:996] (3/4) Epoch 12, batch 5150, loss[loss=0.2147, simple_loss=0.2796, pruned_loss=0.07495, over 20061.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3058, pruned_loss=0.07632, over 4290047.87 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:18:17,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2043546.0, ans=0.125 2023-06-25 18:18:21,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2043546.0, ans=0.0 2023-06-25 18:18:30,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2043546.0, ans=0.125 2023-06-25 18:18:43,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2043606.0, ans=0.125 2023-06-25 18:18:48,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 18:19:54,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2043786.0, ans=0.125 2023-06-25 18:20:01,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2043786.0, ans=0.2 2023-06-25 18:20:06,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-25 18:20:10,106 INFO [train.py:996] (3/4) Epoch 12, batch 5200, loss[loss=0.2507, simple_loss=0.3568, pruned_loss=0.07231, over 21660.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3082, pruned_loss=0.07768, over 4292110.31 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:21:35,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.53 vs. limit=6.0 2023-06-25 18:21:37,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2044026.0, ans=0.125 2023-06-25 18:21:44,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.871e+02 1.084e+03 1.828e+03 2.790e+03 5.969e+03, threshold=3.657e+03, percent-clipped=36.0 2023-06-25 18:21:45,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2044086.0, ans=0.125 2023-06-25 18:21:52,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2044086.0, ans=10.0 2023-06-25 18:22:00,072 INFO [train.py:996] (3/4) Epoch 12, batch 5250, loss[loss=0.2246, simple_loss=0.3151, pruned_loss=0.06704, over 21786.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3127, pruned_loss=0.07654, over 4292508.04 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:23:24,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2044326.0, ans=0.125 2023-06-25 18:23:38,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2044386.0, ans=0.125 2023-06-25 18:23:52,650 INFO [train.py:996] (3/4) Epoch 12, batch 5300, loss[loss=0.226, simple_loss=0.2939, pruned_loss=0.07908, over 21831.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3121, pruned_loss=0.07721, over 4285775.69 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:23:58,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2044446.0, ans=0.2 2023-06-25 18:24:37,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-25 18:25:15,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 18:25:20,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.228e+02 8.444e+02 1.453e+03 2.194e+03 4.490e+03, threshold=2.906e+03, percent-clipped=2.0 2023-06-25 18:25:38,996 INFO [train.py:996] (3/4) Epoch 12, batch 5350, loss[loss=0.1978, simple_loss=0.2633, pruned_loss=0.06609, over 21679.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3097, pruned_loss=0.07812, over 4290883.68 frames. ], batch size: 230, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:26:30,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-25 18:27:19,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. 
limit=22.5 2023-06-25 18:27:23,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2044986.0, ans=0.125 2023-06-25 18:27:28,455 INFO [train.py:996] (3/4) Epoch 12, batch 5400, loss[loss=0.1978, simple_loss=0.2777, pruned_loss=0.05893, over 21843.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3073, pruned_loss=0.07849, over 4285102.86 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:28:13,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2045166.0, ans=0.125 2023-06-25 18:28:17,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=8.0 2023-06-25 18:28:44,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=12.0 2023-06-25 18:29:02,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2045286.0, ans=0.025 2023-06-25 18:29:03,875 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.524e+02 1.484e+03 2.155e+03 3.165e+03, threshold=2.968e+03, percent-clipped=5.0 2023-06-25 18:29:04,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2045286.0, ans=0.125 2023-06-25 18:29:04,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2045286.0, ans=0.125 2023-06-25 18:29:13,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2045286.0, ans=0.0 2023-06-25 18:29:16,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-25 18:29:17,833 INFO [train.py:996] (3/4) Epoch 12, batch 5450, loss[loss=0.2403, simple_loss=0.332, pruned_loss=0.07426, over 21850.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3081, pruned_loss=0.07721, over 4285845.48 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:29:33,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2045346.0, ans=0.0 2023-06-25 18:31:20,593 INFO [train.py:996] (3/4) Epoch 12, batch 5500, loss[loss=0.245, simple_loss=0.334, pruned_loss=0.07796, over 21347.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3128, pruned_loss=0.07454, over 4274419.27 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:32:54,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.103e+02 8.816e+02 1.255e+03 2.207e+03 4.493e+03, threshold=2.511e+03, percent-clipped=9.0 2023-06-25 18:33:15,521 INFO [train.py:996] (3/4) Epoch 12, batch 5550, loss[loss=0.2099, simple_loss=0.3001, pruned_loss=0.05986, over 21684.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3126, pruned_loss=0.0716, over 4272133.01 frames. 
], batch size: 298, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:33:43,018 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:33:50,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2046006.0, ans=0.125 2023-06-25 18:34:11,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2046066.0, ans=0.1 2023-06-25 18:34:49,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2046186.0, ans=0.0 2023-06-25 18:34:54,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2046186.0, ans=0.0 2023-06-25 18:35:15,787 INFO [train.py:996] (3/4) Epoch 12, batch 5600, loss[loss=0.2428, simple_loss=0.3389, pruned_loss=0.07339, over 21773.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3111, pruned_loss=0.06927, over 4274073.02 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:36:01,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2046366.0, ans=0.125 2023-06-25 18:36:43,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 9.096e+02 1.468e+03 2.250e+03 5.132e+03, threshold=2.936e+03, percent-clipped=21.0 2023-06-25 18:37:03,766 INFO [train.py:996] (3/4) Epoch 12, batch 5650, loss[loss=0.2821, simple_loss=0.3422, pruned_loss=0.111, over 21858.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3136, pruned_loss=0.07168, over 4283352.75 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:37:19,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2046606.0, ans=0.125 2023-06-25 18:38:22,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-25 18:38:44,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-25 18:38:53,826 INFO [train.py:996] (3/4) Epoch 12, batch 5700, loss[loss=0.2139, simple_loss=0.2873, pruned_loss=0.07024, over 21214.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3125, pruned_loss=0.07319, over 4283102.80 frames. 
], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:39:26,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046906.0, ans=0.1 2023-06-25 18:39:38,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2046906.0, ans=0.125 2023-06-25 18:40:22,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2047026.0, ans=0.07 2023-06-25 18:40:34,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 7.754e+02 1.090e+03 1.653e+03 4.716e+03, threshold=2.180e+03, percent-clipped=6.0 2023-06-25 18:40:42,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2047086.0, ans=0.0 2023-06-25 18:40:48,800 INFO [train.py:996] (3/4) Epoch 12, batch 5750, loss[loss=0.1866, simple_loss=0.286, pruned_loss=0.04362, over 21647.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3122, pruned_loss=0.07045, over 4277734.45 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:40:57,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2047146.0, ans=0.125 2023-06-25 18:41:03,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2047146.0, ans=0.125 2023-06-25 18:41:20,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=2047206.0, ans=15.0 2023-06-25 18:42:13,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 18:42:22,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2047386.0, ans=0.125 2023-06-25 18:42:26,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-25 18:42:41,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2047386.0, ans=0.125 2023-06-25 18:42:45,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2047446.0, ans=0.0 2023-06-25 18:42:46,097 INFO [train.py:996] (3/4) Epoch 12, batch 5800, loss[loss=0.286, simple_loss=0.3776, pruned_loss=0.0972, over 21661.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3104, pruned_loss=0.06889, over 4276793.92 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:42:52,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-25 18:43:07,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. 
limit=10.0 2023-06-25 18:44:02,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2047626.0, ans=0.125 2023-06-25 18:44:16,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2047686.0, ans=0.125 2023-06-25 18:44:17,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 1.046e+03 1.663e+03 2.546e+03 5.272e+03, threshold=3.326e+03, percent-clipped=34.0 2023-06-25 18:44:42,281 INFO [train.py:996] (3/4) Epoch 12, batch 5850, loss[loss=0.1765, simple_loss=0.2831, pruned_loss=0.03496, over 21762.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3072, pruned_loss=0.06574, over 4276523.93 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:45:14,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2047806.0, ans=0.125 2023-06-25 18:45:17,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2047806.0, ans=0.1 2023-06-25 18:46:01,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2047926.0, ans=0.05 2023-06-25 18:46:01,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-25 18:46:34,288 INFO [train.py:996] (3/4) Epoch 12, batch 5900, loss[loss=0.1707, simple_loss=0.2497, pruned_loss=0.04586, over 21411.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2981, pruned_loss=0.06027, over 4273525.60 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:46:57,225 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:47:33,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2048166.0, ans=0.1 2023-06-25 18:47:33,101 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:48:00,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 8.415e+02 1.286e+03 1.722e+03 5.284e+03, threshold=2.571e+03, percent-clipped=3.0 2023-06-25 18:48:21,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-25 18:48:22,660 INFO [train.py:996] (3/4) Epoch 12, batch 5950, loss[loss=0.2278, simple_loss=0.2862, pruned_loss=0.08471, over 21282.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2994, pruned_loss=0.06407, over 4282541.52 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:49:11,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2048466.0, ans=0.125 2023-06-25 18:50:19,475 INFO [train.py:996] (3/4) Epoch 12, batch 6000, loss[loss=0.1811, simple_loss=0.243, pruned_loss=0.05963, over 21221.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2964, pruned_loss=0.06678, over 4280733.73 frames. 
], batch size: 551, lr: 2.45e-03, grad_scale: 32.0 2023-06-25 18:50:19,476 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 18:50:36,762 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2561, simple_loss=0.3516, pruned_loss=0.08031, over 1796401.00 frames. 2023-06-25 18:50:36,763 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 18:51:35,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2048766.0, ans=10.0 2023-06-25 18:51:49,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-25 18:51:52,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-25 18:52:06,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-25 18:52:13,641 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.382e+02 1.134e+03 1.580e+03 2.203e+03 4.281e+03, threshold=3.160e+03, percent-clipped=13.0 2023-06-25 18:52:21,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2048886.0, ans=0.1 2023-06-25 18:52:25,704 INFO [train.py:996] (3/4) Epoch 12, batch 6050, loss[loss=0.2366, simple_loss=0.2855, pruned_loss=0.09383, over 21406.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2919, pruned_loss=0.06759, over 4277824.63 frames. ], batch size: 476, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:52:40,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2048946.0, ans=0.125 2023-06-25 18:53:04,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2049066.0, ans=0.125 2023-06-25 18:54:04,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2049186.0, ans=0.125 2023-06-25 18:54:14,011 INFO [train.py:996] (3/4) Epoch 12, batch 6100, loss[loss=0.2515, simple_loss=0.3228, pruned_loss=0.09009, over 21786.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.06714, over 4283477.04 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:54:25,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2049246.0, ans=0.125 2023-06-25 18:54:49,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2049306.0, ans=0.0 2023-06-25 18:55:36,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2049426.0, ans=0.2 2023-06-25 18:55:51,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 1.094e+03 1.641e+03 2.565e+03 7.490e+03, threshold=3.281e+03, percent-clipped=16.0 2023-06-25 18:55:58,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2049486.0, ans=0.125 2023-06-25 18:56:03,339 INFO [train.py:996] (3/4) Epoch 12, batch 6150, loss[loss=0.1914, simple_loss=0.3078, pruned_loss=0.0375, over 20852.00 frames. 
], tot_loss[loss=0.216, simple_loss=0.2931, pruned_loss=0.06945, over 4277085.85 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:56:19,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-25 18:56:51,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2049666.0, ans=0.2 2023-06-25 18:57:11,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2049726.0, ans=0.125 2023-06-25 18:57:56,188 INFO [train.py:996] (3/4) Epoch 12, batch 6200, loss[loss=0.1727, simple_loss=0.2529, pruned_loss=0.04622, over 21520.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2961, pruned_loss=0.06988, over 4280938.96 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:59:13,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-25 18:59:16,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050026.0, ans=0.1 2023-06-25 18:59:28,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 1.039e+03 1.350e+03 2.072e+03 4.321e+03, threshold=2.700e+03, percent-clipped=2.0 2023-06-25 18:59:30,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2050086.0, ans=0.125 2023-06-25 18:59:45,846 INFO [train.py:996] (3/4) Epoch 12, batch 6250, loss[loss=0.1941, simple_loss=0.2944, pruned_loss=0.0469, over 21717.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3003, pruned_loss=0.06928, over 4280329.48 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:00:24,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2050266.0, ans=0.2 2023-06-25 19:00:44,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2050266.0, ans=0.125 2023-06-25 19:01:34,286 INFO [train.py:996] (3/4) Epoch 12, batch 6300, loss[loss=0.2175, simple_loss=0.2903, pruned_loss=0.07238, over 21867.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3033, pruned_loss=0.06814, over 4282018.69 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:01:55,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.54 vs. limit=5.0 2023-06-25 19:03:12,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 9.742e+02 1.547e+03 2.074e+03 3.988e+03, threshold=3.093e+03, percent-clipped=9.0 2023-06-25 19:03:22,522 INFO [train.py:996] (3/4) Epoch 12, batch 6350, loss[loss=0.2855, simple_loss=0.3579, pruned_loss=0.1065, over 21407.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3058, pruned_loss=0.07171, over 4285208.85 frames. 
], batch size: 131, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:03:36,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2050746.0, ans=0.125 2023-06-25 19:04:07,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2050806.0, ans=0.1 2023-06-25 19:04:18,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2050866.0, ans=0.0 2023-06-25 19:04:27,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2050866.0, ans=0.0 2023-06-25 19:04:48,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-25 19:05:15,811 INFO [train.py:996] (3/4) Epoch 12, batch 6400, loss[loss=0.266, simple_loss=0.3368, pruned_loss=0.09758, over 21530.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3125, pruned_loss=0.07666, over 4286208.48 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:06:23,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2051166.0, ans=0.2 2023-06-25 19:06:34,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2051226.0, ans=0.05 2023-06-25 19:06:46,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 19:06:54,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.078e+02 9.080e+02 1.264e+03 1.614e+03 4.055e+03, threshold=2.529e+03, percent-clipped=6.0 2023-06-25 19:07:09,036 INFO [train.py:996] (3/4) Epoch 12, batch 6450, loss[loss=0.2381, simple_loss=0.3297, pruned_loss=0.07327, over 21616.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3155, pruned_loss=0.07698, over 4290337.31 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:07:29,967 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:07:46,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2051406.0, ans=0.0 2023-06-25 19:08:51,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-25 19:08:52,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2051586.0, ans=0.1 2023-06-25 19:08:59,114 INFO [train.py:996] (3/4) Epoch 12, batch 6500, loss[loss=0.1613, simple_loss=0.2344, pruned_loss=0.04405, over 21545.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3064, pruned_loss=0.07534, over 4284227.57 frames. 
], batch size: 231, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:08:59,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2051646.0, ans=0.0 2023-06-25 19:09:43,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2051706.0, ans=0.125 2023-06-25 19:09:54,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-25 19:10:04,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.31 vs. limit=12.0 2023-06-25 19:10:23,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-25 19:10:35,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 7.750e+02 1.055e+03 1.738e+03 4.061e+03, threshold=2.109e+03, percent-clipped=10.0 2023-06-25 19:10:51,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2051946.0, ans=0.125 2023-06-25 19:10:52,970 INFO [train.py:996] (3/4) Epoch 12, batch 6550, loss[loss=0.2063, simple_loss=0.2886, pruned_loss=0.06204, over 21787.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.305, pruned_loss=0.07427, over 4281728.01 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:11:12,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2051946.0, ans=0.0 2023-06-25 19:11:14,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2051946.0, ans=0.125 2023-06-25 19:11:33,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2052006.0, ans=0.0 2023-06-25 19:11:40,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0 2023-06-25 19:12:42,056 INFO [train.py:996] (3/4) Epoch 12, batch 6600, loss[loss=0.2508, simple_loss=0.3035, pruned_loss=0.09905, over 21645.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2998, pruned_loss=0.07436, over 4261066.57 frames. ], batch size: 416, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:13:42,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2052366.0, ans=0.2 2023-06-25 19:14:19,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=2052486.0, ans=0.5 2023-06-25 19:14:19,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2052486.0, ans=0.05 2023-06-25 19:14:22,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 6.750e+02 1.048e+03 1.422e+03 4.566e+03, threshold=2.096e+03, percent-clipped=9.0 2023-06-25 19:14:35,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2052546.0, ans=0.125 2023-06-25 19:14:36,273 INFO [train.py:996] (3/4) Epoch 12, batch 6650, loss[loss=0.1846, simple_loss=0.2589, pruned_loss=0.05518, over 21264.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.2947, pruned_loss=0.07201, over 4266271.62 frames. ], batch size: 551, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:14:39,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2052546.0, ans=0.2 2023-06-25 19:14:57,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2052606.0, ans=0.2 2023-06-25 19:15:07,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2052606.0, ans=0.125 2023-06-25 19:15:19,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2052666.0, ans=0.09899494936611666 2023-06-25 19:15:28,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2052666.0, ans=0.125 2023-06-25 19:15:32,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2052726.0, ans=0.125 2023-06-25 19:15:34,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-25 19:15:35,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2052726.0, ans=0.1 2023-06-25 19:15:49,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-25 19:15:55,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052786.0, ans=0.1 2023-06-25 19:16:12,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2052786.0, ans=0.125 2023-06-25 19:16:24,149 INFO [train.py:996] (3/4) Epoch 12, batch 6700, loss[loss=0.2179, simple_loss=0.2838, pruned_loss=0.07599, over 21706.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2912, pruned_loss=0.07233, over 4254180.15 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:16:31,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2052846.0, ans=0.0 2023-06-25 19:17:31,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2053026.0, ans=0.95 2023-06-25 19:17:59,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.379e+02 9.118e+02 1.454e+03 2.169e+03 5.767e+03, threshold=2.907e+03, percent-clipped=27.0 2023-06-25 19:18:01,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2053086.0, ans=0.0 2023-06-25 19:18:13,642 INFO [train.py:996] (3/4) Epoch 12, batch 6750, loss[loss=0.1821, simple_loss=0.2527, pruned_loss=0.05578, over 21449.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2912, pruned_loss=0.07322, over 4252791.88 frames. 
], batch size: 212, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:18:28,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2053206.0, ans=0.2 2023-06-25 19:18:31,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2053206.0, ans=0.2 2023-06-25 19:18:38,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2053206.0, ans=0.0 2023-06-25 19:18:40,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2053206.0, ans=0.0 2023-06-25 19:20:02,634 INFO [train.py:996] (3/4) Epoch 12, batch 6800, loss[loss=0.1934, simple_loss=0.2749, pruned_loss=0.05597, over 21094.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2922, pruned_loss=0.07468, over 4257446.63 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:20:03,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2053446.0, ans=0.0 2023-06-25 19:20:55,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-25 19:21:00,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2053626.0, ans=0.0 2023-06-25 19:21:29,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.142e+02 1.053e+03 1.480e+03 2.142e+03 3.474e+03, threshold=2.960e+03, percent-clipped=10.0 2023-06-25 19:21:42,858 INFO [train.py:996] (3/4) Epoch 12, batch 6850, loss[loss=0.2076, simple_loss=0.2735, pruned_loss=0.07083, over 21822.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2895, pruned_loss=0.07517, over 4265397.13 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:21:43,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2053746.0, ans=0.04949747468305833 2023-06-25 19:22:43,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2053866.0, ans=0.2 2023-06-25 19:23:09,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2053986.0, ans=0.125 2023-06-25 19:23:39,644 INFO [train.py:996] (3/4) Epoch 12, batch 6900, loss[loss=0.1755, simple_loss=0.2619, pruned_loss=0.04459, over 21348.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2902, pruned_loss=0.07625, over 4274725.99 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:24:55,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2054226.0, ans=0.125 2023-06-25 19:25:13,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-25 19:25:20,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.318e+02 8.019e+02 1.199e+03 1.523e+03 3.693e+03, threshold=2.398e+03, percent-clipped=1.0 2023-06-25 19:25:28,990 INFO [train.py:996] (3/4) Epoch 12, batch 6950, loss[loss=0.1742, simple_loss=0.2431, pruned_loss=0.05263, over 21799.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2913, pruned_loss=0.07258, over 4274840.10 frames. 
], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:25:34,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2054346.0, ans=0.0 2023-06-25 19:25:36,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2054346.0, ans=0.125 2023-06-25 19:27:09,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2054586.0, ans=0.2 2023-06-25 19:27:17,489 INFO [train.py:996] (3/4) Epoch 12, batch 7000, loss[loss=0.2275, simple_loss=0.2903, pruned_loss=0.0823, over 21889.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2961, pruned_loss=0.07475, over 4274138.19 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:27:22,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2054646.0, ans=0.125 2023-06-25 19:27:39,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2054706.0, ans=0.125 2023-06-25 19:28:05,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2054766.0, ans=0.125 2023-06-25 19:28:23,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2054826.0, ans=0.125 2023-06-25 19:28:34,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2054826.0, ans=0.04949747468305833 2023-06-25 19:28:56,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.600e+02 9.157e+02 1.451e+03 1.908e+03 5.399e+03, threshold=2.901e+03, percent-clipped=14.0 2023-06-25 19:29:05,336 INFO [train.py:996] (3/4) Epoch 12, batch 7050, loss[loss=0.2096, simple_loss=0.3025, pruned_loss=0.0583, over 21744.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2936, pruned_loss=0.07337, over 4266871.44 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:29:50,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2055066.0, ans=0.0 2023-06-25 19:29:55,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2055066.0, ans=0.1 2023-06-25 19:29:59,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2055066.0, ans=0.0 2023-06-25 19:30:28,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2055126.0, ans=0.0 2023-06-25 19:30:36,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2055186.0, ans=0.2 2023-06-25 19:30:40,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-25 19:30:58,051 INFO [train.py:996] (3/4) Epoch 12, batch 7100, loss[loss=0.183, simple_loss=0.2374, pruned_loss=0.06432, over 20820.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2986, pruned_loss=0.07498, over 4271239.80 frames. 
], batch size: 608, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:31:06,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2055246.0, ans=0.125 2023-06-25 19:31:34,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055306.0, ans=0.1 2023-06-25 19:32:06,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-25 19:32:28,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2055486.0, ans=0.125 2023-06-25 19:32:32,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.001e+02 9.554e+02 1.238e+03 1.810e+03 4.265e+03, threshold=2.476e+03, percent-clipped=5.0 2023-06-25 19:32:44,673 INFO [train.py:996] (3/4) Epoch 12, batch 7150, loss[loss=0.2508, simple_loss=0.3193, pruned_loss=0.0912, over 21435.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2974, pruned_loss=0.07355, over 4270487.02 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:32:57,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2055546.0, ans=0.04949747468305833 2023-06-25 19:33:44,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2055666.0, ans=0.125 2023-06-25 19:34:03,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-25 19:34:08,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2023-06-25 19:34:19,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055786.0, ans=0.1 2023-06-25 19:34:23,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2055786.0, ans=0.125 2023-06-25 19:34:31,889 INFO [train.py:996] (3/4) Epoch 12, batch 7200, loss[loss=0.2542, simple_loss=0.3153, pruned_loss=0.09653, over 21717.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2997, pruned_loss=0.07551, over 4273760.40 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:34:46,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2055846.0, ans=0.125 2023-06-25 19:34:46,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2055846.0, ans=0.125 2023-06-25 19:35:24,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2055966.0, ans=10.0 2023-06-25 19:36:17,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.008e+03 1.554e+03 2.511e+03 5.348e+03, threshold=3.107e+03, percent-clipped=25.0 2023-06-25 19:36:21,715 INFO [train.py:996] (3/4) Epoch 12, batch 7250, loss[loss=0.218, simple_loss=0.2784, pruned_loss=0.07877, over 15730.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2965, pruned_loss=0.07591, over 4270390.14 frames. 
], batch size: 66, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:36:22,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2056146.0, ans=0.1 2023-06-25 19:37:56,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2056386.0, ans=0.2 2023-06-25 19:38:03,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2056386.0, ans=0.0 2023-06-25 19:38:09,771 INFO [train.py:996] (3/4) Epoch 12, batch 7300, loss[loss=0.2543, simple_loss=0.2915, pruned_loss=0.1086, over 21500.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2904, pruned_loss=0.07521, over 4270525.12 frames. ], batch size: 512, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:38:10,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2056446.0, ans=0.0 2023-06-25 19:38:16,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2056446.0, ans=0.0 2023-06-25 19:38:45,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2056506.0, ans=0.125 2023-06-25 19:39:51,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056686.0, ans=0.125 2023-06-25 19:39:54,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.292e+02 1.311e+03 1.863e+03 3.675e+03, threshold=2.622e+03, percent-clipped=4.0 2023-06-25 19:39:59,537 INFO [train.py:996] (3/4) Epoch 12, batch 7350, loss[loss=0.2295, simple_loss=0.3067, pruned_loss=0.07614, over 21689.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2906, pruned_loss=0.07672, over 4266537.77 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:40:24,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2056746.0, ans=0.0 2023-06-25 19:41:16,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2056926.0, ans=0.125 2023-06-25 19:42:01,535 INFO [train.py:996] (3/4) Epoch 12, batch 7400, loss[loss=0.2057, simple_loss=0.3044, pruned_loss=0.05356, over 21831.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2967, pruned_loss=0.07663, over 4258529.77 frames. ], batch size: 372, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:42:55,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-25 19:43:31,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2057286.0, ans=0.125 2023-06-25 19:43:43,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 8.354e+02 1.413e+03 2.372e+03 4.608e+03, threshold=2.826e+03, percent-clipped=17.0 2023-06-25 19:43:48,712 INFO [train.py:996] (3/4) Epoch 12, batch 7450, loss[loss=0.1859, simple_loss=0.257, pruned_loss=0.05743, over 21526.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2958, pruned_loss=0.07488, over 4251623.77 frames. 
], batch size: 263, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:44:51,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2057526.0, ans=0.0 2023-06-25 19:44:58,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-25 19:45:28,785 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:45:31,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-25 19:45:31,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 19:45:39,184 INFO [train.py:996] (3/4) Epoch 12, batch 7500, loss[loss=0.3047, simple_loss=0.4038, pruned_loss=0.1028, over 21669.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3016, pruned_loss=0.07711, over 4257228.20 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:46:01,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2057706.0, ans=0.125 2023-06-25 19:47:23,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.023e+02 9.552e+02 1.495e+03 2.068e+03 4.249e+03, threshold=2.990e+03, percent-clipped=12.0 2023-06-25 19:47:26,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2057886.0, ans=0.2 2023-06-25 19:47:36,187 INFO [train.py:996] (3/4) Epoch 12, batch 7550, loss[loss=0.2332, simple_loss=0.2849, pruned_loss=0.09077, over 20324.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3075, pruned_loss=0.07645, over 4254097.52 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:48:35,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2058126.0, ans=0.125 2023-06-25 19:49:11,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2058186.0, ans=0.0 2023-06-25 19:49:14,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2058186.0, ans=0.0 2023-06-25 19:49:16,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2058246.0, ans=0.0 2023-06-25 19:49:17,670 INFO [train.py:996] (3/4) Epoch 12, batch 7600, loss[loss=0.2573, simple_loss=0.3184, pruned_loss=0.09808, over 21771.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3062, pruned_loss=0.07563, over 4262578.33 frames. ], batch size: 507, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:49:28,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2058246.0, ans=0.0 2023-06-25 19:49:39,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.28 vs. limit=6.0 2023-06-25 19:50:01,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. 
limit=15.0 2023-06-25 19:50:06,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2058366.0, ans=0.04949747468305833 2023-06-25 19:51:01,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 8.115e+02 1.091e+03 1.750e+03 3.922e+03, threshold=2.181e+03, percent-clipped=5.0 2023-06-25 19:51:12,191 INFO [train.py:996] (3/4) Epoch 12, batch 7650, loss[loss=0.2492, simple_loss=0.3125, pruned_loss=0.09294, over 21802.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3056, pruned_loss=0.07724, over 4269079.34 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:51:23,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2058546.0, ans=0.2 2023-06-25 19:51:28,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058606.0, ans=0.1 2023-06-25 19:51:57,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058666.0, ans=0.1 2023-06-25 19:51:58,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058666.0, ans=0.1 2023-06-25 19:52:02,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2058666.0, ans=0.0 2023-06-25 19:52:46,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2058786.0, ans=0.0 2023-06-25 19:53:04,299 INFO [train.py:996] (3/4) Epoch 12, batch 7700, loss[loss=0.2903, simple_loss=0.3547, pruned_loss=0.1129, over 21835.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3075, pruned_loss=0.07939, over 4275082.18 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:53:23,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-25 19:53:26,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2058906.0, ans=0.125 2023-06-25 19:53:30,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2058906.0, ans=0.0 2023-06-25 19:54:29,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2059026.0, ans=0.0 2023-06-25 19:54:47,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2059086.0, ans=0.125 2023-06-25 19:54:52,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.071e+02 8.849e+02 1.361e+03 2.010e+03 4.988e+03, threshold=2.722e+03, percent-clipped=19.0 2023-06-25 19:55:00,894 INFO [train.py:996] (3/4) Epoch 12, batch 7750, loss[loss=0.2622, simple_loss=0.3528, pruned_loss=0.08577, over 21609.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.311, pruned_loss=0.07936, over 4272491.22 frames. 
], batch size: 230, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:55:29,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059206.0, ans=0.1 2023-06-25 19:56:16,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 19:56:37,289 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:56:37,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-25 19:56:50,902 INFO [train.py:996] (3/4) Epoch 12, batch 7800, loss[loss=0.2197, simple_loss=0.2762, pruned_loss=0.08156, over 20823.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.315, pruned_loss=0.08039, over 4273254.40 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:56:55,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059446.0, ans=0.125 2023-06-25 19:57:01,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2059446.0, ans=0.125 2023-06-25 19:57:05,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2059446.0, ans=0.2 2023-06-25 19:57:08,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2059506.0, ans=0.125 2023-06-25 19:58:20,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059686.0, ans=0.1 2023-06-25 19:58:38,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.116e+03 1.579e+03 2.542e+03 4.639e+03, threshold=3.158e+03, percent-clipped=18.0 2023-06-25 19:58:42,191 INFO [train.py:996] (3/4) Epoch 12, batch 7850, loss[loss=0.198, simple_loss=0.2634, pruned_loss=0.06628, over 21522.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3079, pruned_loss=0.07965, over 4275318.23 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:59:05,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2059806.0, ans=0.125 2023-06-25 19:59:29,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-25 19:59:30,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2059866.0, ans=0.0 2023-06-25 19:59:32,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2059866.0, ans=0.125 2023-06-25 19:59:40,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-25 19:59:44,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2059926.0, ans=0.125 2023-06-25 20:00:33,859 INFO [train.py:996] (3/4) Epoch 12, batch 7900, loss[loss=0.218, simple_loss=0.2775, pruned_loss=0.0793, over 21140.00 frames. 
], tot_loss[loss=0.2308, simple_loss=0.3038, pruned_loss=0.07888, over 4276063.50 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:02:17,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2060286.0, ans=0.125 2023-06-25 20:02:23,975 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.116e+02 9.466e+02 1.589e+03 2.565e+03 7.022e+03, threshold=3.178e+03, percent-clipped=11.0 2023-06-25 20:02:27,426 INFO [train.py:996] (3/4) Epoch 12, batch 7950, loss[loss=0.2367, simple_loss=0.3264, pruned_loss=0.07355, over 21787.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3064, pruned_loss=0.07747, over 4278335.70 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:02:47,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2060346.0, ans=0.05 2023-06-25 20:03:18,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2060466.0, ans=0.125 2023-06-25 20:03:28,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-25 20:03:53,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2060526.0, ans=0.125 2023-06-25 20:04:21,416 INFO [train.py:996] (3/4) Epoch 12, batch 8000, loss[loss=0.2535, simple_loss=0.3362, pruned_loss=0.08541, over 21902.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3116, pruned_loss=0.08003, over 4272452.52 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:04:51,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2060706.0, ans=0.125 2023-06-25 20:05:52,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-25 20:06:06,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2060886.0, ans=0.2 2023-06-25 20:06:09,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2060886.0, ans=0.0 2023-06-25 20:06:16,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 1.106e+03 1.585e+03 2.816e+03 6.535e+03, threshold=3.170e+03, percent-clipped=17.0 2023-06-25 20:06:25,694 INFO [train.py:996] (3/4) Epoch 12, batch 8050, loss[loss=0.2246, simple_loss=0.2875, pruned_loss=0.08087, over 21496.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3163, pruned_loss=0.08034, over 4267378.78 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:06:26,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2060946.0, ans=0.0 2023-06-25 20:06:38,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2060946.0, ans=0.1 2023-06-25 20:06:42,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2061006.0, ans=0.125 2023-06-25 20:06:58,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2061006.0, ans=0.2 2023-06-25 20:07:12,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2061066.0, ans=0.125 2023-06-25 20:07:17,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2061066.0, ans=0.125 2023-06-25 20:08:15,646 INFO [train.py:996] (3/4) Epoch 12, batch 8100, loss[loss=0.2169, simple_loss=0.2967, pruned_loss=0.06855, over 21783.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3149, pruned_loss=0.08098, over 4268256.79 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:08:25,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2061246.0, ans=0.2 2023-06-25 20:08:25,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=15.0 2023-06-25 20:09:05,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2061366.0, ans=0.125 2023-06-25 20:09:13,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2061366.0, ans=0.125 2023-06-25 20:10:11,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 1.217e+03 1.664e+03 2.613e+03 6.278e+03, threshold=3.327e+03, percent-clipped=14.0 2023-06-25 20:10:14,522 INFO [train.py:996] (3/4) Epoch 12, batch 8150, loss[loss=0.2116, simple_loss=0.2904, pruned_loss=0.06644, over 21502.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3201, pruned_loss=0.08143, over 4258934.97 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:11:32,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 20:11:42,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2061786.0, ans=0.0 2023-06-25 20:12:03,266 INFO [train.py:996] (3/4) Epoch 12, batch 8200, loss[loss=0.2332, simple_loss=0.294, pruned_loss=0.08622, over 21805.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3124, pruned_loss=0.07931, over 4265163.08 frames. ], batch size: 352, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:12:12,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=22.5 2023-06-25 20:13:41,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2062086.0, ans=0.0 2023-06-25 20:13:50,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.075e+02 8.315e+02 1.225e+03 2.435e+03 5.897e+03, threshold=2.449e+03, percent-clipped=16.0 2023-06-25 20:13:54,416 INFO [train.py:996] (3/4) Epoch 12, batch 8250, loss[loss=0.2219, simple_loss=0.2977, pruned_loss=0.07311, over 21298.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3114, pruned_loss=0.07833, over 4270309.59 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:14:05,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2062146.0, ans=0.125 2023-06-25 20:15:02,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2062326.0, ans=0.125 2023-06-25 20:15:44,041 INFO [train.py:996] (3/4) Epoch 12, batch 8300, loss[loss=0.1964, simple_loss=0.2719, pruned_loss=0.06049, over 21298.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3097, pruned_loss=0.07561, over 4269111.43 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:16:35,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2062566.0, ans=15.0 2023-06-25 20:16:36,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2062566.0, ans=0.125 2023-06-25 20:16:50,868 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:17:28,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-25 20:17:29,289 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.193e+02 1.531e+03 2.133e+03 6.656e+03, threshold=3.063e+03, percent-clipped=18.0 2023-06-25 20:17:32,759 INFO [train.py:996] (3/4) Epoch 12, batch 8350, loss[loss=0.2125, simple_loss=0.2987, pruned_loss=0.06311, over 21234.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3097, pruned_loss=0.07409, over 4269049.98 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:18:07,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-25 20:18:07,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-06-25 20:18:10,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2062806.0, ans=0.0 2023-06-25 20:18:15,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-25 20:19:26,193 INFO [train.py:996] (3/4) Epoch 12, batch 8400, loss[loss=0.1782, simple_loss=0.2525, pruned_loss=0.05194, over 21715.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.306, pruned_loss=0.07117, over 4270284.43 frames. 
], batch size: 124, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 20:19:26,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2063046.0, ans=0.125 2023-06-25 20:20:06,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2063106.0, ans=0.2 2023-06-25 20:20:15,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 20:20:19,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2063166.0, ans=0.125 2023-06-25 20:20:19,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2063166.0, ans=0.125 2023-06-25 20:20:50,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2063226.0, ans=0.0 2023-06-25 20:20:51,958 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:21:11,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2063286.0, ans=0.125 2023-06-25 20:21:14,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.481e+02 1.044e+03 1.681e+03 2.727e+03 5.790e+03, threshold=3.363e+03, percent-clipped=19.0 2023-06-25 20:21:14,112 INFO [train.py:996] (3/4) Epoch 12, batch 8450, loss[loss=0.2456, simple_loss=0.3094, pruned_loss=0.09091, over 21882.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3048, pruned_loss=0.0712, over 4271550.49 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:22:15,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063466.0, ans=0.1 2023-06-25 20:22:41,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063526.0, ans=0.1 2023-06-25 20:22:43,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2063526.0, ans=0.125 2023-06-25 20:23:04,267 INFO [train.py:996] (3/4) Epoch 12, batch 8500, loss[loss=0.2274, simple_loss=0.327, pruned_loss=0.06386, over 20901.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3006, pruned_loss=0.0722, over 4272806.52 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:23:24,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2063646.0, ans=0.125 2023-06-25 20:23:52,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2063766.0, ans=0.0 2023-06-25 20:24:34,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2063826.0, ans=0.125 2023-06-25 20:24:58,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.731e+02 1.377e+03 2.109e+03 5.965e+03, threshold=2.755e+03, percent-clipped=8.0 2023-06-25 20:24:58,620 INFO [train.py:996] (3/4) Epoch 12, batch 8550, loss[loss=0.2309, simple_loss=0.3296, pruned_loss=0.06611, over 21798.00 frames. 
], tot_loss[loss=0.2286, simple_loss=0.3062, pruned_loss=0.07547, over 4274433.15 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:25:19,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-25 20:25:35,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2064006.0, ans=0.125 2023-06-25 20:25:55,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2064066.0, ans=0.05 2023-06-25 20:26:06,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2064066.0, ans=0.125 2023-06-25 20:26:19,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-25 20:26:26,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064126.0, ans=0.1 2023-06-25 20:26:36,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2064186.0, ans=0.09899494936611666 2023-06-25 20:26:46,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2064186.0, ans=0.125 2023-06-25 20:26:57,753 INFO [train.py:996] (3/4) Epoch 12, batch 8600, loss[loss=0.241, simple_loss=0.3192, pruned_loss=0.08143, over 21453.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3139, pruned_loss=0.07774, over 4279896.38 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:26:59,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.68 vs. limit=15.0 2023-06-25 20:27:47,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2064306.0, ans=0.0 2023-06-25 20:28:06,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2064426.0, ans=0.1 2023-06-25 20:28:15,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2064426.0, ans=0.2 2023-06-25 20:28:17,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2064426.0, ans=0.125 2023-06-25 20:28:53,058 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.484e+02 8.709e+02 1.194e+03 1.940e+03 4.638e+03, threshold=2.389e+03, percent-clipped=12.0 2023-06-25 20:28:53,081 INFO [train.py:996] (3/4) Epoch 12, batch 8650, loss[loss=0.2275, simple_loss=0.3279, pruned_loss=0.06353, over 21692.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3177, pruned_loss=0.07794, over 4281818.12 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:29:42,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2064666.0, ans=0.0 2023-06-25 20:30:08,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. 
limit=15.0 2023-06-25 20:30:30,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2064786.0, ans=10.0 2023-06-25 20:30:40,423 INFO [train.py:996] (3/4) Epoch 12, batch 8700, loss[loss=0.1957, simple_loss=0.2625, pruned_loss=0.06443, over 21634.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3087, pruned_loss=0.07461, over 4286400.84 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:30:53,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2064846.0, ans=0.125 2023-06-25 20:30:58,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2064906.0, ans=0.125 2023-06-25 20:31:05,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2064906.0, ans=0.0 2023-06-25 20:31:38,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2064966.0, ans=0.0 2023-06-25 20:32:02,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2065086.0, ans=0.125 2023-06-25 20:32:25,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 9.432e+02 1.371e+03 2.158e+03 4.053e+03, threshold=2.743e+03, percent-clipped=19.0 2023-06-25 20:32:25,099 INFO [train.py:996] (3/4) Epoch 12, batch 8750, loss[loss=0.2323, simple_loss=0.3047, pruned_loss=0.07993, over 21929.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3034, pruned_loss=0.07558, over 4272293.67 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:32:41,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2065146.0, ans=0.0 2023-06-25 20:34:04,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2065386.0, ans=0.2 2023-06-25 20:34:21,893 INFO [train.py:996] (3/4) Epoch 12, batch 8800, loss[loss=0.3121, simple_loss=0.3842, pruned_loss=0.12, over 21783.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3122, pruned_loss=0.07844, over 4276438.08 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:35:48,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2065686.0, ans=0.0 2023-06-25 20:36:06,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2065686.0, ans=10.0 2023-06-25 20:36:12,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.737e+02 1.070e+03 1.502e+03 2.029e+03 4.867e+03, threshold=3.004e+03, percent-clipped=9.0 2023-06-25 20:36:12,866 INFO [train.py:996] (3/4) Epoch 12, batch 8850, loss[loss=0.2075, simple_loss=0.2944, pruned_loss=0.06027, over 21743.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3184, pruned_loss=0.07997, over 4278815.07 frames. 
], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:36:41,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2065806.0, ans=0.0 2023-06-25 20:36:50,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2065806.0, ans=0.125 2023-06-25 20:37:06,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2065866.0, ans=0.0 2023-06-25 20:37:09,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2065866.0, ans=0.125 2023-06-25 20:37:21,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2065926.0, ans=0.125 2023-06-25 20:37:25,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-25 20:37:30,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2065926.0, ans=0.125 2023-06-25 20:37:47,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2065986.0, ans=0.09899494936611666 2023-06-25 20:37:56,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065986.0, ans=0.1 2023-06-25 20:38:14,919 INFO [train.py:996] (3/4) Epoch 12, batch 8900, loss[loss=0.1868, simple_loss=0.2603, pruned_loss=0.05665, over 21390.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3134, pruned_loss=0.07915, over 4265637.93 frames. ], batch size: 211, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:38:20,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066046.0, ans=0.1 2023-06-25 20:38:30,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2066046.0, ans=15.0 2023-06-25 20:39:04,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2066166.0, ans=0.0 2023-06-25 20:39:22,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2066226.0, ans=0.0 2023-06-25 20:39:59,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2066286.0, ans=0.125 2023-06-25 20:40:06,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2066286.0, ans=0.125 2023-06-25 20:40:09,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.490e+02 1.490e+03 1.972e+03 6.536e+03, threshold=2.979e+03, percent-clipped=12.0 2023-06-25 20:40:09,315 INFO [train.py:996] (3/4) Epoch 12, batch 8950, loss[loss=0.2727, simple_loss=0.3603, pruned_loss=0.09252, over 21648.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3133, pruned_loss=0.07763, over 4251654.65 frames. 
], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:13,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2066526.0, ans=0.0 2023-06-25 20:41:20,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2066526.0, ans=0.125 2023-06-25 20:41:57,034 INFO [train.py:996] (3/4) Epoch 12, batch 9000, loss[loss=0.2207, simple_loss=0.2759, pruned_loss=0.08276, over 21352.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3056, pruned_loss=0.07682, over 4259817.84 frames. ], batch size: 144, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:57,035 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 20:42:15,054 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2658, simple_loss=0.3589, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-25 20:42:15,055 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 20:42:52,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2066706.0, ans=0.125 2023-06-25 20:43:11,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-25 20:43:49,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2066886.0, ans=0.125 2023-06-25 20:44:02,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.409e+02 9.189e+02 1.168e+03 1.657e+03 3.184e+03, threshold=2.336e+03, percent-clipped=2.0 2023-06-25 20:44:02,560 INFO [train.py:996] (3/4) Epoch 12, batch 9050, loss[loss=0.2511, simple_loss=0.3264, pruned_loss=0.08791, over 21786.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3003, pruned_loss=0.07305, over 4256307.60 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:44:16,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-25 20:45:59,845 INFO [train.py:996] (3/4) Epoch 12, batch 9100, loss[loss=0.2175, simple_loss=0.3175, pruned_loss=0.0587, over 21757.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3065, pruned_loss=0.07601, over 4259827.07 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:46:14,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-25 20:47:05,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2067366.0, ans=0.2 2023-06-25 20:47:20,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-25 20:47:43,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2067486.0, ans=0.05 2023-06-25 20:47:56,004 INFO [train.py:996] (3/4) Epoch 12, batch 9150, loss[loss=0.2426, simple_loss=0.3273, pruned_loss=0.07895, over 21663.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3109, pruned_loss=0.07398, over 4249897.42 frames. 
], batch size: 263, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:47:57,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.614e+02 8.747e+02 1.495e+03 2.212e+03 5.275e+03, threshold=2.990e+03, percent-clipped=21.0 2023-06-25 20:48:19,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2067606.0, ans=0.1 2023-06-25 20:49:44,665 INFO [train.py:996] (3/4) Epoch 12, batch 9200, loss[loss=0.2276, simple_loss=0.3174, pruned_loss=0.06893, over 19858.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.312, pruned_loss=0.0723, over 4258758.41 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:49:46,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2067846.0, ans=0.1 2023-06-25 20:50:28,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:50:41,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2067966.0, ans=0.125 2023-06-25 20:51:29,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068086.0, ans=0.1 2023-06-25 20:51:31,750 INFO [train.py:996] (3/4) Epoch 12, batch 9250, loss[loss=0.288, simple_loss=0.353, pruned_loss=0.1116, over 21401.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3148, pruned_loss=0.07575, over 4254807.80 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:51:33,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.008e+02 1.206e+03 1.769e+03 2.365e+03 5.380e+03, threshold=3.537e+03, percent-clipped=9.0 2023-06-25 20:53:02,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068386.0, ans=0.125 2023-06-25 20:53:22,059 INFO [train.py:996] (3/4) Epoch 12, batch 9300, loss[loss=0.2396, simple_loss=0.3345, pruned_loss=0.07232, over 21652.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.312, pruned_loss=0.07591, over 4253368.53 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:53:41,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2068446.0, ans=0.0 2023-06-25 20:54:49,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2068626.0, ans=0.125 2023-06-25 20:55:18,008 INFO [train.py:996] (3/4) Epoch 12, batch 9350, loss[loss=0.281, simple_loss=0.3565, pruned_loss=0.1027, over 21454.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.318, pruned_loss=0.07741, over 4254839.46 frames. 
], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:55:18,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068746.0, ans=0.125 2023-06-25 20:55:18,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2068746.0, ans=0.125 2023-06-25 20:55:19,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.869e+02 1.391e+03 2.110e+03 3.228e+03 6.570e+03, threshold=4.220e+03, percent-clipped=18.0 2023-06-25 20:55:39,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2068806.0, ans=0.0 2023-06-25 20:56:18,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068866.0, ans=0.125 2023-06-25 20:56:31,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-25 20:57:09,363 INFO [train.py:996] (3/4) Epoch 12, batch 9400, loss[loss=0.2555, simple_loss=0.3794, pruned_loss=0.06583, over 19759.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3198, pruned_loss=0.07813, over 4253757.01 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:57:13,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2069046.0, ans=0.125 2023-06-25 20:58:40,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2069286.0, ans=0.0 2023-06-25 20:58:57,572 INFO [train.py:996] (3/4) Epoch 12, batch 9450, loss[loss=0.2226, simple_loss=0.287, pruned_loss=0.07909, over 21522.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.313, pruned_loss=0.07689, over 4249010.97 frames. ], batch size: 391, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:58:59,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.200e+02 9.646e+02 1.392e+03 2.055e+03 4.300e+03, threshold=2.785e+03, percent-clipped=2.0 2023-06-25 20:59:01,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2069346.0, ans=0.0 2023-06-25 20:59:14,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2069406.0, ans=0.125 2023-06-25 20:59:14,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2069406.0, ans=0.125 2023-06-25 21:00:45,736 INFO [train.py:996] (3/4) Epoch 12, batch 9500, loss[loss=0.1877, simple_loss=0.2625, pruned_loss=0.05646, over 21611.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3079, pruned_loss=0.07616, over 4245637.12 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:01:56,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-25 21:02:29,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2069886.0, ans=0.125 2023-06-25 21:02:37,410 INFO [train.py:996] (3/4) Epoch 12, batch 9550, loss[loss=0.2662, simple_loss=0.3322, pruned_loss=0.1, over 21185.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3092, pruned_loss=0.07754, over 4253492.11 frames. 
], batch size: 143, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:02:40,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 1.229e+03 2.098e+03 2.991e+03 5.309e+03, threshold=4.197e+03, percent-clipped=32.0 2023-06-25 21:02:49,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2069946.0, ans=0.0 2023-06-25 21:03:15,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2070006.0, ans=0.015 2023-06-25 21:03:27,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-25 21:03:43,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2070126.0, ans=0.0 2023-06-25 21:04:14,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2070186.0, ans=0.0 2023-06-25 21:04:20,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-25 21:04:29,462 INFO [train.py:996] (3/4) Epoch 12, batch 9600, loss[loss=0.2871, simple_loss=0.3609, pruned_loss=0.1067, over 21505.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.314, pruned_loss=0.08027, over 4262648.91 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:04:36,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2070246.0, ans=0.2 2023-06-25 21:05:14,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2070366.0, ans=0.0 2023-06-25 21:06:19,435 INFO [train.py:996] (3/4) Epoch 12, batch 9650, loss[loss=0.306, simple_loss=0.3636, pruned_loss=0.1242, over 21388.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3138, pruned_loss=0.08047, over 4271893.71 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:06:21,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2070546.0, ans=10.0 2023-06-25 21:06:23,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.038e+02 1.081e+03 1.726e+03 2.542e+03 4.912e+03, threshold=3.453e+03, percent-clipped=2.0 2023-06-25 21:07:04,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2070666.0, ans=0.125 2023-06-25 21:08:13,733 INFO [train.py:996] (3/4) Epoch 12, batch 9700, loss[loss=0.2347, simple_loss=0.3111, pruned_loss=0.07915, over 21656.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3174, pruned_loss=0.08139, over 4266615.15 frames. 
], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:08:38,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2070906.0, ans=0.0 2023-06-25 21:08:43,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2070906.0, ans=10.0 2023-06-25 21:08:57,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2070966.0, ans=0.0 2023-06-25 21:09:10,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2070966.0, ans=0.125 2023-06-25 21:09:30,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-25 21:09:36,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071026.0, ans=0.1 2023-06-25 21:09:42,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2071086.0, ans=15.0 2023-06-25 21:09:43,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2071086.0, ans=0.125 2023-06-25 21:10:01,539 INFO [train.py:996] (3/4) Epoch 12, batch 9750, loss[loss=0.2249, simple_loss=0.2898, pruned_loss=0.08, over 21405.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3125, pruned_loss=0.08015, over 4270535.38 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:10:04,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 1.223e+03 1.776e+03 2.509e+03 4.467e+03, threshold=3.552e+03, percent-clipped=5.0 2023-06-25 21:10:36,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 21:11:42,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2071446.0, ans=0.125 2023-06-25 21:11:43,854 INFO [train.py:996] (3/4) Epoch 12, batch 9800, loss[loss=0.2313, simple_loss=0.304, pruned_loss=0.07929, over 21746.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3102, pruned_loss=0.07995, over 4261477.06 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:12:07,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2071446.0, ans=0.125 2023-06-25 21:12:11,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-25 21:12:19,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2071506.0, ans=0.0 2023-06-25 21:13:11,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2071626.0, ans=0.025 2023-06-25 21:13:12,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.30 vs. limit=22.5 2023-06-25 21:13:15,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. 
limit=22.5 2023-06-25 21:13:25,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2071686.0, ans=0.125 2023-06-25 21:13:35,334 INFO [train.py:996] (3/4) Epoch 12, batch 9850, loss[loss=0.2203, simple_loss=0.3124, pruned_loss=0.06411, over 21404.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3068, pruned_loss=0.07972, over 4262615.96 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:13:44,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.466e+02 9.389e+02 1.315e+03 1.721e+03 3.595e+03, threshold=2.631e+03, percent-clipped=2.0 2023-06-25 21:13:45,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2071746.0, ans=0.1 2023-06-25 21:14:30,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2071866.0, ans=0.125 2023-06-25 21:14:37,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2071926.0, ans=0.125 2023-06-25 21:15:09,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2071986.0, ans=0.05 2023-06-25 21:15:31,644 INFO [train.py:996] (3/4) Epoch 12, batch 9900, loss[loss=0.2505, simple_loss=0.3208, pruned_loss=0.09009, over 21804.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3037, pruned_loss=0.07853, over 4259071.67 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:15:34,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-25 21:15:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2072106.0, ans=0.125 2023-06-25 21:16:00,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2072106.0, ans=0.125 2023-06-25 21:16:05,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2072106.0, ans=0.0 2023-06-25 21:16:10,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2072166.0, ans=0.035 2023-06-25 21:16:13,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072166.0, ans=0.1 2023-06-25 21:16:14,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 21:16:50,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2072226.0, ans=0.125 2023-06-25 21:17:12,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2072286.0, ans=0.2 2023-06-25 21:17:16,864 INFO [train.py:996] (3/4) Epoch 12, batch 9950, loss[loss=0.2284, simple_loss=0.2845, pruned_loss=0.08617, over 21595.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3047, pruned_loss=0.0799, over 4249400.69 frames. 
], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:17:25,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.617e+02 8.355e+02 1.275e+03 2.214e+03 4.972e+03, threshold=2.550e+03, percent-clipped=15.0 2023-06-25 21:17:51,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-25 21:17:56,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2072406.0, ans=0.2 2023-06-25 21:18:01,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2072406.0, ans=0.125 2023-06-25 21:18:47,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2072526.0, ans=0.125 2023-06-25 21:18:57,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2072586.0, ans=0.2 2023-06-25 21:19:16,242 INFO [train.py:996] (3/4) Epoch 12, batch 10000, loss[loss=0.2351, simple_loss=0.3011, pruned_loss=0.08455, over 21837.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.302, pruned_loss=0.07982, over 4253139.80 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 21:19:43,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2072706.0, ans=0.125 2023-06-25 21:19:43,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2072706.0, ans=0.0 2023-06-25 21:19:59,006 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:20:29,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2072826.0, ans=0.125 2023-06-25 21:20:38,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.97 vs. limit=22.5 2023-06-25 21:20:40,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-25 21:21:04,406 INFO [train.py:996] (3/4) Epoch 12, batch 10050, loss[loss=0.1925, simple_loss=0.2727, pruned_loss=0.05616, over 21543.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3027, pruned_loss=0.08009, over 4261068.67 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:21:16,481 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.782e+02 1.169e+03 1.851e+03 4.390e+03, threshold=2.338e+03, percent-clipped=10.0 2023-06-25 21:21:20,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2072946.0, ans=0.04949747468305833 2023-06-25 21:21:30,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2073006.0, ans=0.04949747468305833 2023-06-25 21:21:31,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=12.0 2023-06-25 21:21:39,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2073006.0, ans=0.125 2023-06-25 21:22:39,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-25 21:22:42,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2073186.0, ans=0.0 2023-06-25 21:23:02,179 INFO [train.py:996] (3/4) Epoch 12, batch 10100, loss[loss=0.2531, simple_loss=0.3715, pruned_loss=0.06734, over 19860.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3026, pruned_loss=0.07851, over 4260185.62 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:23:27,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2073306.0, ans=0.125 2023-06-25 21:23:27,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2073306.0, ans=0.2 2023-06-25 21:23:54,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2073366.0, ans=0.025 2023-06-25 21:24:23,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2073426.0, ans=10.0 2023-06-25 21:24:42,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2073486.0, ans=0.125 2023-06-25 21:24:52,514 INFO [train.py:996] (3/4) Epoch 12, batch 10150, loss[loss=0.2366, simple_loss=0.3146, pruned_loss=0.07931, over 21819.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3063, pruned_loss=0.08013, over 4269533.11 frames. ], batch size: 118, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:24:56,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2073546.0, ans=0.0 2023-06-25 21:24:57,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2073546.0, ans=0.5 2023-06-25 21:24:58,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.161e+02 9.851e+02 1.656e+03 2.564e+03 6.129e+03, threshold=3.312e+03, percent-clipped=27.0 2023-06-25 21:25:03,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. limit=5.0 2023-06-25 21:25:38,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-25 21:25:55,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2073726.0, ans=0.0 2023-06-25 21:26:41,518 INFO [train.py:996] (3/4) Epoch 12, batch 10200, loss[loss=0.212, simple_loss=0.2997, pruned_loss=0.06219, over 21724.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3074, pruned_loss=0.07954, over 4262781.79 frames. 
], batch size: 352, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:26:55,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2073846.0, ans=0.0 2023-06-25 21:27:06,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2073906.0, ans=0.0 2023-06-25 21:28:31,283 INFO [train.py:996] (3/4) Epoch 12, batch 10250, loss[loss=0.1531, simple_loss=0.2382, pruned_loss=0.03395, over 21402.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3011, pruned_loss=0.07332, over 4258194.89 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:28:38,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 7.759e+02 1.163e+03 1.703e+03 3.224e+03, threshold=2.326e+03, percent-clipped=0.0 2023-06-25 21:28:47,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2074206.0, ans=0.125 2023-06-25 21:29:14,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=2074206.0, ans=22.5 2023-06-25 21:30:21,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2074386.0, ans=10.0 2023-06-25 21:30:24,246 INFO [train.py:996] (3/4) Epoch 12, batch 10300, loss[loss=0.2286, simple_loss=0.2815, pruned_loss=0.0879, over 20331.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3027, pruned_loss=0.07437, over 4258195.50 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:30:24,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2074446.0, ans=0.125 2023-06-25 21:30:28,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2074446.0, ans=0.025 2023-06-25 21:30:58,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-25 21:31:08,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-06-25 21:31:13,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2074566.0, ans=0.04949747468305833 2023-06-25 21:31:34,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2074566.0, ans=0.015 2023-06-25 21:32:13,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074686.0, ans=0.1 2023-06-25 21:32:24,442 INFO [train.py:996] (3/4) Epoch 12, batch 10350, loss[loss=0.2054, simple_loss=0.2812, pruned_loss=0.06477, over 21734.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3048, pruned_loss=0.07394, over 4258291.40 frames. 
], batch size: 282, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:32:31,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.446e+02 9.833e+02 1.578e+03 2.772e+03 4.867e+03, threshold=3.157e+03, percent-clipped=30.0 2023-06-25 21:32:35,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2074746.0, ans=0.125 2023-06-25 21:32:51,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-25 21:33:27,235 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:34:19,095 INFO [train.py:996] (3/4) Epoch 12, batch 10400, loss[loss=0.2329, simple_loss=0.3095, pruned_loss=0.07812, over 21921.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3003, pruned_loss=0.07312, over 4267830.54 frames. ], batch size: 373, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:34:24,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2075046.0, ans=0.1 2023-06-25 21:34:34,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2075046.0, ans=0.0 2023-06-25 21:35:13,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2075166.0, ans=0.95 2023-06-25 21:35:15,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-25 21:35:16,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2075166.0, ans=0.0 2023-06-25 21:35:36,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2075226.0, ans=0.2 2023-06-25 21:36:14,870 INFO [train.py:996] (3/4) Epoch 12, batch 10450, loss[loss=0.2545, simple_loss=0.3458, pruned_loss=0.08157, over 21644.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3056, pruned_loss=0.07708, over 4274131.49 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:36:22,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.406e+02 1.055e+03 1.797e+03 3.015e+03 7.446e+03, threshold=3.594e+03, percent-clipped=22.0 2023-06-25 21:36:40,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-25 21:37:04,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2075466.0, ans=0.125 2023-06-25 21:37:34,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 21:37:44,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=22.5 2023-06-25 21:37:49,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-25 21:38:00,965 INFO [train.py:996] (3/4) Epoch 12, batch 10500, loss[loss=0.271, simple_loss=0.3195, pruned_loss=0.1113, over 21407.00 frames. 
], tot_loss[loss=0.2289, simple_loss=0.3056, pruned_loss=0.07611, over 4266095.23 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:38:17,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2075646.0, ans=0.125 2023-06-25 21:38:24,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-25 21:39:53,045 INFO [train.py:996] (3/4) Epoch 12, batch 10550, loss[loss=0.1937, simple_loss=0.2523, pruned_loss=0.06756, over 21480.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3007, pruned_loss=0.07508, over 4264547.43 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:40:05,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 1.039e+03 1.480e+03 2.237e+03 5.696e+03, threshold=2.960e+03, percent-clipped=5.0 2023-06-25 21:40:33,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-25 21:40:41,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-25 21:41:39,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2076186.0, ans=0.2 2023-06-25 21:41:44,554 INFO [train.py:996] (3/4) Epoch 12, batch 10600, loss[loss=0.2006, simple_loss=0.2961, pruned_loss=0.05253, over 21686.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2948, pruned_loss=0.07286, over 4268077.97 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:41:50,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2076246.0, ans=0.1 2023-06-25 21:41:55,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2076246.0, ans=0.1 2023-06-25 21:41:59,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2076246.0, ans=0.125 2023-06-25 21:42:25,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2076306.0, ans=0.1 2023-06-25 21:42:34,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2076366.0, ans=0.2 2023-06-25 21:42:41,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=2076366.0, ans=0.1 2023-06-25 21:43:41,755 INFO [train.py:996] (3/4) Epoch 12, batch 10650, loss[loss=0.2367, simple_loss=0.3208, pruned_loss=0.07635, over 21586.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2965, pruned_loss=0.07183, over 4265751.00 frames. 
], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:43:48,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.385e+02 8.599e+02 1.360e+03 2.205e+03 4.639e+03, threshold=2.719e+03, percent-clipped=13.0 2023-06-25 21:44:33,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=2076666.0, ans=22.5 2023-06-25 21:45:03,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2076726.0, ans=0.0 2023-06-25 21:45:06,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2076726.0, ans=0.125 2023-06-25 21:45:32,583 INFO [train.py:996] (3/4) Epoch 12, batch 10700, loss[loss=0.2596, simple_loss=0.3277, pruned_loss=0.09571, over 21498.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2951, pruned_loss=0.07206, over 4265021.87 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:46:41,431 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:47:04,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2077086.0, ans=0.0 2023-06-25 21:47:15,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2077086.0, ans=0.0 2023-06-25 21:47:29,848 INFO [train.py:996] (3/4) Epoch 12, batch 10750, loss[loss=0.1887, simple_loss=0.2635, pruned_loss=0.05695, over 20770.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3048, pruned_loss=0.0759, over 4265204.94 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:47:36,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 8.331e+02 1.141e+03 1.485e+03 5.663e+03, threshold=2.282e+03, percent-clipped=4.0 2023-06-25 21:47:54,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2077206.0, ans=0.125 2023-06-25 21:49:20,787 INFO [train.py:996] (3/4) Epoch 12, batch 10800, loss[loss=0.2398, simple_loss=0.3577, pruned_loss=0.0609, over 20721.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3088, pruned_loss=0.07645, over 4260066.51 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 21:49:55,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2077506.0, ans=0.125 2023-06-25 21:50:06,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2077566.0, ans=0.0 2023-06-25 21:50:20,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-25 21:50:37,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2077626.0, ans=0.125 2023-06-25 21:50:37,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-25 21:50:53,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=12.0 2023-06-25 21:50:59,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-25 21:51:01,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2077686.0, ans=0.1 2023-06-25 21:51:15,766 INFO [train.py:996] (3/4) Epoch 12, batch 10850, loss[loss=0.2013, simple_loss=0.2699, pruned_loss=0.06633, over 15490.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3108, pruned_loss=0.07733, over 4249782.26 frames. ], batch size: 61, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:51:31,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 1.105e+03 1.666e+03 2.535e+03 5.598e+03, threshold=3.333e+03, percent-clipped=30.0 2023-06-25 21:53:15,996 INFO [train.py:996] (3/4) Epoch 12, batch 10900, loss[loss=0.2004, simple_loss=0.2672, pruned_loss=0.06676, over 21370.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3039, pruned_loss=0.07593, over 4250721.56 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:53:57,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-25 21:54:22,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2078226.0, ans=0.0 2023-06-25 21:54:50,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2078286.0, ans=0.0 2023-06-25 21:54:50,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2078286.0, ans=0.2 2023-06-25 21:54:59,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2078346.0, ans=0.04949747468305833 2023-06-25 21:55:00,766 INFO [train.py:996] (3/4) Epoch 12, batch 10950, loss[loss=0.2414, simple_loss=0.3453, pruned_loss=0.06876, over 20847.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3004, pruned_loss=0.07426, over 4254488.70 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:55:02,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2078346.0, ans=0.1 2023-06-25 21:55:18,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 8.397e+02 1.288e+03 2.017e+03 5.203e+03, threshold=2.576e+03, percent-clipped=6.0 2023-06-25 21:55:22,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2078406.0, ans=0.0 2023-06-25 21:55:24,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2078406.0, ans=0.2 2023-06-25 21:55:26,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-25 21:55:29,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2078406.0, ans=0.025 2023-06-25 21:55:48,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.85 vs. 
limit=15.0 2023-06-25 21:56:01,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2078466.0, ans=0.015 2023-06-25 21:56:43,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2078586.0, ans=0.0 2023-06-25 21:56:51,896 INFO [train.py:996] (3/4) Epoch 12, batch 11000, loss[loss=0.2578, simple_loss=0.3132, pruned_loss=0.1013, over 21561.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3005, pruned_loss=0.07487, over 4254004.76 frames. ], batch size: 212, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:56:53,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2078646.0, ans=0.0 2023-06-25 21:57:04,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2078646.0, ans=0.125 2023-06-25 21:57:22,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2078706.0, ans=0.125 2023-06-25 21:58:05,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2078826.0, ans=0.125 2023-06-25 21:58:09,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2078826.0, ans=0.5 2023-06-25 21:58:28,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2078886.0, ans=0.125 2023-06-25 21:58:39,768 INFO [train.py:996] (3/4) Epoch 12, batch 11050, loss[loss=0.2074, simple_loss=0.2635, pruned_loss=0.07564, over 21388.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2993, pruned_loss=0.07581, over 4257069.06 frames. ], batch size: 177, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:58:57,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.227e+02 8.641e+02 1.192e+03 1.708e+03 4.413e+03, threshold=2.383e+03, percent-clipped=10.0 2023-06-25 21:59:33,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.45 vs. limit=12.0 2023-06-25 22:00:15,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2079186.0, ans=0.0 2023-06-25 22:00:20,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2079186.0, ans=0.1 2023-06-25 22:00:22,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2079186.0, ans=0.125 2023-06-25 22:00:30,664 INFO [train.py:996] (3/4) Epoch 12, batch 11100, loss[loss=0.1971, simple_loss=0.2677, pruned_loss=0.0633, over 21781.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.296, pruned_loss=0.07539, over 4257164.02 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:02:21,195 INFO [train.py:996] (3/4) Epoch 12, batch 11150, loss[loss=0.2025, simple_loss=0.2741, pruned_loss=0.06544, over 21842.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2931, pruned_loss=0.07512, over 4245467.55 frames. 
], batch size: 318, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:02:22,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2079546.0, ans=15.0 2023-06-25 22:02:38,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.021e+02 8.746e+02 1.220e+03 1.810e+03 4.073e+03, threshold=2.441e+03, percent-clipped=6.0 2023-06-25 22:02:50,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2079606.0, ans=0.0 2023-06-25 22:02:52,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2079606.0, ans=0.125 2023-06-25 22:03:50,398 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:04:09,049 INFO [train.py:996] (3/4) Epoch 12, batch 11200, loss[loss=0.1988, simple_loss=0.2702, pruned_loss=0.06371, over 21532.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2913, pruned_loss=0.07419, over 4251738.95 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:04:39,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:04:44,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2079906.0, ans=0.125 2023-06-25 22:05:02,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2079966.0, ans=0.125 2023-06-25 22:05:03,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 22:05:04,801 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:05:22,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2080026.0, ans=0.0 2023-06-25 22:05:22,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2080026.0, ans=0.1 2023-06-25 22:05:35,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 22:05:40,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2080086.0, ans=0.09899494936611666 2023-06-25 22:05:53,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2080086.0, ans=0.0 2023-06-25 22:05:56,513 INFO [train.py:996] (3/4) Epoch 12, batch 11250, loss[loss=0.2278, simple_loss=0.3066, pruned_loss=0.07443, over 21871.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2903, pruned_loss=0.075, over 4255854.53 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:06:07,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2080146.0, ans=0.125 2023-06-25 22:06:14,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 9.018e+02 1.214e+03 1.830e+03 3.568e+03, threshold=2.429e+03, percent-clipped=9.0 2023-06-25 22:06:16,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2080146.0, ans=0.0 2023-06-25 22:06:18,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080206.0, ans=0.1 2023-06-25 22:07:05,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2080326.0, ans=0.125 2023-06-25 22:07:26,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2080386.0, ans=0.125 2023-06-25 22:07:46,474 INFO [train.py:996] (3/4) Epoch 12, batch 11300, loss[loss=0.2074, simple_loss=0.2783, pruned_loss=0.0682, over 21194.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2924, pruned_loss=0.07489, over 4255221.61 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:07:47,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-25 22:08:42,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2080566.0, ans=0.125 2023-06-25 22:09:00,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2080626.0, ans=0.0 2023-06-25 22:09:30,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2080686.0, ans=0.125 2023-06-25 22:09:31,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2080686.0, ans=0.2 2023-06-25 22:09:41,335 INFO [train.py:996] (3/4) Epoch 12, batch 11350, loss[loss=0.2605, simple_loss=0.3343, pruned_loss=0.0933, over 21774.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2936, pruned_loss=0.07389, over 4265761.02 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:09:49,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2080746.0, ans=0.125 2023-06-25 22:09:49,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2080746.0, ans=0.125 2023-06-25 22:09:54,125 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 9.372e+02 1.256e+03 1.751e+03 3.011e+03, threshold=2.512e+03, percent-clipped=9.0 2023-06-25 22:10:10,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2080806.0, ans=0.0 2023-06-25 22:11:04,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2080926.0, ans=0.125 2023-06-25 22:11:36,885 INFO [train.py:996] (3/4) Epoch 12, batch 11400, loss[loss=0.2064, simple_loss=0.3058, pruned_loss=0.0535, over 21721.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.2988, pruned_loss=0.07557, over 4270894.46 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:11:42,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2081046.0, ans=0.125 2023-06-25 22:11:45,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2081046.0, ans=0.0 2023-06-25 22:11:49,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-25 22:11:49,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-25 22:12:32,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2081166.0, ans=0.125 2023-06-25 22:12:47,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081226.0, ans=0.1 2023-06-25 22:12:53,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2081226.0, ans=0.125 2023-06-25 22:13:26,697 INFO [train.py:996] (3/4) Epoch 12, batch 11450, loss[loss=0.2109, simple_loss=0.2884, pruned_loss=0.0667, over 21468.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3023, pruned_loss=0.07542, over 4277461.58 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:13:46,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.691e+02 8.654e+02 1.291e+03 1.959e+03 4.523e+03, threshold=2.583e+03, percent-clipped=10.0 2023-06-25 22:13:48,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2081406.0, ans=0.125 2023-06-25 22:13:50,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-25 22:14:03,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2081406.0, ans=0.04949747468305833 2023-06-25 22:14:50,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2081526.0, ans=0.125 2023-06-25 22:15:09,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=22.5 2023-06-25 22:15:15,764 INFO [train.py:996] (3/4) Epoch 12, batch 11500, loss[loss=0.2156, simple_loss=0.3107, pruned_loss=0.06025, over 21865.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3059, pruned_loss=0.07711, over 4273821.26 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:15:37,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=15.0 2023-06-25 22:15:42,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2081706.0, ans=0.2 2023-06-25 22:15:43,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2081706.0, ans=0.125 2023-06-25 22:16:05,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=12.0 2023-06-25 22:17:10,025 INFO [train.py:996] (3/4) Epoch 12, batch 11550, loss[loss=0.2969, simple_loss=0.3965, pruned_loss=0.09862, over 21658.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3124, pruned_loss=0.07776, over 4277618.83 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:17:30,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.895e+02 1.016e+03 1.404e+03 2.173e+03 5.477e+03, threshold=2.808e+03, percent-clipped=17.0 2023-06-25 22:17:39,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082006.0, ans=0.1 2023-06-25 22:17:55,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 22:18:34,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-25 22:19:08,970 INFO [train.py:996] (3/4) Epoch 12, batch 11600, loss[loss=0.248, simple_loss=0.3352, pruned_loss=0.08038, over 21307.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3253, pruned_loss=0.0795, over 4271351.50 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:19:39,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2082306.0, ans=0.0 2023-06-25 22:19:55,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2082366.0, ans=0.0 2023-06-25 22:19:57,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2082366.0, ans=0.2 2023-06-25 22:20:32,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2082426.0, ans=0.0 2023-06-25 22:20:45,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-25 22:20:58,712 INFO [train.py:996] (3/4) Epoch 12, batch 11650, loss[loss=0.2398, simple_loss=0.3196, pruned_loss=0.07999, over 21646.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3334, pruned_loss=0.08104, over 4277894.92 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:21:17,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.996e+02 1.003e+03 1.550e+03 2.140e+03 4.406e+03, threshold=3.101e+03, percent-clipped=13.0 2023-06-25 22:21:29,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=15.0 2023-06-25 22:21:32,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2082606.0, ans=0.0 2023-06-25 22:21:59,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=15.0 2023-06-25 22:22:16,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-25 22:22:47,753 INFO [train.py:996] (3/4) Epoch 12, batch 11700, loss[loss=0.2319, simple_loss=0.2982, pruned_loss=0.08284, over 21750.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3267, pruned_loss=0.07945, over 4276634.37 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:23:07,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082846.0, ans=0.1 2023-06-25 22:23:14,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2082906.0, ans=0.2 2023-06-25 22:23:36,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=22.5 2023-06-25 22:23:47,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2082966.0, ans=0.125 2023-06-25 22:24:38,039 INFO [train.py:996] (3/4) Epoch 12, batch 11750, loss[loss=0.2275, simple_loss=0.3016, pruned_loss=0.07666, over 21494.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3161, pruned_loss=0.0788, over 4267465.92 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:24:57,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.375e+02 1.055e+03 1.810e+03 2.598e+03 5.314e+03, threshold=3.620e+03, percent-clipped=16.0 2023-06-25 22:25:06,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2083206.0, ans=0.125 2023-06-25 22:25:19,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2083206.0, ans=0.125 2023-06-25 22:26:16,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-25 22:26:25,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2083446.0, ans=0.125 2023-06-25 22:26:32,960 INFO [train.py:996] (3/4) Epoch 12, batch 11800, loss[loss=0.2049, simple_loss=0.3009, pruned_loss=0.05441, over 21622.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3169, pruned_loss=0.08003, over 4269518.71 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:27:15,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2083566.0, ans=0.2 2023-06-25 22:27:28,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. 
limit=15.0 2023-06-25 22:27:51,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2083686.0, ans=0.0 2023-06-25 22:27:53,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2083686.0, ans=0.125 2023-06-25 22:27:54,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2083686.0, ans=0.125 2023-06-25 22:28:11,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-25 22:28:23,935 INFO [train.py:996] (3/4) Epoch 12, batch 11850, loss[loss=0.215, simple_loss=0.2923, pruned_loss=0.06883, over 21840.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3165, pruned_loss=0.07973, over 4279547.29 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:28:37,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.904e+02 9.109e+02 1.402e+03 2.284e+03 4.807e+03, threshold=2.803e+03, percent-clipped=4.0 2023-06-25 22:29:17,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-25 22:29:28,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2083926.0, ans=0.125 2023-06-25 22:29:52,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-25 22:30:13,534 INFO [train.py:996] (3/4) Epoch 12, batch 11900, loss[loss=0.2196, simple_loss=0.3073, pruned_loss=0.066, over 21545.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3178, pruned_loss=0.07822, over 4275858.77 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:31:08,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2084166.0, ans=0.035 2023-06-25 22:31:46,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2084286.0, ans=0.2 2023-06-25 22:32:05,339 INFO [train.py:996] (3/4) Epoch 12, batch 11950, loss[loss=0.2315, simple_loss=0.3209, pruned_loss=0.07109, over 21806.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3186, pruned_loss=0.07579, over 4273785.40 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:32:24,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.025e+02 8.288e+02 1.263e+03 1.955e+03 4.768e+03, threshold=2.526e+03, percent-clipped=7.0 2023-06-25 22:32:32,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.42 vs. limit=12.0 2023-06-25 22:32:53,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2084466.0, ans=0.125 2023-06-25 22:33:35,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. 
limit=22.5 2023-06-25 22:33:51,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2084586.0, ans=0.125 2023-06-25 22:33:54,619 INFO [train.py:996] (3/4) Epoch 12, batch 12000, loss[loss=0.2128, simple_loss=0.2722, pruned_loss=0.07667, over 21240.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3113, pruned_loss=0.07314, over 4271068.27 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 22:33:54,619 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 22:34:17,924 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2583, simple_loss=0.3504, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-25 22:34:17,925 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-25 22:34:18,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084646.0, ans=0.1 2023-06-25 22:35:03,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2084766.0, ans=0.2 2023-06-25 22:35:20,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2084826.0, ans=0.025 2023-06-25 22:35:30,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2084826.0, ans=0.0 2023-06-25 22:35:59,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084886.0, ans=0.1 2023-06-25 22:36:03,678 INFO [train.py:996] (3/4) Epoch 12, batch 12050, loss[loss=0.2206, simple_loss=0.2847, pruned_loss=0.07826, over 21884.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3074, pruned_loss=0.07495, over 4283078.62 frames. ], batch size: 283, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:36:18,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.043e+02 1.290e+03 1.841e+03 5.321e+03, threshold=2.580e+03, percent-clipped=14.0 2023-06-25 22:36:19,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2085006.0, ans=0.125 2023-06-25 22:36:21,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2085006.0, ans=0.125 2023-06-25 22:36:40,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2085006.0, ans=0.125 2023-06-25 22:36:53,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2085066.0, ans=0.1 2023-06-25 22:36:53,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.42 vs. limit=6.0 2023-06-25 22:37:08,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2085126.0, ans=0.0 2023-06-25 22:37:47,007 INFO [train.py:996] (3/4) Epoch 12, batch 12100, loss[loss=0.2555, simple_loss=0.3291, pruned_loss=0.09098, over 21757.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3108, pruned_loss=0.07819, over 4286246.02 frames. 
], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:39:33,546 INFO [train.py:996] (3/4) Epoch 12, batch 12150, loss[loss=0.2341, simple_loss=0.3372, pruned_loss=0.0655, over 19706.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.314, pruned_loss=0.07831, over 4282299.59 frames. ], batch size: 704, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:40:00,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.100e+02 9.375e+02 1.398e+03 2.052e+03 3.851e+03, threshold=2.797e+03, percent-clipped=12.0 2023-06-25 22:40:47,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-25 22:41:07,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2085786.0, ans=0.0 2023-06-25 22:41:26,876 INFO [train.py:996] (3/4) Epoch 12, batch 12200, loss[loss=0.2005, simple_loss=0.2599, pruned_loss=0.07053, over 21537.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3109, pruned_loss=0.07615, over 4280464.19 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:41:31,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-25 22:41:53,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2085906.0, ans=0.0 2023-06-25 22:42:07,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2085966.0, ans=0.0 2023-06-25 22:42:24,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2085966.0, ans=0.0 2023-06-25 22:42:29,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2086026.0, ans=0.1 2023-06-25 22:42:38,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 22:42:43,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2086026.0, ans=0.2 2023-06-25 22:42:48,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2086086.0, ans=0.0 2023-06-25 22:42:57,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2086086.0, ans=0.1 2023-06-25 22:43:08,619 INFO [train.py:996] (3/4) Epoch 12, batch 12250, loss[loss=0.1411, simple_loss=0.2085, pruned_loss=0.03683, over 21768.00 frames. ], tot_loss[loss=0.226, simple_loss=0.304, pruned_loss=0.07405, over 4267224.19 frames. 
], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:43:13,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2086146.0, ans=0.035 2023-06-25 22:43:24,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 9.164e+02 1.305e+03 1.876e+03 4.319e+03, threshold=2.609e+03, percent-clipped=7.0 2023-06-25 22:43:33,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2086206.0, ans=0.2 2023-06-25 22:43:47,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-25 22:44:17,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2086326.0, ans=0.04949747468305833 2023-06-25 22:44:24,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2086326.0, ans=0.2 2023-06-25 22:44:36,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2086386.0, ans=0.1 2023-06-25 22:44:56,852 INFO [train.py:996] (3/4) Epoch 12, batch 12300, loss[loss=0.2661, simple_loss=0.3581, pruned_loss=0.08707, over 21711.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2963, pruned_loss=0.06838, over 4268507.07 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:45:00,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2086446.0, ans=0.125 2023-06-25 22:45:04,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=8.0 2023-06-25 22:45:37,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-25 22:45:52,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2086566.0, ans=0.125 2023-06-25 22:46:38,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2086746.0, ans=0.125 2023-06-25 22:46:39,944 INFO [train.py:996] (3/4) Epoch 12, batch 12350, loss[loss=0.2321, simple_loss=0.3112, pruned_loss=0.07655, over 21482.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3016, pruned_loss=0.06921, over 4265536.33 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:46:50,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2086746.0, ans=0.0 2023-06-25 22:46:57,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2086746.0, ans=0.05 2023-06-25 22:47:02,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.943e+02 9.460e+02 1.568e+03 2.170e+03 4.986e+03, threshold=3.136e+03, percent-clipped=17.0 2023-06-25 22:47:31,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. 
limit=10.0 2023-06-25 22:48:24,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2086986.0, ans=0.125 2023-06-25 22:48:33,538 INFO [train.py:996] (3/4) Epoch 12, batch 12400, loss[loss=0.2161, simple_loss=0.2828, pruned_loss=0.0747, over 21606.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3037, pruned_loss=0.07186, over 4269370.42 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:48:43,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-25 22:48:59,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2087106.0, ans=0.125 2023-06-25 22:49:26,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2087166.0, ans=0.125 2023-06-25 22:49:53,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2087286.0, ans=0.2 2023-06-25 22:50:21,706 INFO [train.py:996] (3/4) Epoch 12, batch 12450, loss[loss=0.2686, simple_loss=0.3376, pruned_loss=0.09982, over 21319.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3067, pruned_loss=0.07435, over 4275836.22 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:50:35,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2087346.0, ans=0.125 2023-06-25 22:50:37,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2087346.0, ans=0.025 2023-06-25 22:50:45,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.188e+02 9.401e+02 1.561e+03 2.264e+03 4.297e+03, threshold=3.121e+03, percent-clipped=10.0 2023-06-25 22:50:49,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-06-25 22:50:50,766 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:51:18,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2087466.0, ans=0.0 2023-06-25 22:51:20,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-06-25 22:51:55,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2087586.0, ans=0.125 2023-06-25 22:52:16,019 INFO [train.py:996] (3/4) Epoch 12, batch 12500, loss[loss=0.3328, simple_loss=0.4246, pruned_loss=0.1205, over 21405.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3174, pruned_loss=0.07778, over 4274609.09 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:52:38,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2087706.0, ans=0.0 2023-06-25 22:53:08,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=15.0 2023-06-25 22:53:35,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2087826.0, ans=0.0 2023-06-25 22:53:50,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2087886.0, ans=0.0 2023-06-25 22:53:56,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2087886.0, ans=0.0 2023-06-25 22:54:02,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-25 22:54:03,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2087886.0, ans=0.125 2023-06-25 22:54:06,670 INFO [train.py:996] (3/4) Epoch 12, batch 12550, loss[loss=0.2814, simple_loss=0.3476, pruned_loss=0.1076, over 21393.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3237, pruned_loss=0.08093, over 4275183.39 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:54:32,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.421e+02 9.265e+02 1.193e+03 1.894e+03 3.118e+03, threshold=2.386e+03, percent-clipped=0.0 2023-06-25 22:54:32,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2088006.0, ans=0.125 2023-06-25 22:54:33,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-25 22:54:55,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2088066.0, ans=0.125 2023-06-25 22:56:05,541 INFO [train.py:996] (3/4) Epoch 12, batch 12600, loss[loss=0.1514, simple_loss=0.2249, pruned_loss=0.03894, over 16811.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3225, pruned_loss=0.07834, over 4271424.63 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:56:52,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088366.0, ans=0.1 2023-06-25 22:57:21,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.80 vs. limit=10.0 2023-06-25 22:57:32,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2088486.0, ans=0.2 2023-06-25 22:57:51,788 INFO [train.py:996] (3/4) Epoch 12, batch 12650, loss[loss=0.2181, simple_loss=0.2859, pruned_loss=0.07517, over 21786.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.314, pruned_loss=0.07446, over 4272721.94 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:58:08,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 8.577e+02 1.170e+03 1.835e+03 4.573e+03, threshold=2.341e+03, percent-clipped=16.0 2023-06-25 22:58:51,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2088666.0, ans=0.0 2023-06-25 22:58:52,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-06-25 22:59:32,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2088786.0, ans=0.125 2023-06-25 22:59:37,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2088786.0, ans=0.125 2023-06-25 22:59:38,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2088786.0, ans=0.125 2023-06-25 22:59:39,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2088846.0, ans=0.0 2023-06-25 22:59:40,825 INFO [train.py:996] (3/4) Epoch 12, batch 12700, loss[loss=0.2481, simple_loss=0.3162, pruned_loss=0.09004, over 21770.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3128, pruned_loss=0.07685, over 4282060.62 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:00:01,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2088906.0, ans=0.2 2023-06-25 23:00:05,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2088906.0, ans=0.0 2023-06-25 23:00:20,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2088906.0, ans=0.125 2023-06-25 23:00:29,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-25 23:00:33,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2088966.0, ans=0.2 2023-06-25 23:01:22,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-25 23:01:24,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2089086.0, ans=0.07 2023-06-25 23:01:30,749 INFO [train.py:996] (3/4) Epoch 12, batch 12750, loss[loss=0.2268, simple_loss=0.3074, pruned_loss=0.07313, over 21806.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3135, pruned_loss=0.07741, over 4276442.90 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:01:51,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.146e+02 1.295e+03 2.167e+03 3.972e+03, threshold=2.590e+03, percent-clipped=20.0 2023-06-25 23:01:54,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-25 23:01:54,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-25 23:03:18,693 INFO [train.py:996] (3/4) Epoch 12, batch 12800, loss[loss=0.2214, simple_loss=0.2901, pruned_loss=0.07633, over 21601.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3117, pruned_loss=0.07841, over 4275038.35 frames. 
], batch size: 548, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:03:38,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2089446.0, ans=0.125 2023-06-25 23:03:55,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089506.0, ans=0.1 2023-06-25 23:04:00,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=2089506.0, ans=15.0 2023-06-25 23:04:13,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2089566.0, ans=0.0 2023-06-25 23:04:22,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089566.0, ans=0.1 2023-06-25 23:04:43,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089626.0, ans=0.1 2023-06-25 23:05:08,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 23:05:15,802 INFO [train.py:996] (3/4) Epoch 12, batch 12850, loss[loss=0.2339, simple_loss=0.3361, pruned_loss=0.0659, over 19910.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.313, pruned_loss=0.07974, over 4273570.26 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:05:41,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.120e+02 9.253e+02 1.228e+03 1.682e+03 4.174e+03, threshold=2.456e+03, percent-clipped=9.0 2023-06-25 23:05:53,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2089806.0, ans=0.025 2023-06-25 23:06:25,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2089866.0, ans=0.0 2023-06-25 23:06:41,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2089926.0, ans=0.0 2023-06-25 23:06:49,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2089986.0, ans=0.0 2023-06-25 23:07:15,408 INFO [train.py:996] (3/4) Epoch 12, batch 12900, loss[loss=0.2214, simple_loss=0.3172, pruned_loss=0.0628, over 20820.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3109, pruned_loss=0.07719, over 4273142.82 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:07:36,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2090106.0, ans=0.0 2023-06-25 23:08:12,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2090166.0, ans=0.2 2023-06-25 23:08:21,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-25 23:09:05,314 INFO [train.py:996] (3/4) Epoch 12, batch 12950, loss[loss=0.352, simple_loss=0.4, pruned_loss=0.152, over 21433.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3095, pruned_loss=0.07545, over 4275193.33 frames. 
], batch size: 509, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:09:24,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-06-25 23:09:30,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 9.455e+02 1.379e+03 2.078e+03 4.999e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 23:11:00,112 INFO [train.py:996] (3/4) Epoch 12, batch 13000, loss[loss=0.1878, simple_loss=0.2742, pruned_loss=0.05074, over 21691.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3109, pruned_loss=0.07568, over 4275491.56 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:11:09,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2090646.0, ans=0.1 2023-06-25 23:11:09,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-25 23:11:24,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2090706.0, ans=0.2 2023-06-25 23:11:28,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2090706.0, ans=0.2 2023-06-25 23:11:54,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2090766.0, ans=0.2 2023-06-25 23:12:21,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-25 23:12:39,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-25 23:12:41,958 INFO [train.py:996] (3/4) Epoch 12, batch 13050, loss[loss=0.2375, simple_loss=0.3052, pruned_loss=0.08493, over 21900.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3063, pruned_loss=0.07377, over 4274800.90 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:13:04,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=8.0 2023-06-25 23:13:06,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.723e+02 9.569e+02 1.242e+03 1.863e+03 3.264e+03, threshold=2.484e+03, percent-clipped=2.0 2023-06-25 23:13:06,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2091006.0, ans=0.125 2023-06-25 23:13:43,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2091066.0, ans=0.1 2023-06-25 23:13:43,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2091066.0, ans=0.07 2023-06-25 23:14:38,534 INFO [train.py:996] (3/4) Epoch 12, batch 13100, loss[loss=0.2172, simple_loss=0.3018, pruned_loss=0.06627, over 21320.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3074, pruned_loss=0.0738, over 4283006.48 frames. 
], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:14:40,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2091246.0, ans=0.0 2023-06-25 23:14:42,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 23:15:30,924 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:16:03,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-25 23:16:22,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2091486.0, ans=0.125 2023-06-25 23:16:31,061 INFO [train.py:996] (3/4) Epoch 12, batch 13150, loss[loss=0.2318, simple_loss=0.3009, pruned_loss=0.08139, over 21753.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3105, pruned_loss=0.07674, over 4285457.92 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:16:32,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-25 23:16:42,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2091546.0, ans=0.125 2023-06-25 23:16:55,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 8.745e+02 1.410e+03 2.124e+03 5.219e+03, threshold=2.820e+03, percent-clipped=16.0 2023-06-25 23:17:08,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2091606.0, ans=0.125 2023-06-25 23:17:59,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=2091726.0, ans=0.02 2023-06-25 23:18:01,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2091726.0, ans=0.0 2023-06-25 23:18:04,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2091786.0, ans=0.125 2023-06-25 23:18:28,201 INFO [train.py:996] (3/4) Epoch 12, batch 13200, loss[loss=0.2319, simple_loss=0.3083, pruned_loss=0.07779, over 21454.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3077, pruned_loss=0.07553, over 4273325.96 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:20:16,879 INFO [train.py:996] (3/4) Epoch 12, batch 13250, loss[loss=0.2143, simple_loss=0.2902, pruned_loss=0.06927, over 21280.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.07823, over 4274872.40 frames. 
], batch size: 176, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:20:17,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2092146.0, ans=0.0 2023-06-25 23:20:27,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2092146.0, ans=0.0 2023-06-25 23:20:31,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2092146.0, ans=0.0 2023-06-25 23:20:39,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-25 23:20:40,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2092206.0, ans=0.0 2023-06-25 23:20:43,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2092206.0, ans=0.07 2023-06-25 23:20:45,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.712e+02 9.106e+02 1.656e+03 2.561e+03 5.361e+03, threshold=3.313e+03, percent-clipped=20.0 2023-06-25 23:21:21,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2092266.0, ans=0.125 2023-06-25 23:21:23,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2092266.0, ans=0.125 2023-06-25 23:22:12,972 INFO [train.py:996] (3/4) Epoch 12, batch 13300, loss[loss=0.2731, simple_loss=0.3437, pruned_loss=0.1012, over 21414.00 frames. ], tot_loss[loss=0.235, simple_loss=0.315, pruned_loss=0.07747, over 4274512.30 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:24:01,781 INFO [train.py:996] (3/4) Epoch 12, batch 13350, loss[loss=0.2873, simple_loss=0.3617, pruned_loss=0.1065, over 21450.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3204, pruned_loss=0.08066, over 4275275.03 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:24:11,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2092746.0, ans=0.125 2023-06-25 23:24:37,304 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.381e+02 8.931e+02 1.432e+03 2.029e+03 4.000e+03, threshold=2.864e+03, percent-clipped=9.0 2023-06-25 23:25:31,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2092926.0, ans=0.125 2023-06-25 23:25:51,901 INFO [train.py:996] (3/4) Epoch 12, batch 13400, loss[loss=0.2357, simple_loss=0.3066, pruned_loss=0.08237, over 21833.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3203, pruned_loss=0.0821, over 4273618.91 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:25:53,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=15.0 2023-06-25 23:25:57,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2093046.0, ans=0.125 2023-06-25 23:26:17,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093046.0, ans=0.1 2023-06-25 23:26:38,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2093166.0, ans=0.2 2023-06-25 23:26:54,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=15.0 2023-06-25 23:27:24,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2093286.0, ans=0.125 2023-06-25 23:27:42,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2093346.0, ans=0.0 2023-06-25 23:27:50,023 INFO [train.py:996] (3/4) Epoch 12, batch 13450, loss[loss=0.2259, simple_loss=0.2909, pruned_loss=0.08048, over 21410.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3207, pruned_loss=0.08355, over 4268922.69 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:28:18,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+02 9.284e+02 1.216e+03 1.797e+03 3.595e+03, threshold=2.431e+03, percent-clipped=4.0 2023-06-25 23:28:27,780 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:28:55,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2093466.0, ans=0.125 2023-06-25 23:29:17,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-25 23:29:47,967 INFO [train.py:996] (3/4) Epoch 12, batch 13500, loss[loss=0.2497, simple_loss=0.3324, pruned_loss=0.08345, over 21349.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.311, pruned_loss=0.07994, over 4264653.81 frames. ], batch size: 549, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:30:43,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2093766.0, ans=0.125 2023-06-25 23:30:59,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-25 23:31:09,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2093826.0, ans=0.125 2023-06-25 23:31:38,001 INFO [train.py:996] (3/4) Epoch 12, batch 13550, loss[loss=0.2579, simple_loss=0.3614, pruned_loss=0.07722, over 21783.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3156, pruned_loss=0.07965, over 4268014.25 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:32:07,685 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.160e+02 9.861e+02 1.411e+03 2.332e+03 4.219e+03, threshold=2.822e+03, percent-clipped=19.0 2023-06-25 23:32:11,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2094006.0, ans=0.1 2023-06-25 23:32:13,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2094006.0, ans=0.2 2023-06-25 23:32:32,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2094066.0, ans=0.2 2023-06-25 23:33:11,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2094186.0, ans=0.125 2023-06-25 23:33:21,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2094186.0, ans=0.2 2023-06-25 23:33:23,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2094186.0, ans=0.015 2023-06-25 23:33:29,537 INFO [train.py:996] (3/4) Epoch 12, batch 13600, loss[loss=0.2317, simple_loss=0.2992, pruned_loss=0.08213, over 21363.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3168, pruned_loss=0.07963, over 4275838.37 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:35:12,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-25 23:35:20,521 INFO [train.py:996] (3/4) Epoch 12, batch 13650, loss[loss=0.2758, simple_loss=0.3179, pruned_loss=0.1168, over 21398.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3108, pruned_loss=0.07634, over 4278228.77 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:35:22,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2094546.0, ans=0.0 2023-06-25 23:35:50,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 6.699e+02 9.977e+02 1.659e+03 4.040e+03, threshold=1.995e+03, percent-clipped=8.0 2023-06-25 23:36:56,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-25 23:37:04,033 INFO [train.py:996] (3/4) Epoch 12, batch 13700, loss[loss=0.1787, simple_loss=0.2402, pruned_loss=0.05857, over 21215.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3054, pruned_loss=0.07628, over 4271691.63 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:38:09,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-25 23:38:25,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-25 23:39:04,363 INFO [train.py:996] (3/4) Epoch 12, batch 13750, loss[loss=0.1718, simple_loss=0.2338, pruned_loss=0.0549, over 21788.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3028, pruned_loss=0.07598, over 4271132.11 frames. 
], batch size: 118, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:39:06,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2095146.0, ans=0.1 2023-06-25 23:39:15,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2095146.0, ans=0.125 2023-06-25 23:39:33,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.606e+02 1.009e+03 1.585e+03 2.865e+03 5.412e+03, threshold=3.169e+03, percent-clipped=34.0 2023-06-25 23:39:35,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2095206.0, ans=0.125 2023-06-25 23:40:54,879 INFO [train.py:996] (3/4) Epoch 12, batch 13800, loss[loss=0.2524, simple_loss=0.3766, pruned_loss=0.06403, over 21245.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3094, pruned_loss=0.0754, over 4263697.45 frames. ], batch size: 549, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:41:11,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2095446.0, ans=0.0 2023-06-25 23:41:14,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2095446.0, ans=0.1 2023-06-25 23:41:47,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2095566.0, ans=0.2 2023-06-25 23:42:05,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 23:42:11,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-25 23:42:40,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-25 23:42:43,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2095686.0, ans=0.125 2023-06-25 23:42:52,740 INFO [train.py:996] (3/4) Epoch 12, batch 13850, loss[loss=0.229, simple_loss=0.3217, pruned_loss=0.0682, over 21717.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3178, pruned_loss=0.07679, over 4264163.03 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:43:21,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.59 vs. 
limit=12.0 2023-06-25 23:43:27,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2095806.0, ans=0.2 2023-06-25 23:43:28,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.539e+02 1.083e+03 1.524e+03 2.080e+03 5.261e+03, threshold=3.047e+03, percent-clipped=9.0 2023-06-25 23:43:38,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2095866.0, ans=0.0 2023-06-25 23:43:47,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2095866.0, ans=0.09899494936611666 2023-06-25 23:43:48,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2095866.0, ans=0.125 2023-06-25 23:44:49,824 INFO [train.py:996] (3/4) Epoch 12, batch 13900, loss[loss=0.233, simple_loss=0.3044, pruned_loss=0.08084, over 21335.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3223, pruned_loss=0.08106, over 4265163.94 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:44:55,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2096046.0, ans=0.09899494936611666 2023-06-25 23:46:38,584 INFO [train.py:996] (3/4) Epoch 12, batch 13950, loss[loss=0.2513, simple_loss=0.3253, pruned_loss=0.08864, over 21862.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3216, pruned_loss=0.08295, over 4277777.36 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:47:08,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.165e+02 9.219e+02 1.176e+03 2.067e+03 4.872e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-25 23:47:08,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2096406.0, ans=0.1 2023-06-25 23:47:15,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.54 vs. limit=12.0 2023-06-25 23:48:04,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2096586.0, ans=0.05 2023-06-25 23:48:08,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2096586.0, ans=0.0 2023-06-25 23:48:24,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2096646.0, ans=0.0 2023-06-25 23:48:25,175 INFO [train.py:996] (3/4) Epoch 12, batch 14000, loss[loss=0.2083, simple_loss=0.3078, pruned_loss=0.0544, over 21585.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3169, pruned_loss=0.08013, over 4283633.45 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:48:26,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=12.0 2023-06-25 23:48:48,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-25 23:49:07,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=15.0 2023-06-25 23:49:34,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-25 23:49:45,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2096826.0, ans=0.04949747468305833 2023-06-25 23:50:00,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2096886.0, ans=0.125 2023-06-25 23:50:07,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2096886.0, ans=0.2 2023-06-25 23:50:09,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2096886.0, ans=0.0 2023-06-25 23:50:12,343 INFO [train.py:996] (3/4) Epoch 12, batch 14050, loss[loss=0.2157, simple_loss=0.2758, pruned_loss=0.07784, over 21299.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3104, pruned_loss=0.07603, over 4276686.98 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:50:29,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2096946.0, ans=0.125 2023-06-25 23:50:41,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 8.595e+02 1.187e+03 1.606e+03 3.647e+03, threshold=2.374e+03, percent-clipped=9.0 2023-06-25 23:50:41,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2097006.0, ans=0.0 2023-06-25 23:50:57,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 23:51:21,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2097126.0, ans=0.0 2023-06-25 23:51:34,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2097186.0, ans=0.0 2023-06-25 23:51:34,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2097186.0, ans=0.1 2023-06-25 23:52:05,838 INFO [train.py:996] (3/4) Epoch 12, batch 14100, loss[loss=0.2204, simple_loss=0.288, pruned_loss=0.07643, over 21562.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3031, pruned_loss=0.07582, over 4267535.68 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:52:06,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2097246.0, ans=0.125 2023-06-25 23:52:11,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2097246.0, ans=0.2 2023-06-25 23:52:19,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2097246.0, ans=0.125 2023-06-25 23:52:24,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2097306.0, ans=0.0 2023-06-25 23:52:57,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:52:58,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:13,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:21,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:28,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2097486.0, ans=0.2 2023-06-25 23:53:39,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-25 23:53:40,196 INFO [train.py:996] (3/4) Epoch 12, batch 14150, loss[loss=0.2271, simple_loss=0.3206, pruned_loss=0.06686, over 21780.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3087, pruned_loss=0.07813, over 4275056.83 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:54:17,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.654e+02 9.098e+02 1.301e+03 1.898e+03 3.994e+03, threshold=2.602e+03, percent-clipped=15.0 2023-06-25 23:54:49,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2097726.0, ans=0.125 2023-06-25 23:55:03,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2097786.0, ans=0.0 2023-06-25 23:55:22,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2097786.0, ans=0.125 2023-06-25 23:55:25,894 INFO [train.py:996] (3/4) Epoch 12, batch 14200, loss[loss=0.2268, simple_loss=0.2998, pruned_loss=0.07691, over 21356.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3082, pruned_loss=0.07695, over 4280739.12 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:55:49,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.42 vs. 
limit=12.0 2023-06-25 23:56:01,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:56:03,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2097906.0, ans=0.125 2023-06-25 23:57:09,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2098146.0, ans=0.125 2023-06-25 23:57:10,450 INFO [train.py:996] (3/4) Epoch 12, batch 14250, loss[loss=0.2174, simple_loss=0.2611, pruned_loss=0.08687, over 20838.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.303, pruned_loss=0.07658, over 4267416.51 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:57:19,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2098146.0, ans=0.125 2023-06-25 23:57:19,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2098146.0, ans=0.125 2023-06-25 23:57:23,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-25 23:57:39,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-25 23:57:53,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.950e+02 7.589e+02 1.025e+03 1.608e+03 3.154e+03, threshold=2.050e+03, percent-clipped=3.0 2023-06-25 23:57:56,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2098206.0, ans=0.0 2023-06-25 23:58:58,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2098386.0, ans=0.125 2023-06-25 23:59:01,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2098386.0, ans=0.0 2023-06-25 23:59:02,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-06-25 23:59:05,017 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:59:06,511 INFO [train.py:996] (3/4) Epoch 12, batch 14300, loss[loss=0.3248, simple_loss=0.4103, pruned_loss=0.1196, over 21654.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3041, pruned_loss=0.07501, over 4264311.35 frames. 
], batch size: 414, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:59:43,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2098506.0, ans=0.2 2023-06-25 23:59:50,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2098506.0, ans=0.0 2023-06-25 23:59:50,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2098506.0, ans=0.125 2023-06-25 23:59:55,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2098566.0, ans=0.04949747468305833 2023-06-26 00:00:41,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098686.0, ans=0.1 2023-06-26 00:00:55,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-26 00:00:57,823 INFO [train.py:996] (3/4) Epoch 12, batch 14350, loss[loss=0.1889, simple_loss=0.2592, pruned_loss=0.05926, over 21303.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3119, pruned_loss=0.07641, over 4258502.67 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:01:13,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-26 00:01:35,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.907e+02 9.524e+02 1.577e+03 2.595e+03 6.111e+03, threshold=3.154e+03, percent-clipped=35.0 2023-06-26 00:01:45,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2098866.0, ans=0.125 2023-06-26 00:02:37,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2098986.0, ans=0.125 2023-06-26 00:02:40,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2098986.0, ans=0.125 2023-06-26 00:02:51,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2099046.0, ans=0.2 2023-06-26 00:02:53,108 INFO [train.py:996] (3/4) Epoch 12, batch 14400, loss[loss=0.2102, simple_loss=0.2697, pruned_loss=0.07539, over 21859.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3089, pruned_loss=0.07693, over 4266940.47 frames. ], batch size: 98, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:03:10,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2099046.0, ans=0.0 2023-06-26 00:03:12,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-26 00:03:20,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2099106.0, ans=0.125 2023-06-26 00:03:29,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. 
limit=15.0 2023-06-26 00:04:38,671 INFO [train.py:996] (3/4) Epoch 12, batch 14450, loss[loss=0.2343, simple_loss=0.3095, pruned_loss=0.07952, over 21868.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3028, pruned_loss=0.07681, over 4268056.28 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:04:39,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099346.0, ans=0.1 2023-06-26 00:05:07,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2099406.0, ans=0.07 2023-06-26 00:05:09,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2099406.0, ans=0.125 2023-06-26 00:05:10,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 8.096e+02 1.121e+03 1.806e+03 4.464e+03, threshold=2.243e+03, percent-clipped=9.0 2023-06-26 00:05:31,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2099466.0, ans=0.0 2023-06-26 00:05:58,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2099586.0, ans=0.0 2023-06-26 00:06:30,334 INFO [train.py:996] (3/4) Epoch 12, batch 14500, loss[loss=0.2492, simple_loss=0.3297, pruned_loss=0.08429, over 21593.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2985, pruned_loss=0.07621, over 4269552.56 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:06:58,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-26 00:07:45,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2099826.0, ans=0.125 2023-06-26 00:08:26,778 INFO [train.py:996] (3/4) Epoch 12, batch 14550, loss[loss=0.2388, simple_loss=0.3146, pruned_loss=0.08154, over 21420.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3033, pruned_loss=0.07778, over 4267771.74 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:08:40,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2099946.0, ans=0.0 2023-06-26 00:08:52,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.442e+02 9.003e+02 1.476e+03 2.430e+03 4.476e+03, threshold=2.953e+03, percent-clipped=26.0 2023-06-26 00:08:53,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2100006.0, ans=0.125 2023-06-26 00:09:05,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2100066.0, ans=0.0 2023-06-26 00:09:16,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2100066.0, ans=0.0 2023-06-26 00:09:27,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2100126.0, ans=0.1 2023-06-26 00:09:41,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. 
limit=15.0 2023-06-26 00:09:56,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2100186.0, ans=0.125 2023-06-26 00:10:17,526 INFO [train.py:996] (3/4) Epoch 12, batch 14600, loss[loss=0.2551, simple_loss=0.3338, pruned_loss=0.08822, over 21433.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3108, pruned_loss=0.08171, over 4266349.06 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:10:30,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2100246.0, ans=0.125 2023-06-26 00:10:36,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2100306.0, ans=0.125 2023-06-26 00:10:45,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2100306.0, ans=0.0 2023-06-26 00:10:57,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2100366.0, ans=0.07 2023-06-26 00:11:16,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2100426.0, ans=0.0 2023-06-26 00:11:31,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2100426.0, ans=0.125 2023-06-26 00:11:32,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-26 00:12:06,922 INFO [train.py:996] (3/4) Epoch 12, batch 14650, loss[loss=0.1974, simple_loss=0.2908, pruned_loss=0.05202, over 21615.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3144, pruned_loss=0.08107, over 4267247.12 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:12:32,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.972e+02 9.038e+02 1.259e+03 1.802e+03 4.365e+03, threshold=2.519e+03, percent-clipped=6.0 2023-06-26 00:12:43,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2100666.0, ans=0.1 2023-06-26 00:13:22,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2100726.0, ans=0.0 2023-06-26 00:13:25,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2100786.0, ans=0.125 2023-06-26 00:13:57,508 INFO [train.py:996] (3/4) Epoch 12, batch 14700, loss[loss=0.161, simple_loss=0.2389, pruned_loss=0.04156, over 21333.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3075, pruned_loss=0.07457, over 4262939.78 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:14:19,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.30 vs. limit=10.0 2023-06-26 00:15:49,398 INFO [train.py:996] (3/4) Epoch 12, batch 14750, loss[loss=0.3809, simple_loss=0.4476, pruned_loss=0.1571, over 21418.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3133, pruned_loss=0.07677, over 4265183.07 frames. 
], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:16:07,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2101206.0, ans=0.125 2023-06-26 00:16:07,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2101206.0, ans=0.0 2023-06-26 00:16:21,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.018e+02 8.627e+02 1.543e+03 2.178e+03 4.695e+03, threshold=3.085e+03, percent-clipped=15.0 2023-06-26 00:16:29,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2101266.0, ans=0.125 2023-06-26 00:16:56,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-26 00:16:59,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2101326.0, ans=0.125 2023-06-26 00:17:17,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2101326.0, ans=0.125 2023-06-26 00:17:39,995 INFO [train.py:996] (3/4) Epoch 12, batch 14800, loss[loss=0.2325, simple_loss=0.3038, pruned_loss=0.08064, over 21582.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3279, pruned_loss=0.08374, over 4268481.21 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 32.0 2023-06-26 00:17:55,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2101506.0, ans=0.125 2023-06-26 00:18:05,247 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:19:06,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=12.0 2023-06-26 00:19:32,043 INFO [train.py:996] (3/4) Epoch 12, batch 14850, loss[loss=0.2613, simple_loss=0.3439, pruned_loss=0.08936, over 21570.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.322, pruned_loss=0.08324, over 4262447.24 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:19:33,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-26 00:19:38,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-26 00:19:58,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2101806.0, ans=0.125 2023-06-26 00:20:16,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.961e+02 9.941e+02 1.371e+03 2.279e+03 6.206e+03, threshold=2.743e+03, percent-clipped=9.0 2023-06-26 00:20:51,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2101926.0, ans=0.0 2023-06-26 00:20:56,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101926.0, ans=0.1 2023-06-26 00:20:57,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2101926.0, ans=0.0 2023-06-26 00:21:01,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2101926.0, ans=0.2 2023-06-26 00:21:20,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2101986.0, ans=0.125 2023-06-26 00:21:32,258 INFO [train.py:996] (3/4) Epoch 12, batch 14900, loss[loss=0.2635, simple_loss=0.3296, pruned_loss=0.09871, over 21365.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3235, pruned_loss=0.08378, over 4265047.88 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:22:11,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2102106.0, ans=0.125 2023-06-26 00:22:27,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2102166.0, ans=0.125 2023-06-26 00:22:53,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2102286.0, ans=0.125 2023-06-26 00:23:07,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.28 vs. limit=10.0 2023-06-26 00:23:28,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-26 00:23:29,195 INFO [train.py:996] (3/4) Epoch 12, batch 14950, loss[loss=0.2868, simple_loss=0.3624, pruned_loss=0.1056, over 21386.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3221, pruned_loss=0.08287, over 4268721.83 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:23:31,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2102346.0, ans=0.125 2023-06-26 00:24:02,299 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.432e+02 8.420e+02 1.194e+03 1.609e+03 3.792e+03, threshold=2.388e+03, percent-clipped=5.0 2023-06-26 00:25:18,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-26 00:25:18,739 INFO [train.py:996] (3/4) Epoch 12, batch 15000, loss[loss=0.2758, simple_loss=0.3439, pruned_loss=0.1039, over 21502.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3247, pruned_loss=0.0846, over 4271629.45 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0
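The [optim.py:471] entries above report the min/25%/median/75%/max of recent per-batch gradient norms, and in every entry the logged threshold equals Clipping_scale times the median (for example 2.0 x 1.194e+03 = 2.388e+03 just above), with percent-clipped apparently counting how often recent batches exceeded that threshold. A minimal sketch of that bookkeeping follows; the class name, the window size, and the exact update rule are assumptions for illustration, not the actual icefall optim.py code.

```python
import torch
from typing import List


class GradNormClipper:
    """Clip gradients to clipping_scale * median of recent gradient norms.

    A simplified sketch of the bookkeeping suggested by the
    '[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ...' log
    entries; the class name, the window size, and the exact update rule
    are assumptions, not the actual icefall optim.py implementation.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms: List[float] = []  # recent total grad norms
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, parameters) -> str:
        params = [p for p in parameters if p.grad is not None]
        # Total gradient norm over all parameters for this batch.
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms = (self.norms + [norm])[-self.window:]
        self.num_steps += 1

        qs = torch.quantile(
            torch.tensor(self.norms), torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
        )
        threshold = self.clipping_scale * qs[2].item()  # 2.0 * median
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)

        quartiles = " ".join(f"{q:.3e}" for q in qs.tolist())
        pct = 100.0 * self.num_clipped / self.num_steps
        return (
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            f"{quartiles}, threshold={threshold:.3e}, percent-clipped={pct:.1f}"
        )
```

Tying the threshold to a running median rather than a fixed constant lets the clip level track the natural scale of the gradients as training progresses, which matches how the logged thresholds drift between roughly 2.0e+03 and 3.8e+03 across this section.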
2023-06-26 00:25:18,740 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 00:25:28,410 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6467, 3.6519, 2.1017, 1.5956], device='cuda:3') 2023-06-26 00:25:42,266 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2582, simple_loss=0.348, pruned_loss=0.08425, over 1796401.00 frames. 2023-06-26 00:25:42,267 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-26 00:25:42,822 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:26:16,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2102766.0, ans=0.2 2023-06-26 00:27:29,927 INFO [train.py:996] (3/4) Epoch 12, batch 15050, loss[loss=0.2156, simple_loss=0.3016, pruned_loss=0.06476, over 21658.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3252, pruned_loss=0.08519, over 4272577.30 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:27:39,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2102946.0, ans=0.125 2023-06-26 00:27:49,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-26 00:27:50,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2103006.0, ans=0.0 2023-06-26 00:27:54,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2103006.0, ans=0.125 2023-06-26 00:27:56,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.042e+02 9.488e+02 1.374e+03 1.981e+03 5.080e+03, threshold=2.749e+03, percent-clipped=16.0 2023-06-26 00:28:52,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103186.0, ans=0.1 2023-06-26 00:29:10,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2103186.0, ans=0.1 2023-06-26 00:29:16,837 INFO [train.py:996] (3/4) Epoch 12, batch 15100, loss[loss=0.2618, simple_loss=0.3423, pruned_loss=0.09061, over 21594.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3271, pruned_loss=0.08414, over 4273436.65 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:29:41,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2103306.0, ans=0.0 2023-06-26 00:30:27,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2103426.0, ans=0.0 2023-06-26 00:31:07,971 INFO [train.py:996] (3/4) Epoch 12, batch 15150, loss[loss=0.2589, simple_loss=0.3137, pruned_loss=0.102, over 21256.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3215, pruned_loss=0.08413, over 4276308.61 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:31:12,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs.
limit=6.0 2023-06-26 00:31:44,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 7.677e+02 1.025e+03 1.452e+03 2.770e+03, threshold=2.050e+03, percent-clipped=1.0 2023-06-26 00:31:55,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2103666.0, ans=0.035 2023-06-26 00:32:34,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2103726.0, ans=0.05 2023-06-26 00:32:57,457 INFO [train.py:996] (3/4) Epoch 12, batch 15200, loss[loss=0.2203, simple_loss=0.3189, pruned_loss=0.06084, over 21549.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3114, pruned_loss=0.08021, over 4264376.99 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:33:09,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2103846.0, ans=0.125 2023-06-26 00:34:12,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2104026.0, ans=0.2 2023-06-26 00:34:13,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-26 00:34:45,756 INFO [train.py:996] (3/4) Epoch 12, batch 15250, loss[loss=0.2303, simple_loss=0.2909, pruned_loss=0.08483, over 21665.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3046, pruned_loss=0.07811, over 4262546.34 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:34:52,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2104146.0, ans=0.0 2023-06-26 00:35:13,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2104206.0, ans=0.2 2023-06-26 00:35:33,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 8.853e+02 1.364e+03 1.967e+03 5.293e+03, threshold=2.727e+03, percent-clipped=20.0 2023-06-26 00:36:07,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2104326.0, ans=0.125 2023-06-26 00:36:35,454 INFO [train.py:996] (3/4) Epoch 12, batch 15300, loss[loss=0.2764, simple_loss=0.3413, pruned_loss=0.1058, over 21795.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3064, pruned_loss=0.08062, over 4269954.42 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:37:43,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-26 00:37:56,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104626.0, ans=0.1 2023-06-26 00:38:08,652 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.598e-03 2023-06-26 00:38:09,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-26 00:38:22,855 INFO [train.py:996] (3/4) Epoch 12, batch 15350, loss[loss=0.2419, simple_loss=0.3463, pruned_loss=0.0687, over 21621.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3134, pruned_loss=0.08314, over 4273132.50 frames. 
], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:38:56,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-26 00:38:58,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2104806.0, ans=0.125 2023-06-26 00:39:09,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2104806.0, ans=0.025 2023-06-26 00:39:10,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.965e+02 8.644e+02 1.167e+03 1.665e+03 4.882e+03, threshold=2.334e+03, percent-clipped=9.0 2023-06-26 00:39:17,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2104866.0, ans=0.0 2023-06-26 00:39:35,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-26 00:39:42,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2104926.0, ans=0.0 2023-06-26 00:39:47,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-26 00:39:51,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2104986.0, ans=0.0 2023-06-26 00:40:09,310 INFO [train.py:996] (3/4) Epoch 12, batch 15400, loss[loss=0.2106, simple_loss=0.2945, pruned_loss=0.06337, over 21863.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3123, pruned_loss=0.08092, over 4261457.15 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:40:41,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2105106.0, ans=0.05 2023-06-26 00:41:12,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2105166.0, ans=0.125 2023-06-26 00:41:52,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2105286.0, ans=0.125 2023-06-26 00:41:58,326 INFO [train.py:996] (3/4) Epoch 12, batch 15450, loss[loss=0.216, simple_loss=0.286, pruned_loss=0.07297, over 21473.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3106, pruned_loss=0.08061, over 4254564.50 frames. 
], batch size: 131, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:42:02,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2105346.0, ans=0.0 2023-06-26 00:42:05,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2105346.0, ans=0.0 2023-06-26 00:42:22,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2105406.0, ans=0.0 2023-06-26 00:42:40,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.775e+02 1.085e+03 1.677e+03 3.243e+03, threshold=2.170e+03, percent-clipped=5.0 2023-06-26 00:42:42,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2105466.0, ans=0.125 2023-06-26 00:42:42,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2105466.0, ans=0.125 2023-06-26 00:42:50,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-26 00:43:36,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-26 00:43:40,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2105586.0, ans=0.125 2023-06-26 00:43:46,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2105646.0, ans=0.0 2023-06-26 00:43:47,760 INFO [train.py:996] (3/4) Epoch 12, batch 15500, loss[loss=0.2769, simple_loss=0.3536, pruned_loss=0.1001, over 21745.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3133, pruned_loss=0.08036, over 4255301.33 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:44:49,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-26 00:45:13,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2105826.0, ans=0.125 2023-06-26 00:45:23,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2105886.0, ans=0.125 2023-06-26 00:45:37,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2105886.0, ans=0.125 2023-06-26 00:45:43,178 INFO [train.py:996] (3/4) Epoch 12, batch 15550, loss[loss=0.2002, simple_loss=0.2876, pruned_loss=0.05639, over 21819.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3139, pruned_loss=0.07842, over 4256016.68 frames. 
], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:46:02,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2105946.0, ans=0.125 2023-06-26 00:46:21,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2106006.0, ans=0.125 2023-06-26 00:46:34,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.052e+03 1.309e+03 1.870e+03 4.327e+03, threshold=2.618e+03, percent-clipped=17.0 2023-06-26 00:46:58,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2106126.0, ans=0.125 2023-06-26 00:47:02,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2106126.0, ans=0.0 2023-06-26 00:47:41,685 INFO [train.py:996] (3/4) Epoch 12, batch 15600, loss[loss=0.2243, simple_loss=0.3165, pruned_loss=0.06604, over 21505.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3072, pruned_loss=0.07705, over 4253398.13 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:48:05,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2106306.0, ans=0.1 2023-06-26 00:48:09,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-26 00:48:37,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2106366.0, ans=0.0 2023-06-26 00:48:49,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-26 00:49:20,585 INFO [train.py:996] (3/4) Epoch 12, batch 15650, loss[loss=0.2457, simple_loss=0.3012, pruned_loss=0.0951, over 21275.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3065, pruned_loss=0.07666, over 4252553.35 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:50:05,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2106606.0, ans=0.2 2023-06-26 00:50:07,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.319e+02 9.009e+02 1.257e+03 1.963e+03 4.655e+03, threshold=2.515e+03, percent-clipped=11.0 2023-06-26 00:50:09,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2106666.0, ans=0.0 2023-06-26 00:50:33,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2106726.0, ans=0.125 2023-06-26 00:51:12,647 INFO [train.py:996] (3/4) Epoch 12, batch 15700, loss[loss=0.2184, simple_loss=0.2828, pruned_loss=0.07697, over 21249.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3016, pruned_loss=0.07534, over 4258485.70 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:51:47,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-26 00:52:13,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2106966.0, ans=0.125 2023-06-26 00:52:23,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:52:55,622 INFO [train.py:996] (3/4) Epoch 12, batch 15750, loss[loss=0.227, simple_loss=0.2967, pruned_loss=0.07868, over 21885.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2989, pruned_loss=0.07594, over 4262657.51 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:53:33,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2107206.0, ans=0.0 2023-06-26 00:53:38,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.389e+02 8.722e+02 1.383e+03 2.030e+03 4.451e+03, threshold=2.767e+03, percent-clipped=16.0 2023-06-26 00:54:15,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2107386.0, ans=0.04949747468305833 2023-06-26 00:54:38,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2107386.0, ans=0.2 2023-06-26 00:54:41,585 INFO [train.py:996] (3/4) Epoch 12, batch 15800, loss[loss=0.2283, simple_loss=0.2829, pruned_loss=0.08685, over 21316.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2949, pruned_loss=0.07608, over 4252400.32 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:54:47,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.07 vs. limit=15.0 2023-06-26 00:55:27,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2107566.0, ans=0.0 2023-06-26 00:55:27,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2107566.0, ans=0.125 2023-06-26 00:55:43,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2107626.0, ans=0.2 2023-06-26 00:55:44,131 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=22.5 2023-06-26 00:55:45,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-26 00:56:30,866 INFO [train.py:996] (3/4) Epoch 12, batch 15850, loss[loss=0.2751, simple_loss=0.3372, pruned_loss=0.1065, over 21367.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.297, pruned_loss=0.07798, over 4260771.82 frames. 
], batch size: 471, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:56:55,495 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:56:57,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2107806.0, ans=0.125 2023-06-26 00:57:10,174 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.897e+02 9.764e+02 1.470e+03 2.269e+03 4.632e+03, threshold=2.939e+03, percent-clipped=9.0 2023-06-26 00:57:10,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2107866.0, ans=0.0 2023-06-26 00:57:48,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2107986.0, ans=0.125 2023-06-26 00:57:48,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2107986.0, ans=0.125 2023-06-26 00:58:03,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2107986.0, ans=0.1 2023-06-26 00:58:14,209 INFO [train.py:996] (3/4) Epoch 12, batch 15900, loss[loss=0.2115, simple_loss=0.2952, pruned_loss=0.06391, over 21477.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2936, pruned_loss=0.07755, over 4261292.21 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:58:16,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-26 00:58:18,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2108046.0, ans=0.2 2023-06-26 00:58:55,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2108106.0, ans=0.125 2023-06-26 00:59:17,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2108226.0, ans=0.125 2023-06-26 00:59:22,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2108226.0, ans=0.2 2023-06-26 00:59:24,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2108226.0, ans=0.125 2023-06-26 00:59:32,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2108226.0, ans=0.125 2023-06-26 00:59:40,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-26 00:59:56,208 INFO [train.py:996] (3/4) Epoch 12, batch 15950, loss[loss=0.2135, simple_loss=0.3105, pruned_loss=0.05828, over 21673.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2967, pruned_loss=0.07575, over 4244315.57 frames. 
], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:00:21,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2108406.0, ans=0.125 2023-06-26 01:00:41,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.587e+02 1.024e+03 1.341e+03 2.731e+03, threshold=2.049e+03, percent-clipped=0.0 2023-06-26 01:01:21,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2108586.0, ans=0.125 2023-06-26 01:01:32,765 INFO [train.py:996] (3/4) Epoch 12, batch 16000, loss[loss=0.2171, simple_loss=0.3127, pruned_loss=0.06078, over 21655.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2983, pruned_loss=0.07326, over 4260152.80 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:03:17,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2108886.0, ans=0.1 2023-06-26 01:03:26,065 INFO [train.py:996] (3/4) Epoch 12, batch 16050, loss[loss=0.2731, simple_loss=0.3789, pruned_loss=0.08363, over 21844.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3011, pruned_loss=0.07158, over 4266564.22 frames. ], batch size: 371, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:03:31,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2108946.0, ans=0.125 2023-06-26 01:03:48,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-26 01:03:49,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2109006.0, ans=0.0 2023-06-26 01:04:18,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.237e+02 1.456e+03 2.962e+03 6.704e+03, threshold=2.913e+03, percent-clipped=34.0 2023-06-26 01:04:27,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-26 01:04:31,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2109066.0, ans=0.125 2023-06-26 01:04:33,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2109066.0, ans=0.125 2023-06-26 01:04:50,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2109126.0, ans=0.125 2023-06-26 01:04:52,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-26 01:05:02,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2109186.0, ans=0.0 2023-06-26 01:05:15,989 INFO [train.py:996] (3/4) Epoch 12, batch 16100, loss[loss=0.219, simple_loss=0.2984, pruned_loss=0.06979, over 21880.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3061, pruned_loss=0.07304, over 4274632.67 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:05:34,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2109246.0, ans=0.125 2023-06-26 01:06:19,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2109366.0, ans=0.125 2023-06-26 01:07:03,525 INFO [train.py:996] (3/4) Epoch 12, batch 16150, loss[loss=0.2645, simple_loss=0.3303, pruned_loss=0.09939, over 21485.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3061, pruned_loss=0.07544, over 4281448.36 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:07:12,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2109546.0, ans=0.125 2023-06-26 01:07:46,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2109606.0, ans=0.0 2023-06-26 01:07:58,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.957e+02 1.137e+03 1.849e+03 5.347e+03, threshold=2.275e+03, percent-clipped=8.0 2023-06-26 01:08:24,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2109726.0, ans=0.1 2023-06-26 01:08:24,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2109726.0, ans=0.0 2023-06-26 01:08:35,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2109786.0, ans=0.125 2023-06-26 01:08:36,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2109786.0, ans=0.125 2023-06-26 01:08:49,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2109786.0, ans=0.0 2023-06-26 01:08:56,276 INFO [train.py:996] (3/4) Epoch 12, batch 16200, loss[loss=0.289, simple_loss=0.3624, pruned_loss=0.1078, over 21503.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3105, pruned_loss=0.07628, over 4281607.18 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:09:23,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2109846.0, ans=0.125 2023-06-26 01:09:40,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109906.0, ans=0.1 2023-06-26 01:09:51,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-26 01:09:58,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-26 01:10:35,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2110086.0, ans=0.125 2023-06-26 01:10:52,252 INFO [train.py:996] (3/4) Epoch 12, batch 16250, loss[loss=0.1863, simple_loss=0.2586, pruned_loss=0.05703, over 21408.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3116, pruned_loss=0.07689, over 4279342.67 frames. 
], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:10:59,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2110146.0, ans=0.125 2023-06-26 01:11:20,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2110206.0, ans=0.0 2023-06-26 01:11:31,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.426e+02 1.023e+03 1.432e+03 2.136e+03 5.202e+03, threshold=2.864e+03, percent-clipped=19.0 2023-06-26 01:11:35,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2110266.0, ans=0.1 2023-06-26 01:11:45,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2110266.0, ans=0.0 2023-06-26 01:12:12,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2110386.0, ans=0.05 2023-06-26 01:12:19,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.37 vs. limit=10.0 2023-06-26 01:12:34,501 INFO [train.py:996] (3/4) Epoch 12, batch 16300, loss[loss=0.1975, simple_loss=0.2818, pruned_loss=0.05661, over 21271.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3048, pruned_loss=0.07362, over 4279080.55 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:12:38,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2110446.0, ans=0.035 2023-06-26 01:13:57,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=8.0 2023-06-26 01:14:00,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2110686.0, ans=0.2 2023-06-26 01:14:25,775 INFO [train.py:996] (3/4) Epoch 12, batch 16350, loss[loss=0.2124, simple_loss=0.2867, pruned_loss=0.06911, over 21546.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3037, pruned_loss=0.0744, over 4281969.72 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:14:36,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-26 01:15:06,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 01:15:07,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.769e+02 8.768e+02 1.143e+03 1.640e+03 3.455e+03, threshold=2.286e+03, percent-clipped=4.0 2023-06-26 01:15:08,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2110866.0, ans=0.2 2023-06-26 01:15:32,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2110926.0, ans=0.125 2023-06-26 01:15:35,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2110926.0, ans=0.125 2023-06-26 01:16:03,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. 
limit=8.0 2023-06-26 01:16:21,176 INFO [train.py:996] (3/4) Epoch 12, batch 16400, loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.0835, over 21827.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3065, pruned_loss=0.07561, over 4285271.42 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:16:22,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-26 01:17:07,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-26 01:17:24,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.66 vs. limit=5.0 2023-06-26 01:17:25,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111226.0, ans=0.1 2023-06-26 01:18:04,779 INFO [train.py:996] (3/4) Epoch 12, batch 16450, loss[loss=0.2089, simple_loss=0.2809, pruned_loss=0.06851, over 21910.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3055, pruned_loss=0.07652, over 4292993.36 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:18:28,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2111406.0, ans=0.125 2023-06-26 01:18:39,417 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.567e+02 7.458e+02 1.043e+03 1.518e+03 3.613e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-26 01:18:50,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2111466.0, ans=0.125 2023-06-26 01:19:15,921 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:19:41,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2111586.0, ans=0.2 2023-06-26 01:19:54,980 INFO [train.py:996] (3/4) Epoch 12, batch 16500, loss[loss=0.2482, simple_loss=0.3462, pruned_loss=0.07508, over 21215.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3059, pruned_loss=0.07703, over 4298963.22 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:20:50,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2111826.0, ans=0.0 2023-06-26 01:21:05,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2111826.0, ans=0.125 2023-06-26 01:21:46,598 INFO [train.py:996] (3/4) Epoch 12, batch 16550, loss[loss=0.2406, simple_loss=0.3344, pruned_loss=0.07342, over 21494.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.306, pruned_loss=0.07442, over 4296445.81 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:22:28,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.316e+02 1.128e+03 1.805e+03 2.872e+03 7.168e+03, threshold=3.610e+03, percent-clipped=40.0 2023-06-26 01:23:39,436 INFO [train.py:996] (3/4) Epoch 12, batch 16600, loss[loss=0.3126, simple_loss=0.4132, pruned_loss=0.106, over 21674.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.315, pruned_loss=0.07831, over 4293097.82 frames. 
], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:24:14,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2112306.0, ans=0.125 2023-06-26 01:24:24,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2112366.0, ans=10.0 2023-06-26 01:25:29,282 INFO [train.py:996] (3/4) Epoch 12, batch 16650, loss[loss=0.2476, simple_loss=0.3247, pruned_loss=0.08523, over 22018.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3256, pruned_loss=0.08095, over 4292803.68 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:26:29,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.918e+02 9.386e+02 1.441e+03 2.110e+03 3.541e+03, threshold=2.881e+03, percent-clipped=0.0 2023-06-26 01:26:53,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2112726.0, ans=0.125 2023-06-26 01:27:27,172 INFO [train.py:996] (3/4) Epoch 12, batch 16700, loss[loss=0.2567, simple_loss=0.3622, pruned_loss=0.07559, over 21179.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3287, pruned_loss=0.08273, over 4287209.59 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:28:09,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-26 01:28:14,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=12.0 2023-06-26 01:28:27,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.70 vs. limit=6.0 2023-06-26 01:28:47,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2113026.0, ans=0.125 2023-06-26 01:28:51,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2113026.0, ans=0.0 2023-06-26 01:29:00,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2113086.0, ans=0.1 2023-06-26 01:29:14,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 01:29:38,423 INFO [train.py:996] (3/4) Epoch 12, batch 16750, loss[loss=0.2641, simple_loss=0.3433, pruned_loss=0.09242, over 21775.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3286, pruned_loss=0.08395, over 4280821.68 frames. ], batch size: 332, lr: 2.41e-03, grad_scale: 8.0
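The ubiquitous [scaling.py:182] ScheduledFloat entries record regularization knobs (dropout rates, skip rates, balancer probabilities, scale floors) whose value is a function of the global batch_count, with ans giving the value in effect at that point of training. The toy sketch below illustrates such a batch-count schedule; the piecewise-linear form and the example breakpoints are illustrative assumptions, not icefall's ScheduledFloat class.

```python
class PiecewiseLinearSchedule:
    """A float whose value is interpolated between (batch_count, value) points.

    Illustrates the kind of schedule behind the 'ScheduledFloat:
    name=..., batch_count=..., ans=...' log lines; the breakpoints in
    the usage example are made up, not taken from the recipe.
    """

    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count.
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        x0, y0 = self.points[0]
        if batch_count <= x0:
            return y0
        for x1, y1 in self.points[1:]:
            if batch_count <= x1:
                # Linear interpolation between the surrounding breakpoints.
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # past the last breakpoint: hold the final value


# e.g. a dropout rate decaying from 0.3 to 0.1 over the first 20k batches,
# then held flat; at batch_count=10000 the logged 'ans' would be 0.2.
dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
assert abs(dropout_p(10000.0) - 0.2) < 1e-9
```

Scheduling these values by batch count lets the recipe start with strong regularization and relax it as training stabilizes, which is consistent with the many skip rates logged as ans=0.0 this deep into training (batch_count above 2.1M).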
2023-06-26 01:30:21,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.824e+02 1.154e+03 1.795e+03 2.444e+03 4.443e+03, threshold=3.590e+03, percent-clipped=18.0 2023-06-26 01:30:28,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2113266.0, ans=0.0 2023-06-26 01:30:51,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2113326.0, ans=0.04949747468305833 2023-06-26 01:31:18,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2113386.0, ans=0.0 2023-06-26 01:31:23,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2113386.0, ans=0.1 2023-06-26 01:31:30,344 INFO [train.py:996] (3/4) Epoch 12, batch 16800, loss[loss=0.1965, simple_loss=0.2729, pruned_loss=0.06009, over 21769.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3327, pruned_loss=0.08343, over 4283184.31 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:31:53,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-26 01:32:37,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2113626.0, ans=0.125 2023-06-26 01:32:38,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2113626.0, ans=10.0 2023-06-26 01:33:08,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2113686.0, ans=0.1 2023-06-26 01:33:10,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2113686.0, ans=0.125 2023-06-26 01:33:14,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2113686.0, ans=0.125 2023-06-26 01:33:17,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2113746.0, ans=0.125 2023-06-26 01:33:18,812 INFO [train.py:996] (3/4) Epoch 12, batch 16850, loss[loss=0.2262, simple_loss=0.2974, pruned_loss=0.07752, over 21868.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3272, pruned_loss=0.08355, over 4291584.36 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0
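The [scaling.py:962] Whitening entries compare a whiteness statistic of a module's activations against a limit (metric=4.86 vs. limit=10.0 in the line above). One statistic with the logged behaviour, equal to 1.0 for perfectly white features and growing as the channel covariance concentrates in a few directions, is the mean squared eigenvalue of the covariance divided by its squared mean eigenvalue. The exact statistic computed by scaling.py is not shown in the log, so the sketch below is an assumption.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Whiteness of features x with shape (num_frames, num_channels).

    Per channel group of size D with second-moment matrix C, returns
    trace(C @ C) * D / trace(C)**2, averaged over groups: exactly 1.0
    when C is a multiple of the identity (perfectly white features),
    and up to D when all the energy lies in one direction. This
    reproduces the shape of the 'metric=... vs. limit=...' numbers in
    the Whitening entries, but the exact formula used by scaling.py is
    an assumption here.
    """
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    xg = x.reshape(num_frames, num_groups, num_channels // num_groups)
    metrics = []
    for g in range(num_groups):
        feats = xg[:, g, :]                   # (num_frames, D)
        cov = feats.t() @ feats / num_frames  # (D, D) second moment
        d = cov.shape[0]
        metrics.append((cov @ cov).trace() * d / cov.trace() ** 2)
    return torch.stack(metrics).mean().item()


# White noise scores close to 1.0; a rank-1 signal would score close to D.
print(whitening_metric(torch.randn(10000, 256)))  # ~1.03 at this sample size
```

The num_groups and num_channels fields in the log lines (e.g. num_groups=4, num_channels=128 for whiten_keys modules) match this grouped-covariance picture: each group of channels is measured separately and the results are aggregated.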
2023-06-26 01:33:43,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2113806.0, ans=0.0 2023-06-26 01:34:00,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.998e+02 1.125e+03 1.896e+03 2.663e+03 4.313e+03, threshold=3.792e+03, percent-clipped=10.0 2023-06-26 01:34:22,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2113926.0, ans=0.015 2023-06-26 01:34:37,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2113926.0, ans=0.0 2023-06-26 01:34:51,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113986.0, ans=0.125 2023-06-26 01:35:07,195 INFO [train.py:996] (3/4) Epoch 12, batch 16900, loss[loss=0.2018, simple_loss=0.2779, pruned_loss=0.06284, over 21785.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3217, pruned_loss=0.08197, over 4290655.08 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:35:39,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-26 01:36:51,880 INFO [train.py:996] (3/4) Epoch 12, batch 16950, loss[loss=0.2085, simple_loss=0.2885, pruned_loss=0.0643, over 21422.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.314, pruned_loss=0.08023, over 4286665.93 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:37:25,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2114406.0, ans=0.125 2023-06-26 01:37:32,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2114466.0, ans=0.2 2023-06-26 01:37:33,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 8.373e+02 1.013e+03 1.291e+03 3.071e+03, threshold=2.026e+03, percent-clipped=0.0 2023-06-26 01:38:19,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-06-26 01:38:32,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2114586.0, ans=0.0 2023-06-26 01:38:41,241 INFO [train.py:996] (3/4) Epoch 12, batch 17000, loss[loss=0.2497, simple_loss=0.3182, pruned_loss=0.09059, over 21876.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3099, pruned_loss=0.08089, over 4289806.32 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:39:09,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0
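In each [train.py:996] entry, loss[...] describes the current batch (with its own frame count) while tot_loss[...] is a running frame-weighted average, which is why it moves slowly and its frame count plateaus around 4.26M to 4.30M. At roughly 21.5k frames per batch, that plateau is consistent with an exponential decay of about 0.995 per batch (21.5e3 / (1 - 0.995) is about 4.3e6). The sketch below uses that back-of-envelope decay; both the constant and the class are guesses for illustration, not the trainer's actual tracker.

```python
class FrameWeightedTracker:
    """Running frame-weighted average of a loss, with exponential decay.

    Sketch of how a 'tot_loss[loss=..., over N frames.]' figure can be
    maintained; the 0.995 decay is a guess inferred from the ~4.3e6-frame
    plateau in the log (about 21.5k frames per batch divided by
    1 - decay), not a value taken from the training code.
    """

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.weighted_loss = 0.0  # decayed sum of loss * frames
        self.frames = 0.0         # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> None:
        self.weighted_loss = self.decay * self.weighted_loss + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def tot_loss(self) -> float:
        return self.weighted_loss / max(self.frames, 1.0)


# Feed in two per-batch figures taken from the surrounding entries
# (batches 17200 and 17250) to produce a tot_loss-style line.
tracker = FrameWeightedTracker()
for batch_loss, batch_frames in [(0.2399, 21434.0), (0.2383, 19926.0)]:
    tracker.update(batch_loss, batch_frames)
print(f"tot_loss[loss={tracker.tot_loss:.4f}, over {tracker.frames:.2f} frames.]")
```

Weighting by frames rather than by batch keeps long and short utterance batches (batch sizes here swing from about 118 to 702) from distorting the average.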
2023-06-26 01:39:45,187 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:40:01,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2114826.0, ans=0.125 2023-06-26 01:40:12,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2114886.0, ans=0.125 2023-06-26 01:40:14,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2114886.0, ans=0.125 2023-06-26 01:40:25,751 INFO [train.py:996] (3/4) Epoch 12, batch 17050, loss[loss=0.2811, simple_loss=0.3965, pruned_loss=0.08285, over 20894.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3169, pruned_loss=0.08282, over 4292262.23 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:40:45,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2114946.0, ans=0.05 2023-06-26 01:41:05,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.789e+02 8.713e+02 1.354e+03 1.958e+03 3.911e+03, threshold=2.708e+03, percent-clipped=23.0 2023-06-26 01:41:21,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2115066.0, ans=0.125 2023-06-26 01:42:13,565 INFO [train.py:996] (3/4) Epoch 12, batch 17100, loss[loss=0.2353, simple_loss=0.297, pruned_loss=0.08683, over 21699.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3148, pruned_loss=0.08294, over 4298128.58 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:44:02,117 INFO [train.py:996] (3/4) Epoch 12, batch 17150, loss[loss=0.2039, simple_loss=0.2888, pruned_loss=0.05948, over 21381.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3111, pruned_loss=0.08298, over 4297246.32 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:44:43,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.448e+02 7.663e+02 1.094e+03 1.326e+03 2.492e+03, threshold=2.188e+03, percent-clipped=0.0 2023-06-26 01:45:05,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2115666.0, ans=0.125 2023-06-26 01:45:05,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-26 01:45:52,520 INFO [train.py:996] (3/4) Epoch 12, batch 17200, loss[loss=0.2399, simple_loss=0.3101, pruned_loss=0.08488, over 21434.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3103, pruned_loss=0.08148, over 4293020.45 frames.
], batch size: 211, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 01:45:58,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2115846.0, ans=0.1 2023-06-26 01:46:00,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2115846.0, ans=0.125 2023-06-26 01:46:06,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2115846.0, ans=22.5 2023-06-26 01:46:07,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2115846.0, ans=0.2 2023-06-26 01:46:15,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2115906.0, ans=0.0 2023-06-26 01:46:28,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-26 01:46:56,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2115966.0, ans=0.125 2023-06-26 01:47:00,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2115966.0, ans=0.1 2023-06-26 01:47:16,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2116026.0, ans=0.0 2023-06-26 01:47:25,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2116026.0, ans=0.125 2023-06-26 01:47:45,178 INFO [train.py:996] (3/4) Epoch 12, batch 17250, loss[loss=0.2383, simple_loss=0.3071, pruned_loss=0.0847, over 19926.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3141, pruned_loss=0.08307, over 4284366.17 frames. ], batch size: 702, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:48:19,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-26 01:48:38,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-26 01:48:42,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+02 8.687e+02 1.216e+03 1.753e+03 5.268e+03, threshold=2.433e+03, percent-clipped=15.0 2023-06-26 01:48:58,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2116326.0, ans=0.2 2023-06-26 01:49:06,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2116326.0, ans=0.2 2023-06-26 01:49:35,533 INFO [train.py:996] (3/4) Epoch 12, batch 17300, loss[loss=0.2968, simple_loss=0.3677, pruned_loss=0.113, over 21478.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3222, pruned_loss=0.0861, over 4283065.54 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:50:33,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2116566.0, ans=0.0 2023-06-26 01:50:39,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=12.0 2023-06-26 01:51:43,224 INFO [train.py:996] (3/4) Epoch 12, batch 17350, loss[loss=0.3272, simple_loss=0.3986, pruned_loss=0.1279, over 21486.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3227, pruned_loss=0.08566, over 4285557.11 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:51:51,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2116746.0, ans=0.125 2023-06-26 01:52:33,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.805e+02 1.057e+03 1.442e+03 1.846e+03 4.357e+03, threshold=2.883e+03, percent-clipped=11.0 2023-06-26 01:52:58,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2116926.0, ans=0.0 2023-06-26 01:53:00,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-26 01:53:13,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2116986.0, ans=0.125 2023-06-26 01:53:21,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2116986.0, ans=0.2 2023-06-26 01:53:38,840 INFO [train.py:996] (3/4) Epoch 12, batch 17400, loss[loss=0.2171, simple_loss=0.3005, pruned_loss=0.06681, over 21739.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.319, pruned_loss=0.08213, over 4269277.35 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:53:41,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-26 01:53:56,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2117106.0, ans=0.0 2023-06-26 01:54:37,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2117226.0, ans=0.1 2023-06-26 01:54:57,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2117226.0, ans=0.2 2023-06-26 01:55:00,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2117226.0, ans=0.125 2023-06-26 01:55:11,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2117286.0, ans=0.125 2023-06-26 01:55:26,387 INFO [train.py:996] (3/4) Epoch 12, batch 17450, loss[loss=0.1937, simple_loss=0.2461, pruned_loss=0.07066, over 19941.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3149, pruned_loss=0.07927, over 4272951.65 frames. ], batch size: 704, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:56:11,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.427e+02 9.633e+02 1.716e+03 2.627e+03 5.192e+03, threshold=3.432e+03, percent-clipped=16.0 2023-06-26 01:56:54,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-26 01:57:13,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2117586.0, ans=0.1 2023-06-26 01:57:16,246 INFO [train.py:996] (3/4) Epoch 12, batch 17500, loss[loss=0.225, simple_loss=0.2917, pruned_loss=0.07913, over 21397.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3123, pruned_loss=0.07803, over 4281961.91 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:57:18,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2117646.0, ans=0.1 2023-06-26 01:57:31,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2117646.0, ans=0.125 2023-06-26 01:57:55,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2117766.0, ans=0.0 2023-06-26 01:58:07,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2117766.0, ans=0.125 2023-06-26 01:58:36,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2117826.0, ans=0.0 2023-06-26 01:58:48,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-26 01:58:50,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2117886.0, ans=0.0 2023-06-26 01:59:02,042 INFO [train.py:996] (3/4) Epoch 12, batch 17550, loss[loss=0.2362, simple_loss=0.3176, pruned_loss=0.07741, over 16417.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3119, pruned_loss=0.0768, over 4276058.70 frames. ], batch size: 65, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:59:44,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.396e+02 1.240e+03 1.763e+03 3.484e+03, threshold=2.480e+03, percent-clipped=3.0 2023-06-26 02:00:07,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2118126.0, ans=0.0 2023-06-26 02:00:47,353 INFO [train.py:996] (3/4) Epoch 12, batch 17600, loss[loss=0.2427, simple_loss=0.3218, pruned_loss=0.08181, over 21380.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3135, pruned_loss=0.07676, over 4267567.59 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:00:52,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118246.0, ans=0.1 2023-06-26 02:00:54,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2118246.0, ans=0.125 2023-06-26 02:01:10,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-26 02:01:11,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2118306.0, ans=0.0 2023-06-26 02:01:26,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2118366.0, ans=0.0 2023-06-26 02:01:49,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2118426.0, ans=0.125 2023-06-26 02:02:36,271 INFO [train.py:996] (3/4) Epoch 12, batch 17650, loss[loss=0.2185, simple_loss=0.3012, pruned_loss=0.06794, over 21699.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3121, pruned_loss=0.07643, over 4258360.57 frames. ], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:02:52,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2118546.0, ans=0.035 2023-06-26 02:03:27,852 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 8.671e+02 1.388e+03 2.220e+03 4.878e+03, threshold=2.775e+03, percent-clipped=22.0 2023-06-26 02:04:07,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-26 02:04:31,193 INFO [train.py:996] (3/4) Epoch 12, batch 17700, loss[loss=0.2446, simple_loss=0.3307, pruned_loss=0.07928, over 21472.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3051, pruned_loss=0.07383, over 4252896.02 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:04:32,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-06-26 02:04:43,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2118846.0, ans=0.1 2023-06-26 02:04:47,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-26 02:05:18,232 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:05:33,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2118966.0, ans=0.125 2023-06-26 02:05:56,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2119086.0, ans=0.2 2023-06-26 02:05:58,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2119086.0, ans=0.125 2023-06-26 02:06:20,338 INFO [train.py:996] (3/4) Epoch 12, batch 17750, loss[loss=0.2872, simple_loss=0.3608, pruned_loss=0.1068, over 21718.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3118, pruned_loss=0.07711, over 4250231.12 frames. 
], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:07:18,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.321e+02 9.910e+02 1.512e+03 2.052e+03 5.083e+03, threshold=3.025e+03, percent-clipped=13.0 2023-06-26 02:07:35,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2119326.0, ans=0.125 2023-06-26 02:07:56,295 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.719e-03 2023-06-26 02:08:11,432 INFO [train.py:996] (3/4) Epoch 12, batch 17800, loss[loss=0.2437, simple_loss=0.3336, pruned_loss=0.07693, over 21627.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3149, pruned_loss=0.07833, over 4258422.15 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:10:07,311 INFO [train.py:996] (3/4) Epoch 12, batch 17850, loss[loss=0.2446, simple_loss=0.313, pruned_loss=0.08813, over 21329.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3158, pruned_loss=0.07847, over 4259679.51 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:10:13,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119746.0, ans=0.1 2023-06-26 02:10:58,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 1.072e+03 1.677e+03 2.667e+03 5.853e+03, threshold=3.353e+03, percent-clipped=21.0 2023-06-26 02:12:03,541 INFO [train.py:996] (3/4) Epoch 12, batch 17900, loss[loss=0.2442, simple_loss=0.3342, pruned_loss=0.07707, over 21633.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3203, pruned_loss=0.08002, over 4262706.95 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:13:59,304 INFO [train.py:996] (3/4) Epoch 12, batch 17950, loss[loss=0.173, simple_loss=0.2589, pruned_loss=0.04352, over 21796.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3178, pruned_loss=0.07644, over 4251348.85 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:14:06,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2120346.0, ans=0.125 2023-06-26 02:14:19,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2120406.0, ans=0.1 2023-06-26 02:14:44,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.194e+02 1.009e+03 1.518e+03 2.027e+03 4.283e+03, threshold=3.036e+03, percent-clipped=1.0 2023-06-26 02:14:45,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=22.5 2023-06-26 02:15:40,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2120586.0, ans=0.1 2023-06-26 02:15:49,544 INFO [train.py:996] (3/4) Epoch 12, batch 18000, loss[loss=0.1834, simple_loss=0.2547, pruned_loss=0.05608, over 21619.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3106, pruned_loss=0.07527, over 4252125.87 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:15:49,545 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 02:16:07,682 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.258, simple_loss=0.3529, pruned_loss=0.08158, over 1796401.00 frames. 
2023-06-26 02:16:07,683 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-26 02:17:06,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-26 02:17:17,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 02:17:35,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2120826.0, ans=0.07 2023-06-26 02:17:36,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-26 02:17:55,789 INFO [train.py:996] (3/4) Epoch 12, batch 18050, loss[loss=0.2096, simple_loss=0.2655, pruned_loss=0.07679, over 15035.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3047, pruned_loss=0.07416, over 4248167.84 frames. ], batch size: 61, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:17:56,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2120946.0, ans=0.0 2023-06-26 02:18:29,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2121006.0, ans=0.125 2023-06-26 02:18:35,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121006.0, ans=0.1 2023-06-26 02:18:37,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-26 02:18:55,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.179e+02 8.208e+02 1.156e+03 1.748e+03 3.501e+03, threshold=2.312e+03, percent-clipped=3.0 2023-06-26 02:19:20,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2121126.0, ans=0.125 2023-06-26 02:19:20,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2121126.0, ans=0.0 2023-06-26 02:19:50,056 INFO [train.py:996] (3/4) Epoch 12, batch 18100, loss[loss=0.2632, simple_loss=0.3447, pruned_loss=0.0908, over 21217.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3099, pruned_loss=0.07727, over 4252155.06 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:20:12,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2121306.0, ans=0.125 2023-06-26 02:20:26,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2121306.0, ans=0.125 2023-06-26 02:21:36,963 INFO [train.py:996] (3/4) Epoch 12, batch 18150, loss[loss=0.23, simple_loss=0.2992, pruned_loss=0.08039, over 21818.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3117, pruned_loss=0.07693, over 4264037.70 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:21:55,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. 
limit=22.5 2023-06-26 02:22:36,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.496e+02 9.455e+02 1.549e+03 2.118e+03 3.915e+03, threshold=3.098e+03, percent-clipped=17.0 2023-06-26 02:22:36,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2121666.0, ans=0.0 2023-06-26 02:23:01,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-26 02:23:07,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.63 vs. limit=12.0 2023-06-26 02:23:14,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2121786.0, ans=0.125 2023-06-26 02:23:24,008 INFO [train.py:996] (3/4) Epoch 12, batch 18200, loss[loss=0.2158, simple_loss=0.2801, pruned_loss=0.07573, over 21498.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3057, pruned_loss=0.07639, over 4247909.83 frames. ], batch size: 391, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:23:28,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-26 02:23:50,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2121906.0, ans=0.125 2023-06-26 02:24:06,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2121966.0, ans=0.125 2023-06-26 02:24:26,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122026.0, ans=0.0 2023-06-26 02:24:44,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2122086.0, ans=0.125 2023-06-26 02:24:49,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-26 02:25:01,783 INFO [train.py:996] (3/4) Epoch 12, batch 18250, loss[loss=0.2485, simple_loss=0.3091, pruned_loss=0.09392, over 21928.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2989, pruned_loss=0.07443, over 4252969.23 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:25:20,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2122146.0, ans=0.125 2023-06-26 02:25:25,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2122206.0, ans=0.125 2023-06-26 02:25:54,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 8.526e+02 1.132e+03 1.532e+03 3.016e+03, threshold=2.265e+03, percent-clipped=0.0 2023-06-26 02:26:02,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
limit=15.0 2023-06-26 02:26:21,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2122326.0, ans=0.05 2023-06-26 02:26:26,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2122326.0, ans=0.1 2023-06-26 02:26:48,425 INFO [train.py:996] (3/4) Epoch 12, batch 18300, loss[loss=0.2318, simple_loss=0.3616, pruned_loss=0.05107, over 20847.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3004, pruned_loss=0.07493, over 4264822.69 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:27:25,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2122506.0, ans=10.0 2023-06-26 02:28:14,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.26 vs. limit=22.5 2023-06-26 02:28:18,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2122686.0, ans=0.05 2023-06-26 02:28:34,401 INFO [train.py:996] (3/4) Epoch 12, batch 18350, loss[loss=0.2591, simple_loss=0.3334, pruned_loss=0.09239, over 21611.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3049, pruned_loss=0.07523, over 4261037.97 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:28:34,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2122746.0, ans=0.0 2023-06-26 02:29:21,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122866.0, ans=0.0 2023-06-26 02:29:31,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 1.279e+03 1.909e+03 2.961e+03 4.815e+03, threshold=3.819e+03, percent-clipped=39.0 2023-06-26 02:29:57,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-26 02:30:21,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2122986.0, ans=0.125 2023-06-26 02:30:26,200 INFO [train.py:996] (3/4) Epoch 12, batch 18400, loss[loss=0.2476, simple_loss=0.3402, pruned_loss=0.0775, over 21304.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3001, pruned_loss=0.07328, over 4259658.67 frames. ], batch size: 551, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:30:45,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2123046.0, ans=0.125 2023-06-26 02:31:00,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2123106.0, ans=0.125 2023-06-26 02:31:02,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=10.0 2023-06-26 02:31:16,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2123166.0, ans=0.0 2023-06-26 02:31:58,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2123286.0, ans=0.2 2023-06-26 02:32:13,315 INFO [train.py:996] (3/4) Epoch 12, batch 18450, loss[loss=0.1882, simple_loss=0.2715, pruned_loss=0.05249, over 21287.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2973, pruned_loss=0.07019, over 4262205.41 frames. ], batch size: 551, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:32:26,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2123346.0, ans=0.125 2023-06-26 02:32:30,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-26 02:33:07,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.751e+02 8.419e+02 1.219e+03 1.812e+03 4.554e+03, threshold=2.437e+03, percent-clipped=1.0 2023-06-26 02:33:34,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2123526.0, ans=0.125 2023-06-26 02:33:58,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2123646.0, ans=0.2 2023-06-26 02:34:00,004 INFO [train.py:996] (3/4) Epoch 12, batch 18500, loss[loss=0.2014, simple_loss=0.2671, pruned_loss=0.06791, over 21298.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.292, pruned_loss=0.06893, over 4264702.10 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:34:38,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123706.0, ans=0.1 2023-06-26 02:34:40,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-26 02:35:36,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2123886.0, ans=0.1 2023-06-26 02:35:50,191 INFO [train.py:996] (3/4) Epoch 12, batch 18550, loss[loss=0.2031, simple_loss=0.2695, pruned_loss=0.06829, over 21873.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2904, pruned_loss=0.06789, over 4247956.45 frames. ], batch size: 107, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:36:29,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124006.0, ans=0.1 2023-06-26 02:36:57,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2124066.0, ans=0.125 2023-06-26 02:36:59,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 1.196e+03 2.047e+03 2.769e+03 5.158e+03, threshold=4.094e+03, percent-clipped=37.0 2023-06-26 02:37:43,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2124186.0, ans=0.125 2023-06-26 02:37:47,727 INFO [train.py:996] (3/4) Epoch 12, batch 18600, loss[loss=0.2639, simple_loss=0.3476, pruned_loss=0.09012, over 21877.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2887, pruned_loss=0.06873, over 4239392.12 frames. 
], batch size: 373, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:37:58,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2124246.0, ans=0.1 2023-06-26 02:38:09,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124306.0, ans=0.1 2023-06-26 02:39:03,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-26 02:39:24,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2124486.0, ans=0.125 2023-06-26 02:39:30,974 INFO [train.py:996] (3/4) Epoch 12, batch 18650, loss[loss=0.2209, simple_loss=0.2876, pruned_loss=0.07714, over 21526.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2885, pruned_loss=0.0689, over 4248005.35 frames. ], batch size: 195, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:39:42,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2124546.0, ans=0.0 2023-06-26 02:40:09,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:40:30,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 8.452e+02 1.273e+03 1.829e+03 4.021e+03, threshold=2.546e+03, percent-clipped=0.0 2023-06-26 02:40:57,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.77 vs. limit=15.0 2023-06-26 02:41:10,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2124786.0, ans=0.125 2023-06-26 02:41:20,330 INFO [train.py:996] (3/4) Epoch 12, batch 18700, loss[loss=0.219, simple_loss=0.2912, pruned_loss=0.07338, over 21986.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2868, pruned_loss=0.07076, over 4253904.25 frames. ], batch size: 113, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:41:59,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.98 vs. limit=6.0 2023-06-26 02:42:02,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2124906.0, ans=0.125 2023-06-26 02:43:09,789 INFO [train.py:996] (3/4) Epoch 12, batch 18750, loss[loss=0.2231, simple_loss=0.2866, pruned_loss=0.07976, over 21313.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.289, pruned_loss=0.07326, over 4250752.84 frames. 
], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:43:17,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2125146.0, ans=0.125 2023-06-26 02:43:30,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2125206.0, ans=0.0 2023-06-26 02:44:08,696 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 9.819e+02 1.428e+03 2.632e+03 5.661e+03, threshold=2.856e+03, percent-clipped=25.0 2023-06-26 02:44:27,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2125326.0, ans=0.125 2023-06-26 02:44:57,190 INFO [train.py:996] (3/4) Epoch 12, batch 18800, loss[loss=0.23, simple_loss=0.3201, pruned_loss=0.06993, over 21747.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2941, pruned_loss=0.07429, over 4250373.83 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:45:18,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-26 02:45:52,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2125566.0, ans=0.125 2023-06-26 02:45:57,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2125566.0, ans=0.125 2023-06-26 02:46:44,201 INFO [train.py:996] (3/4) Epoch 12, batch 18850, loss[loss=0.1854, simple_loss=0.2832, pruned_loss=0.04379, over 21803.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2928, pruned_loss=0.07006, over 4254462.99 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:47:11,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2125806.0, ans=0.125 2023-06-26 02:47:14,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125806.0, ans=0.1 2023-06-26 02:47:45,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 8.171e+02 1.119e+03 1.966e+03 5.674e+03, threshold=2.238e+03, percent-clipped=7.0 2023-06-26 02:48:10,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2125926.0, ans=0.0 2023-06-26 02:48:20,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-26 02:48:23,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2125986.0, ans=0.0 2023-06-26 02:48:31,723 INFO [train.py:996] (3/4) Epoch 12, batch 18900, loss[loss=0.2385, simple_loss=0.2974, pruned_loss=0.08975, over 21819.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2883, pruned_loss=0.06945, over 4241995.49 frames. 
], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:48:32,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2126046.0, ans=0.125 2023-06-26 02:48:47,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-26 02:49:39,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2126226.0, ans=0.1 2023-06-26 02:49:41,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-26 02:49:50,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-26 02:50:19,369 INFO [train.py:996] (3/4) Epoch 12, batch 18950, loss[loss=0.2996, simple_loss=0.3522, pruned_loss=0.1235, over 21725.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2901, pruned_loss=0.07219, over 4247890.68 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:51:08,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-26 02:51:21,179 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.969e+02 7.601e+02 1.025e+03 1.566e+03 4.291e+03, threshold=2.050e+03, percent-clipped=8.0 2023-06-26 02:51:48,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2126586.0, ans=0.125 2023-06-26 02:51:52,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2126586.0, ans=0.0 2023-06-26 02:51:59,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-26 02:52:07,259 INFO [train.py:996] (3/4) Epoch 12, batch 19000, loss[loss=0.2509, simple_loss=0.3262, pruned_loss=0.08775, over 21483.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2995, pruned_loss=0.07409, over 4255613.62 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:52:30,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 02:52:59,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2126766.0, ans=0.0 2023-06-26 02:53:17,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2126826.0, ans=0.0 2023-06-26 02:53:37,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2126886.0, ans=0.2 2023-06-26 02:53:55,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.84 vs. limit=6.0 2023-06-26 02:53:55,635 INFO [train.py:996] (3/4) Epoch 12, batch 19050, loss[loss=0.2341, simple_loss=0.2953, pruned_loss=0.08642, over 20042.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.0769, over 4262299.35 frames. 
], batch size: 703, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:54:16,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2127006.0, ans=0.2 2023-06-26 02:54:42,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2127006.0, ans=0.05 2023-06-26 02:54:51,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=12.0 2023-06-26 02:54:53,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2127066.0, ans=0.0 2023-06-26 02:54:56,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-26 02:55:00,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 8.291e+02 1.142e+03 1.622e+03 3.399e+03, threshold=2.283e+03, percent-clipped=17.0 2023-06-26 02:55:29,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-26 02:55:46,471 INFO [train.py:996] (3/4) Epoch 12, batch 19100, loss[loss=0.2029, simple_loss=0.269, pruned_loss=0.06843, over 21251.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3015, pruned_loss=0.0775, over 4268249.20 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:56:09,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2127246.0, ans=0.125 2023-06-26 02:56:13,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2127306.0, ans=0.0 2023-06-26 02:56:38,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2127366.0, ans=0.125 2023-06-26 02:57:14,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2127426.0, ans=10.0 2023-06-26 02:57:49,107 INFO [train.py:996] (3/4) Epoch 12, batch 19150, loss[loss=0.2441, simple_loss=0.3412, pruned_loss=0.07351, over 21684.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3048, pruned_loss=0.07891, over 4267153.85 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:57:51,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2127546.0, ans=0.0 2023-06-26 02:58:16,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2127606.0, ans=0.1 2023-06-26 02:58:50,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.298e+02 9.303e+02 1.285e+03 2.071e+03 6.086e+03, threshold=2.570e+03, percent-clipped=20.0 2023-06-26 02:59:48,648 INFO [train.py:996] (3/4) Epoch 12, batch 19200, loss[loss=0.2963, simple_loss=0.4126, pruned_loss=0.08994, over 20765.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3161, pruned_loss=0.08001, over 4264426.13 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:59:56,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.21 vs. 
limit=6.0 2023-06-26 03:00:33,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-26 03:00:39,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2127966.0, ans=0.125 2023-06-26 03:00:40,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2127966.0, ans=0.5 2023-06-26 03:01:09,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2128026.0, ans=0.125 2023-06-26 03:01:36,755 INFO [train.py:996] (3/4) Epoch 12, batch 19250, loss[loss=0.2297, simple_loss=0.297, pruned_loss=0.08122, over 21529.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3155, pruned_loss=0.07505, over 4267662.11 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 03:02:17,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2128266.0, ans=0.0 2023-06-26 03:02:28,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 8.401e+02 1.172e+03 1.980e+03 3.719e+03, threshold=2.345e+03, percent-clipped=11.0 2023-06-26 03:02:56,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2128386.0, ans=0.2 2023-06-26 03:03:18,850 INFO [train.py:996] (3/4) Epoch 12, batch 19300, loss[loss=0.1648, simple_loss=0.2616, pruned_loss=0.03397, over 21581.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.312, pruned_loss=0.0743, over 4275773.25 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:03:22,612 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:03:48,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2128506.0, ans=0.125 2023-06-26 03:03:52,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2128506.0, ans=0.0 2023-06-26 03:04:00,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-26 03:04:13,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=22.5 2023-06-26 03:04:34,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2128626.0, ans=10.0 2023-06-26 03:05:09,191 INFO [train.py:996] (3/4) Epoch 12, batch 19350, loss[loss=0.2527, simple_loss=0.34, pruned_loss=0.08267, over 21536.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3075, pruned_loss=0.0716, over 4276887.67 frames. 
], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:05:18,212 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:06:02,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.957e+02 9.080e+02 1.347e+03 2.318e+03 4.849e+03, threshold=2.694e+03, percent-clipped=24.0 2023-06-26 03:06:20,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2128926.0, ans=0.2 2023-06-26 03:06:22,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-26 03:06:57,153 INFO [train.py:996] (3/4) Epoch 12, batch 19400, loss[loss=0.1734, simple_loss=0.2559, pruned_loss=0.04547, over 21620.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3044, pruned_loss=0.07075, over 4272787.10 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:06:58,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-26 03:07:22,522 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-26 03:07:23,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2129106.0, ans=0.0 2023-06-26 03:07:31,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-26 03:08:02,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2129226.0, ans=0.1 2023-06-26 03:08:44,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-26 03:08:45,408 INFO [train.py:996] (3/4) Epoch 12, batch 19450, loss[loss=0.2165, simple_loss=0.2756, pruned_loss=0.07875, over 21269.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.301, pruned_loss=0.07177, over 4280614.64 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:08:55,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-26 03:09:38,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.528e+02 8.932e+02 1.244e+03 1.603e+03 3.427e+03, threshold=2.488e+03, percent-clipped=5.0 2023-06-26 03:10:08,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2129586.0, ans=0.125 2023-06-26 03:10:29,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2129586.0, ans=0.125 2023-06-26 03:10:32,556 INFO [train.py:996] (3/4) Epoch 12, batch 19500, loss[loss=0.326, simple_loss=0.3833, pruned_loss=0.1343, over 21453.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2984, pruned_loss=0.07357, over 4285814.28 frames. 
], batch size: 507, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:10:32,894 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:10:33,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-26 03:11:00,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2129706.0, ans=0.04949747468305833 2023-06-26 03:11:29,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-26 03:12:15,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2129886.0, ans=0.125 2023-06-26 03:12:21,290 INFO [train.py:996] (3/4) Epoch 12, batch 19550, loss[loss=0.2117, simple_loss=0.3125, pruned_loss=0.05541, over 21850.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2939, pruned_loss=0.07149, over 4277423.38 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:12:54,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.45 vs. limit=5.0 2023-06-26 03:12:56,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2130006.0, ans=0.125 2023-06-26 03:13:15,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.254e+02 9.637e+02 1.286e+03 1.805e+03 3.756e+03, threshold=2.572e+03, percent-clipped=14.0 2023-06-26 03:13:30,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2130126.0, ans=0.125 2023-06-26 03:13:51,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-26 03:14:03,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-26 03:14:05,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2130186.0, ans=0.0 2023-06-26 03:14:09,950 INFO [train.py:996] (3/4) Epoch 12, batch 19600, loss[loss=0.2165, simple_loss=0.2951, pruned_loss=0.06894, over 21095.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2965, pruned_loss=0.07285, over 4280673.49 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:14:57,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2130366.0, ans=0.015 2023-06-26 03:15:21,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2130426.0, ans=0.125 2023-06-26 03:15:52,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2130486.0, ans=0.125 2023-06-26 03:16:00,201 INFO [train.py:996] (3/4) Epoch 12, batch 19650, loss[loss=0.2213, simple_loss=0.2967, pruned_loss=0.0729, over 21797.00 frames. ], tot_loss[loss=0.226, simple_loss=0.301, pruned_loss=0.07552, over 4274956.06 frames. 
], batch size: 332, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:16:42,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2130666.0, ans=0.0 2023-06-26 03:17:06,804 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.323e+02 8.252e+02 1.350e+03 1.732e+03 4.354e+03, threshold=2.700e+03, percent-clipped=5.0 2023-06-26 03:17:09,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2130726.0, ans=0.0 2023-06-26 03:17:52,267 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:18:00,741 INFO [train.py:996] (3/4) Epoch 12, batch 19700, loss[loss=0.216, simple_loss=0.3153, pruned_loss=0.05836, over 21765.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3048, pruned_loss=0.07773, over 4277015.32 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:18:15,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2130846.0, ans=0.125 2023-06-26 03:18:38,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2130906.0, ans=0.1 2023-06-26 03:19:20,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2131026.0, ans=0.0 2023-06-26 03:19:46,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2131086.0, ans=0.125 2023-06-26 03:19:51,686 INFO [train.py:996] (3/4) Epoch 12, batch 19750, loss[loss=0.2365, simple_loss=0.3427, pruned_loss=0.06519, over 21655.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3112, pruned_loss=0.07766, over 4269027.29 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:19:57,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2131146.0, ans=0.125 2023-06-26 03:20:32,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=8.0 2023-06-26 03:20:59,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.834e+02 9.206e+02 1.478e+03 2.437e+03 4.883e+03, threshold=2.956e+03, percent-clipped=21.0 2023-06-26 03:21:07,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-26 03:21:12,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2131326.0, ans=0.1 2023-06-26 03:21:15,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2131326.0, ans=0.125 2023-06-26 03:21:41,924 INFO [train.py:996] (3/4) Epoch 12, batch 19800, loss[loss=0.2137, simple_loss=0.2803, pruned_loss=0.07358, over 21443.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3112, pruned_loss=0.07891, over 4282511.62 frames. 
], batch size: 211, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:21:51,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2131446.0, ans=0.125 2023-06-26 03:22:37,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2131566.0, ans=0.125 2023-06-26 03:23:20,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2131686.0, ans=0.2 2023-06-26 03:23:33,418 INFO [train.py:996] (3/4) Epoch 12, batch 19850, loss[loss=0.155, simple_loss=0.2168, pruned_loss=0.0466, over 15930.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3048, pruned_loss=0.07403, over 4276536.15 frames. ], batch size: 60, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:23:41,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2131746.0, ans=0.0 2023-06-26 03:23:45,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2131746.0, ans=0.05 2023-06-26 03:24:41,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.499e+02 1.416e+03 2.010e+03 4.711e+03, threshold=2.833e+03, percent-clipped=4.0 2023-06-26 03:25:04,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2131986.0, ans=0.0 2023-06-26 03:25:29,020 INFO [train.py:996] (3/4) Epoch 12, batch 19900, loss[loss=0.2209, simple_loss=0.2884, pruned_loss=0.0767, over 21355.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3064, pruned_loss=0.07257, over 4278439.19 frames. ], batch size: 144, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:26:02,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-26 03:26:23,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2132166.0, ans=0.025 2023-06-26 03:26:51,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.50 vs. limit=5.0 2023-06-26 03:27:16,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132286.0, ans=0.1 2023-06-26 03:27:24,332 INFO [train.py:996] (3/4) Epoch 12, batch 19950, loss[loss=0.2165, simple_loss=0.2938, pruned_loss=0.06964, over 21629.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2998, pruned_loss=0.07201, over 4280963.16 frames. 
], batch size: 230, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 03:27:41,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2132346.0, ans=0.2
2023-06-26 03:28:15,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2132466.0, ans=0.125
2023-06-26 03:28:26,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2132466.0, ans=0.125
2023-06-26 03:28:29,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 8.702e+02 1.204e+03 1.763e+03 4.092e+03, threshold=2.408e+03, percent-clipped=5.0
2023-06-26 03:28:42,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=12.0
2023-06-26 03:29:08,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2132586.0, ans=0.05
2023-06-26 03:29:17,263 INFO [train.py:996] (3/4) Epoch 12, batch 20000, loss[loss=0.2309, simple_loss=0.3101, pruned_loss=0.07588, over 21523.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3, pruned_loss=0.07228, over 4275654.19 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:29:45,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2132706.0, ans=0.0
2023-06-26 03:31:06,156 INFO [train.py:996] (3/4) Epoch 12, batch 20050, loss[loss=0.2425, simple_loss=0.3143, pruned_loss=0.08532, over 21767.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3029, pruned_loss=0.07458, over 4280943.48 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:31:08,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2132946.0, ans=0.125
2023-06-26 03:31:36,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2133006.0, ans=0.125
2023-06-26 03:31:58,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0
2023-06-26 03:32:07,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 9.408e+02 1.346e+03 1.718e+03 3.911e+03, threshold=2.692e+03, percent-clipped=11.0
2023-06-26 03:32:23,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2133126.0, ans=0.125
2023-06-26 03:32:55,544 INFO [train.py:996] (3/4) Epoch 12, batch 20100, loss[loss=0.2564, simple_loss=0.3436, pruned_loss=0.08456, over 21817.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3064, pruned_loss=0.0774, over 4284694.39 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:33:03,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2133246.0, ans=0.025
2023-06-26 03:33:04,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2133246.0, ans=0.05
2023-06-26 03:33:06,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2133246.0, ans=0.125
2023-06-26 03:34:27,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2133486.0, ans=0.125
2023-06-26 03:34:43,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2133486.0, ans=0.125
2023-06-26 03:34:46,174 INFO [train.py:996] (3/4) Epoch 12, batch 20150, loss[loss=0.2588, simple_loss=0.3349, pruned_loss=0.09134, over 21736.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3153, pruned_loss=0.08069, over 4284978.09 frames. ], batch size: 332, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:34:48,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0
2023-06-26 03:34:57,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2133546.0, ans=0.125
2023-06-26 03:35:32,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2133606.0, ans=0.05
2023-06-26 03:36:04,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.317e+02 8.895e+02 1.198e+03 1.726e+03 5.010e+03, threshold=2.397e+03, percent-clipped=8.0
2023-06-26 03:36:53,586 INFO [train.py:996] (3/4) Epoch 12, batch 20200, loss[loss=0.2391, simple_loss=0.3389, pruned_loss=0.06962, over 21805.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3216, pruned_loss=0.08226, over 4277077.38 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:36:58,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2133846.0, ans=0.1
2023-06-26 03:37:03,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2133846.0, ans=0.125
2023-06-26 03:38:46,321 INFO [train.py:996] (3/4) Epoch 12, batch 20250, loss[loss=0.2205, simple_loss=0.3052, pruned_loss=0.06787, over 21767.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3227, pruned_loss=0.08143, over 4279781.05 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:38:48,815 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.602e-03
2023-06-26 03:38:50,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0
2023-06-26 03:38:53,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2134146.0, ans=0.125
2023-06-26 03:39:32,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134266.0, ans=0.1
2023-06-26 03:39:46,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2134266.0, ans=0.0
2023-06-26 03:39:49,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.636e+02 8.204e+02 1.224e+03 2.058e+03 5.091e+03, threshold=2.449e+03, percent-clipped=18.0
2023-06-26 03:40:07,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2134326.0, ans=0.125
2023-06-26 03:40:38,140 INFO [train.py:996] (3/4) Epoch 12, batch 20300, loss[loss=0.2211, simple_loss=0.3028, pruned_loss=0.06972, over 21555.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3209, pruned_loss=0.07822, over 4275965.57 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:40:50,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2134446.0, ans=0.125
2023-06-26 03:41:12,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2134506.0, ans=0.2
2023-06-26 03:41:31,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2134566.0, ans=0.0
2023-06-26 03:41:39,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2134626.0, ans=0.125
2023-06-26 03:41:43,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2134626.0, ans=0.125
2023-06-26 03:41:43,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2134626.0, ans=0.125
2023-06-26 03:42:12,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2134686.0, ans=0.0
2023-06-26 03:42:26,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0
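[Editor's note on the ScheduledFloat records above: scaling.py:182 logs the current value (ans) of a scheduled hyperparameter at the given batch_count. As a rough illustration only, the breakpoints and helper below are assumptions for this sketch, not icefall's actual schedule: such a value can be modeled as a piecewise-linear function of batch_count that is held constant past its last breakpoint, which is why a run this deep keeps printing the same ans.]

    def scheduled_value(batch_count, points):
        """Piecewise-linear schedule over sorted (batch_count, value)
        breakpoints, clamped to the first/last value outside their range."""
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # Hypothetical breakpoints: by batch_count 2134146.0 the schedule has
    # long since flattened at its final value, matching the constant ans
    # values (0.2, 0.125, ...) seen in the log.
    print(scheduled_value(2134146.0, [(0.0, 0.5), (20000.0, 0.2)]))  # -> 0.2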
2023-06-26 03:42:28,304 INFO [train.py:996] (3/4) Epoch 12, batch 20350, loss[loss=0.2323, simple_loss=0.3124, pruned_loss=0.07612, over 21939.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3184, pruned_loss=0.07777, over 4267450.67 frames. ], batch size: 372, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:42:42,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2134746.0, ans=0.0
2023-06-26 03:43:02,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2134806.0, ans=0.125
2023-06-26 03:43:31,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 1.038e+03 1.419e+03 2.103e+03 3.160e+03, threshold=2.839e+03, percent-clipped=11.0
2023-06-26 03:43:43,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134926.0, ans=0.1
2023-06-26 03:44:10,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134986.0, ans=0.1
2023-06-26 03:44:17,346 INFO [train.py:996] (3/4) Epoch 12, batch 20400, loss[loss=0.2259, simple_loss=0.3075, pruned_loss=0.07211, over 21715.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3215, pruned_loss=0.08103, over 4269434.55 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:45:31,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135226.0, ans=0.1
2023-06-26 03:46:02,260 INFO [train.py:996] (3/4) Epoch 12, batch 20450, loss[loss=0.2587, simple_loss=0.3077, pruned_loss=0.1048, over 21954.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3216, pruned_loss=0.08309, over 4267210.31 frames. ], batch size: 113, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:46:25,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2135406.0, ans=0.025
2023-06-26 03:47:05,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.613e+02 8.799e+02 1.280e+03 2.012e+03 4.043e+03, threshold=2.560e+03, percent-clipped=9.0
2023-06-26 03:47:08,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2135526.0, ans=0.0
2023-06-26 03:47:45,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0
2023-06-26 03:47:48,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135586.0, ans=0.1
2023-06-26 03:47:52,235 INFO [train.py:996] (3/4) Epoch 12, batch 20500, loss[loss=0.2172, simple_loss=0.2766, pruned_loss=0.07885, over 21732.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3172, pruned_loss=0.08311, over 4265943.83 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:48:07,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0
2023-06-26 03:48:46,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2135766.0, ans=0.0
2023-06-26 03:49:31,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2135886.0, ans=0.125
2023-06-26 03:49:33,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2135886.0, ans=0.125
2023-06-26 03:49:33,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0
2023-06-26 03:49:35,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2135886.0, ans=0.125
2023-06-26 03:49:41,350 INFO [train.py:996] (3/4) Epoch 12, batch 20550, loss[loss=0.2914, simple_loss=0.3757, pruned_loss=0.1036, over 21514.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3106, pruned_loss=0.08151, over 4264187.67 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:50:08,588 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 03:50:18,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2136006.0, ans=15.0
2023-06-26 03:50:24,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2136006.0, ans=0.125
2023-06-26 03:50:31,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2136066.0, ans=0.125
2023-06-26 03:50:38,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2136066.0, ans=0.125
2023-06-26 03:50:48,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.549e+02 9.305e+02 1.211e+03 1.898e+03 4.893e+03, threshold=2.421e+03, percent-clipped=7.0
2023-06-26 03:51:00,181 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-26 03:51:32,354 INFO [train.py:996] (3/4) Epoch 12, batch 20600, loss[loss=0.2437, simple_loss=0.3049, pruned_loss=0.09122, over 21467.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3132, pruned_loss=0.08057, over 4264392.79 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 03:51:32,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2136246.0, ans=0.0
2023-06-26 03:53:06,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5
2023-06-26 03:53:19,363 INFO [train.py:996] (3/4) Epoch 12, batch 20650, loss[loss=0.2153, simple_loss=0.2774, pruned_loss=0.07661, over 21609.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3083, pruned_loss=0.08033, over 4257184.83 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 03:54:07,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0
2023-06-26 03:54:17,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2136666.0, ans=0.2
2023-06-26 03:54:23,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.638e+02 7.911e+02 1.093e+03 1.392e+03 2.795e+03, threshold=2.187e+03, percent-clipped=3.0
2023-06-26 03:55:00,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2136786.0, ans=0.2
2023-06-26 03:55:06,903 INFO [train.py:996] (3/4) Epoch 12, batch 20700, loss[loss=0.3162, simple_loss=0.3886, pruned_loss=0.1219, over 21523.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3023, pruned_loss=0.07763, over 4253291.27 frames. ], batch size: 508, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 03:56:34,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2137026.0, ans=0.0
2023-06-26 03:56:57,169 INFO [train.py:996] (3/4) Epoch 12, batch 20750, loss[loss=0.3005, simple_loss=0.3963, pruned_loss=0.1023, over 21658.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3054, pruned_loss=0.07682, over 4254387.64 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 03:57:15,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2137146.0, ans=0.125
2023-06-26 03:57:29,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2137206.0, ans=0.2
2023-06-26 03:57:36,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0
2023-06-26 03:57:53,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2137266.0, ans=0.125
2023-06-26 03:58:14,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.764e+02 1.037e+03 1.665e+03 2.328e+03 7.151e+03, threshold=3.329e+03, percent-clipped=27.0
2023-06-26 03:58:51,604 INFO [train.py:996] (3/4) Epoch 12, batch 20800, loss[loss=0.2086, simple_loss=0.2681, pruned_loss=0.07452, over 21443.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3071, pruned_loss=0.0773, over 4265266.59 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 03:59:12,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2137506.0, ans=0.0
2023-06-26 03:59:24,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2137506.0, ans=0.025
2023-06-26 03:59:41,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2137566.0, ans=0.125
2023-06-26 03:59:45,222 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 04:00:39,497 INFO [train.py:996] (3/4) Epoch 12, batch 20850, loss[loss=0.159, simple_loss=0.2333, pruned_loss=0.04232, over 21461.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2992, pruned_loss=0.07488, over 4261865.45 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:01:02,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2137806.0, ans=0.0
2023-06-26 04:01:48,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2137926.0, ans=0.0
2023-06-26 04:01:49,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 7.660e+02 1.067e+03 1.526e+03 3.659e+03, threshold=2.133e+03, percent-clipped=1.0
2023-06-26 04:01:57,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2137926.0, ans=0.125
2023-06-26 04:02:27,148 INFO [train.py:996] (3/4) Epoch 12, batch 20900, loss[loss=0.2012, simple_loss=0.2802, pruned_loss=0.06114, over 21512.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3005, pruned_loss=0.07583, over 4269580.71 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:03:33,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2138226.0, ans=0.125
2023-06-26 04:04:00,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2138286.0, ans=0.125
2023-06-26 04:04:06,621 INFO [train.py:996] (3/4) Epoch 12, batch 20950, loss[loss=0.1881, simple_loss=0.2587, pruned_loss=0.05879, over 21364.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2967, pruned_loss=0.07199, over 4264270.67 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:04:28,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2138346.0, ans=0.05
2023-06-26 04:04:56,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2138466.0, ans=0.07
2023-06-26 04:05:17,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0
2023-06-26 04:05:18,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.308e+02 8.469e+02 1.535e+03 2.193e+03 7.053e+03, threshold=3.069e+03, percent-clipped=28.0
2023-06-26 04:05:53,916 INFO [train.py:996] (3/4) Epoch 12, batch 21000, loss[loss=0.242, simple_loss=0.3197, pruned_loss=0.0821, over 21882.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2968, pruned_loss=0.07301, over 4267037.76 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 04:05:53,916 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-26 04:06:14,413 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.7307, 2.8955, 2.1678, 3.1575, 1.7273, 3.0721, 2.7053, 2.4707], device='cuda:3')
2023-06-26 04:06:16,506 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2617, simple_loss=0.359, pruned_loss=0.08218, over 1796401.00 frames.
2023-06-26 04:06:16,507 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-26 04:06:22,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2138646.0, ans=0.0
2023-06-26 04:06:31,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0
2023-06-26 04:07:18,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2138826.0, ans=0.0
2023-06-26 04:07:36,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.61 vs. limit=5.0
2023-06-26 04:07:37,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2138886.0, ans=0.0
2023-06-26 04:07:52,132 INFO [train.py:996] (3/4) Epoch 12, batch 21050, loss[loss=0.2019, simple_loss=0.2712, pruned_loss=0.06632, over 21814.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2951, pruned_loss=0.07306, over 4270537.77 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 04:08:32,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2139066.0, ans=0.0
2023-06-26 04:08:56,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.279e+02 7.328e+02 1.060e+03 1.380e+03 3.297e+03, threshold=2.119e+03, percent-clipped=1.0
2023-06-26 04:09:37,024 INFO [train.py:996] (3/4) Epoch 12, batch 21100, loss[loss=0.2185, simple_loss=0.2802, pruned_loss=0.07835, over 21380.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2916, pruned_loss=0.07304, over 4260154.31 frames. ], batch size: 160, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 04:10:16,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139366.0, ans=0.1
2023-06-26 04:10:30,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2139366.0, ans=0.125
2023-06-26 04:10:42,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2139426.0, ans=0.125
2023-06-26 04:11:22,944 INFO [train.py:996] (3/4) Epoch 12, batch 21150, loss[loss=0.2469, simple_loss=0.2881, pruned_loss=0.1029, over 21528.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2872, pruned_loss=0.0737, over 4260106.44 frames. ], batch size: 512, lr: 2.40e-03, grad_scale: 8.0
2023-06-26 04:11:24,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2139546.0, ans=0.0
2023-06-26 04:12:12,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0
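[Editor's note on the optim.py:471 records: each one summarizes recent gradient norms as five quantiles (min, 25%, median, 75%, max). Throughout this log the reported threshold tracks Clipping_scale times the median, up to rounding (e.g. 2.0 * 1.204e+03 = 2.408e+03), and percent-clipped is the share of recent batches whose gradient norm exceeded that threshold. A minimal sketch of that bookkeeping, assuming the median-based rule; the helper name and buffer handling are illustrative, not icefall's optim.py.]

    import torch

    def clipping_report(recent_norms, grad_norm, clipping_scale=2.0):
        """Quantile summary of recent gradient norms plus the factor that
        would clip the current gradient to clipping_scale * median."""
        norms = torch.tensor(recent_norms)
        q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]  # 2.0 * median
        clip_factor = min(1.0, (threshold / grad_norm).item())
        return q, threshold.item(), clip_factor

    q, threshold, factor = clipping_report(
        [534.6, 870.2, 1204.0, 1763.0, 4092.0], grad_norm=3000.0)
    print(threshold, factor)  # 2408.0, ~0.80: this gradient would be clipped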
2023-06-26 04:12:14,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0
2023-06-26 04:12:23,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2139726.0, ans=0.125
2023-06-26 04:12:28,502 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 8.733e+02 1.101e+03 1.484e+03 2.918e+03, threshold=2.203e+03, percent-clipped=8.0
2023-06-26 04:12:28,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2139726.0, ans=0.04949747468305833
2023-06-26 04:12:44,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2139726.0, ans=0.2
2023-06-26 04:13:08,685 INFO [train.py:996] (3/4) Epoch 12, batch 21200, loss[loss=0.2243, simple_loss=0.2905, pruned_loss=0.07903, over 21576.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2841, pruned_loss=0.07235, over 4255956.23 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:13:14,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2139846.0, ans=0.05
2023-06-26 04:13:42,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2139906.0, ans=0.125
2023-06-26 04:13:46,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0
2023-06-26 04:14:29,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0
2023-06-26 04:14:51,517 INFO [train.py:996] (3/4) Epoch 12, batch 21250, loss[loss=0.2076, simple_loss=0.2866, pruned_loss=0.06428, over 21400.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2819, pruned_loss=0.07248, over 4252809.29 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:15:35,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2140266.0, ans=0.125
2023-06-26 04:15:37,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2140266.0, ans=0.2
2023-06-26 04:16:07,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.686e+02 9.766e+02 1.414e+03 1.888e+03 3.901e+03, threshold=2.827e+03, percent-clipped=19.0
2023-06-26 04:16:38,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5
2023-06-26 04:16:39,721 INFO [train.py:996] (3/4) Epoch 12, batch 21300, loss[loss=0.2162, simple_loss=0.2959, pruned_loss=0.06825, over 21482.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2885, pruned_loss=0.0747, over 4259679.68 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:17:05,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2140446.0, ans=0.0
2023-06-26 04:17:06,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2140446.0, ans=0.04949747468305833
2023-06-26 04:17:11,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2140506.0, ans=0.125
2023-06-26 04:18:37,756 INFO [train.py:996] (3/4) Epoch 12, batch 21350, loss[loss=0.1834, simple_loss=0.274, pruned_loss=0.04642, over 21412.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2937, pruned_loss=0.07598, over 4258364.70 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:19:08,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2140806.0, ans=0.125
2023-06-26 04:19:27,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2140866.0, ans=0.125
2023-06-26 04:19:52,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.340e+02 8.083e+02 1.206e+03 1.998e+03 5.884e+03, threshold=2.412e+03, percent-clipped=11.0
2023-06-26 04:20:05,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2140986.0, ans=0.125
2023-06-26 04:20:34,814 INFO [train.py:996] (3/4) Epoch 12, batch 21400, loss[loss=0.2715, simple_loss=0.35, pruned_loss=0.09655, over 21813.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2975, pruned_loss=0.07528, over 4262860.21 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:20:48,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2141046.0, ans=0.125
2023-06-26 04:20:53,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2141106.0, ans=0.2
2023-06-26 04:21:12,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2141106.0, ans=0.0
2023-06-26 04:21:34,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2141166.0, ans=0.1
2023-06-26 04:21:38,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2141226.0, ans=0.125
2023-06-26 04:22:16,623 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 04:22:23,053 INFO [train.py:996] (3/4) Epoch 12, batch 21450, loss[loss=0.2033, simple_loss=0.2759, pruned_loss=0.06536, over 21676.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3004, pruned_loss=0.07652, over 4272128.61 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:22:35,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2141346.0, ans=0.1
2023-06-26 04:23:05,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2141466.0, ans=0.0
2023-06-26 04:23:28,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.921e+02 8.614e+02 1.088e+03 1.616e+03 2.799e+03, threshold=2.175e+03, percent-clipped=3.0
2023-06-26 04:24:11,491 INFO [train.py:996] (3/4) Epoch 12, batch 21500, loss[loss=0.2089, simple_loss=0.2693, pruned_loss=0.07428, over 21611.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.299, pruned_loss=0.0782, over 4269587.10 frames. ], batch size: 231, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:24:16,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2141646.0, ans=0.0
2023-06-26 04:25:14,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2141826.0, ans=0.05
2023-06-26 04:25:47,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0
2023-06-26 04:25:55,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2141886.0, ans=0.0
2023-06-26 04:25:58,251 INFO [train.py:996] (3/4) Epoch 12, batch 21550, loss[loss=0.218, simple_loss=0.2822, pruned_loss=0.0769, over 21484.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2911, pruned_loss=0.07523, over 4276872.30 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:26:15,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=22.5
2023-06-26 04:26:27,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2142006.0, ans=0.0
2023-06-26 04:27:09,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 8.216e+02 1.112e+03 1.423e+03 3.148e+03, threshold=2.223e+03, percent-clipped=7.0
2023-06-26 04:27:50,952 INFO [train.py:996] (3/4) Epoch 12, batch 21600, loss[loss=0.1954, simple_loss=0.2695, pruned_loss=0.06063, over 21582.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2863, pruned_loss=0.0734, over 4272464.24 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:28:10,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0
2023-06-26 04:28:52,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0
2023-06-26 04:29:08,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2142426.0, ans=0.07
2023-06-26 04:29:21,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0
2023-06-26 04:29:43,320 INFO [train.py:996] (3/4) Epoch 12, batch 21650, loss[loss=0.2082, simple_loss=0.278, pruned_loss=0.06924, over 21815.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2913, pruned_loss=0.07098, over 4270113.26 frames. ], batch size: 98, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:29:52,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2142546.0, ans=0.1
2023-06-26 04:29:58,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2142546.0, ans=0.0
2023-06-26 04:30:34,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2142666.0, ans=0.125
2023-06-26 04:30:56,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 7.832e+02 1.347e+03 2.277e+03 5.515e+03, threshold=2.694e+03, percent-clipped=27.0
2023-06-26 04:31:19,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2142786.0, ans=0.125
2023-06-26 04:31:30,506 INFO [train.py:996] (3/4) Epoch 12, batch 21700, loss[loss=0.1898, simple_loss=0.2815, pruned_loss=0.04901, over 21669.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2914, pruned_loss=0.06943, over 4265319.14 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:32:08,073 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 04:32:21,328 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 04:33:20,549 INFO [train.py:996] (3/4) Epoch 12, batch 21750, loss[loss=0.2193, simple_loss=0.2825, pruned_loss=0.07807, over 21306.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.288, pruned_loss=0.0706, over 4273762.12 frames. ], batch size: 144, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:33:22,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0
2023-06-26 04:33:24,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2143146.0, ans=0.2
2023-06-26 04:33:26,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2143146.0, ans=0.1
2023-06-26 04:34:12,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2143266.0, ans=0.1
2023-06-26 04:34:26,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2143326.0, ans=0.1
2023-06-26 04:34:26,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.96 vs. limit=22.5
2023-06-26 04:34:30,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.472e+02 7.650e+02 1.064e+03 1.573e+03 4.038e+03, threshold=2.129e+03, percent-clipped=2.0
2023-06-26 04:34:44,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0
2023-06-26 04:35:12,906 INFO [train.py:996] (3/4) Epoch 12, batch 21800, loss[loss=0.2842, simple_loss=0.351, pruned_loss=0.1087, over 21668.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2867, pruned_loss=0.07147, over 4272451.79 frames. ], batch size: 415, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:35:27,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2143446.0, ans=0.2
2023-06-26 04:35:34,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2143506.0, ans=0.0
2023-06-26 04:37:09,665 INFO [train.py:996] (3/4) Epoch 12, batch 21850, loss[loss=0.2336, simple_loss=0.3036, pruned_loss=0.08175, over 21551.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2955, pruned_loss=0.07189, over 4275802.69 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:37:10,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2143746.0, ans=0.035
2023-06-26 04:37:20,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2143746.0, ans=0.125
2023-06-26 04:37:26,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0
2023-06-26 04:37:54,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2143866.0, ans=15.0
2023-06-26 04:37:55,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2143866.0, ans=0.0
2023-06-26 04:38:17,398 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.705e+02 1.349e+03 2.023e+03 4.101e+03, threshold=2.697e+03, percent-clipped=20.0
2023-06-26 04:38:26,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2143926.0, ans=0.125
2023-06-26 04:38:30,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2143926.0, ans=0.125
2023-06-26 04:38:34,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2143986.0, ans=0.125
2023-06-26 04:38:59,614 INFO [train.py:996] (3/4) Epoch 12, batch 21900, loss[loss=0.2062, simple_loss=0.272, pruned_loss=0.07019, over 21750.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2951, pruned_loss=0.07339, over 4272331.92 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:39:12,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2144046.0, ans=0.0
2023-06-26 04:39:39,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2144166.0, ans=0.2
2023-06-26 04:40:03,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2144226.0, ans=0.1
2023-06-26 04:40:41,096 INFO [train.py:996] (3/4) Epoch 12, batch 21950, loss[loss=0.1679, simple_loss=0.2479, pruned_loss=0.04397, over 21521.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2885, pruned_loss=0.07203, over 4266395.18 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0
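[Editor's note on grad_scale: this run trains with fp16 ('use_fp16': True), and the grad_scale printed with each batch summary (8.0, 16.0, 32.0 in this stretch) behaves like a dynamic loss scale: doubled after a run of overflow-free steps, halved when inf/nan gradients appear. A generic PyTorch sketch of that mechanism; icefall's trainer wires up its own scaler, so the toy model and loop here are only illustrative.]

    import torch

    model = torch.nn.Linear(4, 1).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

    for step in range(4):
        x = torch.randn(8, 4, device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(x).pow(2).mean()
        scaler.scale(loss).backward()    # backward on the scaled loss
        scaler.step(optimizer)           # unscales grads; skips step on inf/nan
        scaler.update()                  # grows or shrinks the scale
        print(step, scaler.get_scale())  # analogous to grad_scale in the log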
2023-06-26 04:40:43,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0
2023-06-26 04:40:56,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0
2023-06-26 04:41:05,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2144406.0, ans=0.07
2023-06-26 04:41:11,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2144406.0, ans=0.0
2023-06-26 04:41:57,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.471e+02 7.504e+02 1.022e+03 1.599e+03 3.109e+03, threshold=2.043e+03, percent-clipped=2.0
2023-06-26 04:42:03,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2144526.0, ans=0.0
2023-06-26 04:42:15,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2144586.0, ans=0.1
2023-06-26 04:42:18,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2144586.0, ans=0.0
2023-06-26 04:42:27,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2144586.0, ans=0.2
2023-06-26 04:42:32,990 INFO [train.py:996] (3/4) Epoch 12, batch 22000, loss[loss=0.2029, simple_loss=0.2663, pruned_loss=0.06977, over 21265.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.282, pruned_loss=0.06811, over 4254791.53 frames. ], batch size: 551, lr: 2.40e-03, grad_scale: 32.0
2023-06-26 04:43:28,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2144766.0, ans=0.125
2023-06-26 04:43:30,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5
2023-06-26 04:44:30,592 INFO [train.py:996] (3/4) Epoch 12, batch 22050, loss[loss=0.2446, simple_loss=0.3282, pruned_loss=0.08046, over 21787.00 frames. ], tot_loss[loss=0.215, simple_loss=0.29, pruned_loss=0.07, over 4260267.52 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:44:43,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2144946.0, ans=0.2
2023-06-26 04:45:02,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5
2023-06-26 04:45:47,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.424e+02 9.779e+02 1.604e+03 2.153e+03 5.995e+03, threshold=3.207e+03, percent-clipped=28.0
2023-06-26 04:46:00,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2145186.0, ans=0.125
2023-06-26 04:46:15,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2145186.0, ans=0.0
2023-06-26 04:46:21,812 INFO [train.py:996] (3/4) Epoch 12, batch 22100, loss[loss=0.2407, simple_loss=0.3196, pruned_loss=0.08092, over 21871.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2981, pruned_loss=0.07394, over 4256845.24 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:46:34,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0
2023-06-26 04:47:04,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2145306.0, ans=0.2
2023-06-26 04:47:17,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0
2023-06-26 04:47:36,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2145426.0, ans=0.125
2023-06-26 04:47:51,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0
2023-06-26 04:48:11,831 INFO [train.py:996] (3/4) Epoch 12, batch 22150, loss[loss=0.2271, simple_loss=0.3014, pruned_loss=0.07636, over 21434.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3003, pruned_loss=0.07629, over 4263947.80 frames. ], batch size: 177, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:48:13,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2145546.0, ans=0.0
2023-06-26 04:48:52,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2145606.0, ans=0.125
2023-06-26 04:48:55,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2145666.0, ans=0.0
2023-06-26 04:49:04,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2145666.0, ans=0.125
2023-06-26 04:49:15,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2145726.0, ans=0.125
2023-06-26 04:49:27,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.601e+02 7.424e+02 1.002e+03 1.530e+03 2.924e+03, threshold=2.004e+03, percent-clipped=0.0
2023-06-26 04:49:40,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2145786.0, ans=0.125
2023-06-26 04:50:01,404 INFO [train.py:996] (3/4) Epoch 12, batch 22200, loss[loss=0.2511, simple_loss=0.3384, pruned_loss=0.08186, over 21797.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.303, pruned_loss=0.07725, over 4265670.07 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0
2023-06-26 04:50:03,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2145846.0, ans=0.125
2023-06-26 04:51:26,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0
2023-06-26 04:51:40,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2146086.0, ans=0.125
2023-06-26 04:51:52,873 INFO [train.py:996] (3/4) Epoch 12, batch 22250, loss[loss=0.2246, simple_loss=0.3157, pruned_loss=0.0667, over 17134.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3116, pruned_loss=0.07962, over 4271065.04 frames. ], batch size: 60, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 04:52:23,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0
2023-06-26 04:52:31,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2146206.0, ans=0.05
2023-06-26 04:53:08,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2146326.0, ans=0.0
2023-06-26 04:53:10,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.929e+02 1.009e+03 1.467e+03 2.182e+03 5.502e+03, threshold=2.934e+03, percent-clipped=31.0
2023-06-26 04:53:10,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=2146326.0, ans=15.0
2023-06-26 04:53:15,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2146326.0, ans=0.125
2023-06-26 04:53:42,554 INFO [train.py:996] (3/4) Epoch 12, batch 22300, loss[loss=0.2283, simple_loss=0.2999, pruned_loss=0.07832, over 21931.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3131, pruned_loss=0.08169, over 4268214.21 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 04:53:44,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2146446.0, ans=0.125
2023-06-26 04:53:49,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2146446.0, ans=0.2
2023-06-26 04:54:14,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2146506.0, ans=0.1
2023-06-26 04:54:47,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2146566.0, ans=0.125
2023-06-26 04:55:23,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2146686.0, ans=0.0
2023-06-26 04:55:32,902 INFO [train.py:996] (3/4) Epoch 12, batch 22350, loss[loss=0.2232, simple_loss=0.2883, pruned_loss=0.0791, over 21589.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3101, pruned_loss=0.0819, over 4279163.33 frames. ], batch size: 212, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 04:55:35,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0
2023-06-26 04:55:41,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2146746.0, ans=0.125
2023-06-26 04:56:33,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2146866.0, ans=0.0
2023-06-26 04:56:42,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0
2023-06-26 04:56:57,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.431e+02 9.155e+02 1.132e+03 1.676e+03 3.302e+03, threshold=2.265e+03, percent-clipped=3.0
2023-06-26 04:57:23,236 INFO [train.py:996] (3/4) Epoch 12, batch 22400, loss[loss=0.1914, simple_loss=0.2742, pruned_loss=0.05429, over 21642.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.307, pruned_loss=0.07809, over 4282934.70 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 04:57:56,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2147106.0, ans=0.0
2023-06-26 04:58:44,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0
2023-06-26 04:58:51,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=2147226.0, ans=0.2
2023-06-26 04:58:59,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2147286.0, ans=0.125
2023-06-26 04:59:01,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5
2023-06-26 04:59:13,354 INFO [train.py:996] (3/4) Epoch 12, batch 22450, loss[loss=0.1908, simple_loss=0.2615, pruned_loss=0.06006, over 21671.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3001, pruned_loss=0.07689, over 4285615.74 frames. ], batch size: 333, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 05:00:13,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2147466.0, ans=0.125
2023-06-26 05:00:18,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2147466.0, ans=0.125
2023-06-26 05:00:23,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147526.0, ans=0.1
2023-06-26 05:00:37,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.261e+02 8.888e+02 1.140e+03 1.642e+03 4.602e+03, threshold=2.279e+03, percent-clipped=11.0
2023-06-26 05:00:37,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147526.0, ans=0.1
2023-06-26 05:00:55,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0
2023-06-26 05:01:13,383 INFO [train.py:996] (3/4) Epoch 12, batch 22500, loss[loss=0.2132, simple_loss=0.308, pruned_loss=0.05915, over 21396.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2969, pruned_loss=0.0767, over 4275249.31 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 05:01:19,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2147646.0, ans=0.0
2023-06-26 05:03:04,488 INFO [train.py:996] (3/4) Epoch 12, batch 22550, loss[loss=0.2491, simple_loss=0.3144, pruned_loss=0.09188, over 21278.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3, pruned_loss=0.07711, over 4280058.47 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 05:03:18,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2147946.0, ans=0.125
2023-06-26 05:04:06,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2148066.0, ans=10.0
2023-06-26 05:04:18,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.902e+02 9.532e+02 1.362e+03 1.911e+03 4.517e+03, threshold=2.723e+03, percent-clipped=17.0
2023-06-26 05:04:22,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2148126.0, ans=0.125
2023-06-26 05:05:00,721 INFO [train.py:996] (3/4) Epoch 12, batch 22600, loss[loss=0.262, simple_loss=0.3332, pruned_loss=0.09535, over 21688.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3036, pruned_loss=0.07762, over 4285948.46 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0
2023-06-26 05:05:18,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2148306.0, ans=0.2
2023-06-26 05:05:50,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0
2023-06-26 05:05:51,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2148366.0, ans=0.125
2023-06-26 05:06:13,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2148426.0, ans=0.0
2023-06-26 05:06:27,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0
2023-06-26 05:06:45,091 INFO [train.py:996] (3/4) Epoch 12, batch 22650, loss[loss=0.2403, simple_loss=0.3615, pruned_loss=0.0595, over 19752.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2997, pruned_loss=0.07705, over 4268255.63 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 8.0
2023-06-26 05:06:50,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2148546.0, ans=0.0
2023-06-26 05:07:04,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0
2023-06-26 05:07:52,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.795e+02 9.798e+02 1.388e+03 1.950e+03 5.687e+03, threshold=2.777e+03, percent-clipped=13.0
2023-06-26 05:08:07,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0
2023-06-26 05:08:13,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0
2023-06-26 05:08:31,893 INFO [train.py:996] (3/4) Epoch 12, batch 22700, loss[loss=0.2234, simple_loss=0.2932, pruned_loss=0.07674, over 21870.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2936, pruned_loss=0.07669, over 4266734.12 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 8.0
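[Editor's note on the Whitening records: scaling.py:962 compares a per-module statistic (metric) against a scheduled limit, and the whitening penalty only engages once the metric exceeds that limit, which is why metrics sitting comfortably below their limits are routine here. As a loose sketch of one plausible whiteness statistic (the exact formula is an assumption, not necessarily what scaling.py computes): a quantity that is 1.0 for an isotropic feature covariance and grows as variance concentrates in fewer directions.]

    import torch

    def whitening_metric(x):
        """Assumed whiteness statistic for features x of shape (N, C):
        mean(eig^2) / mean(eig)^2 over covariance eigenvalues; 1.0 means
        perfectly white, larger values mean more anisotropy."""
        x = x - x.mean(dim=0)
        cov = x.t() @ x / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return (eigs.pow(2).mean() / eigs.mean().pow(2)).item()

    x = torch.randn(1024, 256)
    print(whitening_metric(x))  # ~1.25: near-white up to sampling noise
    print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))  # much larger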
limit=15.0 2023-06-26 05:09:08,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2148906.0, ans=0.2 2023-06-26 05:09:30,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2149026.0, ans=0.125 2023-06-26 05:10:24,727 INFO [train.py:996] (3/4) Epoch 12, batch 22750, loss[loss=0.2493, simple_loss=0.3178, pruned_loss=0.09036, over 21744.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2941, pruned_loss=0.07832, over 4251647.46 frames. ], batch size: 113, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:10:43,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2149206.0, ans=0.125 2023-06-26 05:10:44,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-26 05:10:54,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2149206.0, ans=0.0 2023-06-26 05:11:03,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149266.0, ans=0.1 2023-06-26 05:11:11,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2149266.0, ans=0.125 2023-06-26 05:11:44,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.150e+02 7.737e+02 1.012e+03 1.483e+03 2.915e+03, threshold=2.025e+03, percent-clipped=0.0 2023-06-26 05:11:45,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2149326.0, ans=0.0 2023-06-26 05:11:50,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-06-26 05:12:05,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2149386.0, ans=0.125 2023-06-26 05:12:10,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2149386.0, ans=0.95 2023-06-26 05:12:14,839 INFO [train.py:996] (3/4) Epoch 12, batch 22800, loss[loss=0.2309, simple_loss=0.2981, pruned_loss=0.08183, over 21241.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2986, pruned_loss=0.08008, over 4259374.23 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:13:03,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2149566.0, ans=0.0 2023-06-26 05:13:08,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2149566.0, ans=0.125 2023-06-26 05:13:09,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-06-26 05:13:39,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2149686.0, ans=0.0 2023-06-26 05:14:03,875 INFO [train.py:996] (3/4) Epoch 12, batch 22850, loss[loss=0.2921, simple_loss=0.3264, pruned_loss=0.1289, over 21391.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.2949, pruned_loss=0.07948, over 4269043.07 frames. ], batch size: 508, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:14:54,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2149866.0, ans=0.125 2023-06-26 05:14:57,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=2149866.0, ans=0.05 2023-06-26 05:15:00,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2149926.0, ans=0.2 2023-06-26 05:15:23,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 9.685e+02 1.570e+03 2.544e+03 4.880e+03, threshold=3.139e+03, percent-clipped=35.0 2023-06-26 05:15:54,399 INFO [train.py:996] (3/4) Epoch 12, batch 22900, loss[loss=0.2979, simple_loss=0.3844, pruned_loss=0.1057, over 21458.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2978, pruned_loss=0.0786, over 4259823.48 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:15:59,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2150046.0, ans=0.04949747468305833 2023-06-26 05:16:03,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2150046.0, ans=0.125 2023-06-26 05:17:19,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-26 05:17:20,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2150226.0, ans=0.125 2023-06-26 05:17:20,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150226.0, ans=0.1 2023-06-26 05:17:38,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150286.0, ans=0.1 2023-06-26 05:17:53,763 INFO [train.py:996] (3/4) Epoch 12, batch 22950, loss[loss=0.2151, simple_loss=0.3252, pruned_loss=0.05255, over 21616.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3113, pruned_loss=0.07739, over 4256883.42 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:18:33,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=2150466.0, ans=8.0 2023-06-26 05:19:07,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2150526.0, ans=0.125 2023-06-26 05:19:12,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 8.459e+02 1.367e+03 2.050e+03 4.078e+03, threshold=2.734e+03, percent-clipped=4.0 2023-06-26 05:19:42,484 INFO [train.py:996] (3/4) Epoch 12, batch 23000, loss[loss=0.2492, simple_loss=0.3115, pruned_loss=0.09345, over 21639.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3103, pruned_loss=0.07508, over 4254452.27 frames. 
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:20:02,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2150706.0, ans=0.1 2023-06-26 05:20:18,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.25 vs. limit=10.0 2023-06-26 05:20:36,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150766.0, ans=0.125 2023-06-26 05:20:45,164 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-26 05:20:56,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2150826.0, ans=0.125 2023-06-26 05:21:14,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2150886.0, ans=0.2 2023-06-26 05:21:26,335 INFO [train.py:996] (3/4) Epoch 12, batch 23050, loss[loss=0.2679, simple_loss=0.3391, pruned_loss=0.09832, over 21501.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3116, pruned_loss=0.07764, over 4259265.93 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:21:29,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-26 05:21:32,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2150946.0, ans=0.125 2023-06-26 05:21:49,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-26 05:22:53,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2151126.0, ans=0.125 2023-06-26 05:22:54,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 9.048e+02 1.306e+03 1.787e+03 2.952e+03, threshold=2.611e+03, percent-clipped=5.0 2023-06-26 05:23:10,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2151186.0, ans=0.0 2023-06-26 05:23:19,178 INFO [train.py:996] (3/4) Epoch 12, batch 23100, loss[loss=0.238, simple_loss=0.2866, pruned_loss=0.09466, over 21161.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3075, pruned_loss=0.07833, over 4268561.24 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:23:22,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-26 05:24:01,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2151306.0, ans=10.0 2023-06-26 05:25:08,264 INFO [train.py:996] (3/4) Epoch 12, batch 23150, loss[loss=0.2057, simple_loss=0.2717, pruned_loss=0.06986, over 21572.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3011, pruned_loss=0.07733, over 4264380.54 frames. 
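
The scaling.py:182 lines print the current value (ans) of a ScheduledFloat: a scalar hyperparameter (dropout probability, skip rate, balancer probability, bypass scale, ...) interpolated as a function of batch_count so the regularizers can be annealed over training. This deep into the run (batch_count ≈ 2.15e6) the schedules have long since reached their final values, which is why each ans is constant across records. A minimal piecewise-linear sketch of the idea; the real class in icefall's scaling.py carries more machinery (defaults, arithmetic on schedules), so treat this as a reconstruction:

    # Minimal sketch of a batch-count schedule in the spirit of the
    # ScheduledFloat values logged by scaling.py:182: piecewise-linear
    # interpolation between (batch_count, value) breakpoints.
    class ScheduledFloatSketch:
        def __init__(self, *points):  # e.g. (0, 0.3), (20000, 0.1)
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return pts[-1][1]

    # A dropout-like value that reached its floor long before
    # batch_count ~ 2.15e6, matching the constant ans=0.1 seen above.
    # The breakpoints here are hypothetical.
    dropout_p = ScheduledFloatSketch((0, 0.3), (20000, 0.1))
    print(dropout_p.value(2151486.0))  # -> 0.1
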
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:25:12,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2151546.0, ans=0.2 2023-06-26 05:26:05,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2151666.0, ans=0.125 2023-06-26 05:26:10,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2151666.0, ans=0.0 2023-06-26 05:26:21,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2151726.0, ans=0.125 2023-06-26 05:26:25,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.610e+02 9.726e+02 1.363e+03 3.124e+03, threshold=1.945e+03, percent-clipped=3.0 2023-06-26 05:26:46,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2151786.0, ans=0.0 2023-06-26 05:26:55,429 INFO [train.py:996] (3/4) Epoch 12, batch 23200, loss[loss=0.2141, simple_loss=0.2831, pruned_loss=0.0725, over 21593.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2997, pruned_loss=0.07735, over 4266188.17 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:27:59,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2151966.0, ans=0.0 2023-06-26 05:28:05,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2152026.0, ans=0.0 2023-06-26 05:28:30,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2152086.0, ans=0.2 2023-06-26 05:28:41,934 INFO [train.py:996] (3/4) Epoch 12, batch 23250, loss[loss=0.2398, simple_loss=0.3073, pruned_loss=0.08615, over 21911.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3016, pruned_loss=0.07893, over 4273453.21 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:29:22,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2152206.0, ans=0.125 2023-06-26 05:29:58,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2152326.0, ans=0.125 2023-06-26 05:30:12,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.710e+02 8.107e+02 1.074e+03 1.814e+03 3.794e+03, threshold=2.148e+03, percent-clipped=19.0 2023-06-26 05:30:18,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2152386.0, ans=0.2 2023-06-26 05:30:20,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2152386.0, ans=0.125 2023-06-26 05:30:25,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-26 05:30:35,129 INFO [train.py:996] (3/4) Epoch 12, batch 23300, loss[loss=0.2319, simple_loss=0.3356, pruned_loss=0.06415, over 21277.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3095, pruned_loss=0.08083, over 4282257.61 frames. 
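
The scaling.py:962 Whitening records compare a per-module statistic against a limit, e.g. metric=4.62 vs. limit=15.0 above. The metric measures how far the channel covariance of an activation is from a scaled identity: it is 1.0 for perfectly white features and rises toward num_channels as one direction dominates. A plausible reconstruction of such a trace-ratio metric follows; the exact formula used in icefall may differ in detail:

    import torch

    # Hedged reconstruction of a whitening diagnostic like the one reported
    # by scaling.py:962. metric = n * tr(C^2) / tr(C)^2 equals
    # mean(eig^2) / mean(eig)^2 for covariance C with n channels: 1.0 when
    # C is a multiple of the identity, n when C is rank-1.
    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        num_frames, num_channels = x.shape
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            xg = x[:, g, :]                    # (frames, channels per group)
            cov = (xg.T @ xg) / num_frames     # channel covariance
            n = cov.shape[0]
            metrics.append(((cov @ cov).trace() * n / cov.trace() ** 2).item())
        return sum(metrics) / num_groups

    x = torch.randn(1000, 256)   # roughly white activations
    print(whitening_metric(x))   # ~1 (sampling noise pushes it slightly up),
                                 # comfortably under a limit like 15.0
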
], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:30:35,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2152446.0, ans=0.125 2023-06-26 05:30:42,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-26 05:31:16,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2152506.0, ans=0.125 2023-06-26 05:31:16,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2152506.0, ans=0.0 2023-06-26 05:31:24,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2152566.0, ans=0.2 2023-06-26 05:31:59,246 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:32:31,063 INFO [train.py:996] (3/4) Epoch 12, batch 23350, loss[loss=0.1818, simple_loss=0.272, pruned_loss=0.04582, over 21732.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3137, pruned_loss=0.07948, over 4276862.18 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:32:31,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2152746.0, ans=0.2 2023-06-26 05:33:05,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2152806.0, ans=0.125 2023-06-26 05:33:42,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2152926.0, ans=0.125 2023-06-26 05:33:49,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2152926.0, ans=0.2 2023-06-26 05:33:54,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+02 9.116e+02 1.311e+03 1.870e+03 4.347e+03, threshold=2.623e+03, percent-clipped=16.0 2023-06-26 05:34:20,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-26 05:34:21,106 INFO [train.py:996] (3/4) Epoch 12, batch 23400, loss[loss=0.2037, simple_loss=0.3042, pruned_loss=0.05162, over 20765.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3077, pruned_loss=0.07659, over 4280057.94 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:35:02,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2153106.0, ans=0.0 2023-06-26 05:35:52,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-26 05:36:02,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153286.0, ans=0.1 2023-06-26 05:36:11,423 INFO [train.py:996] (3/4) Epoch 12, batch 23450, loss[loss=0.2447, simple_loss=0.3138, pruned_loss=0.0878, over 21381.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3088, pruned_loss=0.07938, over 4287692.33 frames. 
], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:37:04,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2153466.0, ans=0.125 2023-06-26 05:37:09,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153466.0, ans=0.1 2023-06-26 05:37:31,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.599e+02 8.250e+02 1.125e+03 1.414e+03 2.942e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 05:37:32,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2153526.0, ans=0.125 2023-06-26 05:37:44,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2153586.0, ans=0.04949747468305833 2023-06-26 05:38:03,775 INFO [train.py:996] (3/4) Epoch 12, batch 23500, loss[loss=0.217, simple_loss=0.2884, pruned_loss=0.0728, over 21835.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3078, pruned_loss=0.08062, over 4292743.34 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:38:31,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2153706.0, ans=0.125 2023-06-26 05:39:00,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2153766.0, ans=0.125 2023-06-26 05:39:11,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-26 05:39:52,407 INFO [train.py:996] (3/4) Epoch 12, batch 23550, loss[loss=0.2163, simple_loss=0.2821, pruned_loss=0.07524, over 21780.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.302, pruned_loss=0.08013, over 4295115.75 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:40:03,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2153946.0, ans=0.0 2023-06-26 05:40:59,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2154126.0, ans=0.125 2023-06-26 05:41:08,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-26 05:41:09,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.684e+02 8.231e+02 1.204e+03 1.979e+03 6.408e+03, threshold=2.407e+03, percent-clipped=19.0 2023-06-26 05:41:09,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2154126.0, ans=0.0 2023-06-26 05:41:13,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2154186.0, ans=0.1 2023-06-26 05:41:24,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2154186.0, ans=0.125 2023-06-26 05:41:48,750 INFO [train.py:996] (3/4) Epoch 12, batch 23600, loss[loss=0.2265, simple_loss=0.3006, pruned_loss=0.07626, over 21274.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3033, pruned_loss=0.08001, over 4288385.17 frames. 
], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:42:49,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2154426.0, ans=0.125 2023-06-26 05:43:40,723 INFO [train.py:996] (3/4) Epoch 12, batch 23650, loss[loss=0.2495, simple_loss=0.3217, pruned_loss=0.08862, over 21459.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3043, pruned_loss=0.07873, over 4287407.28 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:44:01,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2154546.0, ans=0.09899494936611666 2023-06-26 05:44:13,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2154606.0, ans=0.125 2023-06-26 05:44:45,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2154726.0, ans=0.07 2023-06-26 05:45:11,509 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.083e+02 1.007e+03 1.364e+03 1.897e+03 4.250e+03, threshold=2.728e+03, percent-clipped=14.0 2023-06-26 05:45:36,447 INFO [train.py:996] (3/4) Epoch 12, batch 23700, loss[loss=0.2762, simple_loss=0.3597, pruned_loss=0.09637, over 19876.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3064, pruned_loss=0.07807, over 4284904.69 frames. ], batch size: 704, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:45:48,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-26 05:47:26,983 INFO [train.py:996] (3/4) Epoch 12, batch 23750, loss[loss=0.1981, simple_loss=0.3055, pruned_loss=0.04532, over 20662.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3096, pruned_loss=0.07872, over 4282788.44 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:47:52,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2155206.0, ans=0.125 2023-06-26 05:47:54,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2155206.0, ans=0.2 2023-06-26 05:48:10,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2155266.0, ans=0.125 2023-06-26 05:48:23,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-26 05:48:37,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-26 05:48:52,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2155326.0, ans=0.0 2023-06-26 05:48:55,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 9.334e+02 1.253e+03 1.718e+03 3.362e+03, threshold=2.506e+03, percent-clipped=5.0 2023-06-26 05:49:21,834 INFO [train.py:996] (3/4) Epoch 12, batch 23800, loss[loss=0.2907, simple_loss=0.3831, pruned_loss=0.09914, over 21616.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3059, pruned_loss=0.07605, over 4282200.83 frames. 
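
The grad_scale field in the train.py:996 records behaves like fp16 dynamic loss scaling: the scale grows while optimizer steps stay finite and is backed off when gradients overflow, which is consistent with the fall from 32.0 at batch 23600 to 8.0 at batch 23650 in this span. A generic equivalent using PyTorch's own GradScaler; the growth/backoff policy actually used by this recipe, and the model interface below, are assumptions:

    import torch

    # Generic dynamic loss scaling in the spirit of the logged grad_scale.
    # The specific init/growth/backoff constants are illustrative.
    scaler = torch.cuda.amp.GradScaler(
        init_scale=8.0, growth_factor=2.0, backoff_factor=0.5,
        growth_interval=2000,
    )

    def fp16_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)        # placeholder interface
        scaler.scale(loss).backward()
        scaler.step(optimizer)         # skipped if gradients overflowed
        scaler.update()                # grows, or backs off, the scale
        return scaler.get_scale()      # the "grad_scale" being logged
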
], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:50:29,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2155566.0, ans=0.125 2023-06-26 05:51:16,620 INFO [train.py:996] (3/4) Epoch 12, batch 23850, loss[loss=0.3076, simple_loss=0.4177, pruned_loss=0.09877, over 19788.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3158, pruned_loss=0.078, over 4280439.48 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:52:14,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=22.5 2023-06-26 05:52:15,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2155866.0, ans=0.125 2023-06-26 05:52:49,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 1.136e+03 1.834e+03 2.559e+03 6.160e+03, threshold=3.668e+03, percent-clipped=28.0 2023-06-26 05:53:14,963 INFO [train.py:996] (3/4) Epoch 12, batch 23900, loss[loss=0.2158, simple_loss=0.2964, pruned_loss=0.06758, over 20738.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3239, pruned_loss=0.08041, over 4279963.20 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:53:55,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-06-26 05:54:20,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2156166.0, ans=0.2 2023-06-26 05:54:45,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2156286.0, ans=0.2 2023-06-26 05:54:59,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2156286.0, ans=0.125 2023-06-26 05:55:04,315 INFO [train.py:996] (3/4) Epoch 12, batch 23950, loss[loss=0.2358, simple_loss=0.3068, pruned_loss=0.08241, over 21486.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3172, pruned_loss=0.08046, over 4269175.04 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:55:34,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2156406.0, ans=10.0 2023-06-26 05:56:06,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2156466.0, ans=0.1 2023-06-26 05:56:06,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2156466.0, ans=0.125 2023-06-26 05:56:32,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 7.798e+02 1.081e+03 1.467e+03 3.120e+03, threshold=2.162e+03, percent-clipped=0.0 2023-06-26 05:56:32,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2156586.0, ans=0.09899494936611666 2023-06-26 05:56:55,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.91 vs. 
limit=10.0 2023-06-26 05:57:04,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2156586.0, ans=0.125 2023-06-26 05:57:07,631 INFO [train.py:996] (3/4) Epoch 12, batch 24000, loss[loss=0.2943, simple_loss=0.3596, pruned_loss=0.1144, over 21677.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3186, pruned_loss=0.08348, over 4270445.41 frames. ], batch size: 391, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:57:07,631 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 05:57:25,649 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2659, simple_loss=0.36, pruned_loss=0.08593, over 1796401.00 frames. 2023-06-26 05:57:25,650 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-26 05:57:26,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2156646.0, ans=22.5 2023-06-26 05:57:30,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156646.0, ans=0.1 2023-06-26 05:57:32,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-26 05:58:06,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-26 05:58:16,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156766.0, ans=0.1 2023-06-26 05:58:30,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2156826.0, ans=0.0 2023-06-26 05:58:54,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2156826.0, ans=0.0 2023-06-26 05:59:11,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2156886.0, ans=0.2 2023-06-26 05:59:15,047 INFO [train.py:996] (3/4) Epoch 12, batch 24050, loss[loss=0.2039, simple_loss=0.2968, pruned_loss=0.05547, over 21754.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3195, pruned_loss=0.08367, over 4268067.79 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:59:29,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-26 05:59:40,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. limit=6.0 2023-06-26 05:59:52,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2157006.0, ans=0.125 2023-06-26 06:00:00,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-26 06:00:04,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. 
limit=15.0 2023-06-26 06:00:19,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2157126.0, ans=0.05 2023-06-26 06:00:43,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-26 06:00:45,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.641e+02 9.752e+02 1.281e+03 1.752e+03 4.034e+03, threshold=2.563e+03, percent-clipped=15.0 2023-06-26 06:00:48,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2157186.0, ans=0.2 2023-06-26 06:00:58,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2157186.0, ans=0.125 2023-06-26 06:01:05,403 INFO [train.py:996] (3/4) Epoch 12, batch 24100, loss[loss=0.2314, simple_loss=0.3213, pruned_loss=0.07074, over 21657.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3211, pruned_loss=0.08237, over 4271640.92 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:01:21,724 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:02:19,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2157426.0, ans=0.125 2023-06-26 06:02:57,787 INFO [train.py:996] (3/4) Epoch 12, batch 24150, loss[loss=0.2723, simple_loss=0.3433, pruned_loss=0.1006, over 21889.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3217, pruned_loss=0.0845, over 4279584.49 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:03:00,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=2157546.0, ans=0.1 2023-06-26 06:03:19,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2157606.0, ans=0.0 2023-06-26 06:03:23,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2157606.0, ans=0.2 2023-06-26 06:04:01,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=22.5 2023-06-26 06:04:30,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.135e+02 1.050e+03 1.414e+03 1.892e+03 4.392e+03, threshold=2.829e+03, percent-clipped=9.0 2023-06-26 06:04:34,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2157786.0, ans=0.125 2023-06-26 06:04:55,045 INFO [train.py:996] (3/4) Epoch 12, batch 24200, loss[loss=0.2764, simple_loss=0.3537, pruned_loss=0.09952, over 21785.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3236, pruned_loss=0.08617, over 4277502.13 frames. 
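
At batch 24000 above, the loop pauses to compute a validation loss (train.py:1019/1028) and then reports the CUDA high-water mark (train.py:1029). A sketch of that cadence; the validation interval, the model's (loss, num_frames) interface, and the loader are placeholders, not values taken from this log:

    import torch

    # Sketch of the periodic validation bookkeeping seen at batch 24000.
    # valid_interval and the model/loader interfaces are placeholders.
    def maybe_validate(model, valid_loader, batch_idx: int,
                       valid_interval: int = 3000) -> None:
        if batch_idx % valid_interval != 0:
            return
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = model(batch)   # placeholder interface
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        print(f"validation: loss={tot_loss / tot_frames:.4f}, "
              f"over {tot_frames:.2f} frames.")
        if torch.cuda.is_available():
            mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
            print(f"Maximum memory allocated so far is {mb}MB")
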
], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:05:50,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2157966.0, ans=0.1 2023-06-26 06:06:00,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2157966.0, ans=0.5 2023-06-26 06:06:42,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2158086.0, ans=0.2 2023-06-26 06:06:47,655 INFO [train.py:996] (3/4) Epoch 12, batch 24250, loss[loss=0.2015, simple_loss=0.3037, pruned_loss=0.04961, over 21659.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.32, pruned_loss=0.07935, over 4277804.38 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:07:22,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-26 06:08:15,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2158326.0, ans=0.125 2023-06-26 06:08:18,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.845e+02 9.840e+02 1.673e+03 2.724e+03 4.672e+03, threshold=3.346e+03, percent-clipped=24.0 2023-06-26 06:08:37,625 INFO [train.py:996] (3/4) Epoch 12, batch 24300, loss[loss=0.1891, simple_loss=0.2703, pruned_loss=0.05393, over 21782.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.314, pruned_loss=0.0741, over 4276963.64 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:09:36,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2158566.0, ans=0.1 2023-06-26 06:09:36,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2158566.0, ans=0.0 2023-06-26 06:09:36,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-26 06:09:53,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2158626.0, ans=0.125 2023-06-26 06:10:08,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-26 06:10:13,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-26 06:10:14,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2158686.0, ans=0.125 2023-06-26 06:10:28,335 INFO [train.py:996] (3/4) Epoch 12, batch 24350, loss[loss=0.2871, simple_loss=0.3588, pruned_loss=0.1077, over 21795.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3113, pruned_loss=0.07474, over 4284046.97 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:10:58,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. 
limit=6.0 2023-06-26 06:12:01,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 8.199e+02 1.178e+03 1.878e+03 3.422e+03, threshold=2.355e+03, percent-clipped=1.0 2023-06-26 06:12:03,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2158986.0, ans=0.0 2023-06-26 06:12:04,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.23 vs. limit=15.0 2023-06-26 06:12:24,448 INFO [train.py:996] (3/4) Epoch 12, batch 24400, loss[loss=0.2199, simple_loss=0.2915, pruned_loss=0.07411, over 20072.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3153, pruned_loss=0.07843, over 4284263.57 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:12:24,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2159046.0, ans=0.125 2023-06-26 06:12:49,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159046.0, ans=0.1 2023-06-26 06:13:00,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.02 vs. limit=10.0 2023-06-26 06:13:21,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2159166.0, ans=0.125 2023-06-26 06:13:31,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2159166.0, ans=0.125 2023-06-26 06:14:22,181 INFO [train.py:996] (3/4) Epoch 12, batch 24450, loss[loss=0.2373, simple_loss=0.3265, pruned_loss=0.07404, over 21622.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3161, pruned_loss=0.07955, over 4278017.62 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:14:38,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159346.0, ans=0.1 2023-06-26 06:15:38,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.18 vs. limit=15.0 2023-06-26 06:15:42,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.744e+02 8.987e+02 1.393e+03 2.049e+03 5.528e+03, threshold=2.786e+03, percent-clipped=20.0 2023-06-26 06:16:10,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-26 06:16:11,931 INFO [train.py:996] (3/4) Epoch 12, batch 24500, loss[loss=0.2175, simple_loss=0.2952, pruned_loss=0.06992, over 21939.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3155, pruned_loss=0.07894, over 4280342.86 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:16:38,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2159706.0, ans=0.0 2023-06-26 06:17:13,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2159766.0, ans=0.0 2023-06-26 06:17:22,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-26 06:18:10,455 INFO [train.py:996] (3/4) Epoch 12, batch 24550, loss[loss=0.2403, simple_loss=0.3141, pruned_loss=0.08324, over 21419.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3175, pruned_loss=0.08069, over 4280177.68 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:18:19,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2159946.0, ans=0.125 2023-06-26 06:18:23,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2159946.0, ans=10.0 2023-06-26 06:18:42,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2160006.0, ans=0.125 2023-06-26 06:19:21,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2160126.0, ans=0.125 2023-06-26 06:19:40,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.015e+02 1.028e+03 1.361e+03 2.117e+03 3.876e+03, threshold=2.722e+03, percent-clipped=8.0 2023-06-26 06:19:45,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2160186.0, ans=0.0 2023-06-26 06:19:58,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2160186.0, ans=0.07 2023-06-26 06:20:03,140 INFO [train.py:996] (3/4) Epoch 12, batch 24600, loss[loss=0.2291, simple_loss=0.2964, pruned_loss=0.08093, over 21803.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3132, pruned_loss=0.08061, over 4279149.96 frames. ], batch size: 352, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:21:26,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2160426.0, ans=0.125 2023-06-26 06:21:53,004 INFO [train.py:996] (3/4) Epoch 12, batch 24650, loss[loss=0.1884, simple_loss=0.2463, pruned_loss=0.06523, over 21412.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3057, pruned_loss=0.07847, over 4276644.16 frames. ], batch size: 212, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:22:43,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2160666.0, ans=0.0 2023-06-26 06:23:00,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2160726.0, ans=0.0 2023-06-26 06:23:21,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.831e+02 9.882e+02 1.634e+03 2.648e+03 5.082e+03, threshold=3.268e+03, percent-clipped=24.0 2023-06-26 06:23:45,412 INFO [train.py:996] (3/4) Epoch 12, batch 24700, loss[loss=0.2058, simple_loss=0.2889, pruned_loss=0.06133, over 21451.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3032, pruned_loss=0.07663, over 4276608.14 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:23:53,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2160846.0, ans=0.125 2023-06-26 06:25:34,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. 
limit=15.0 2023-06-26 06:25:34,900 INFO [train.py:996] (3/4) Epoch 12, batch 24750, loss[loss=0.2299, simple_loss=0.2921, pruned_loss=0.08386, over 21836.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2969, pruned_loss=0.07457, over 4268068.25 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:26:17,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2161266.0, ans=0.125 2023-06-26 06:26:55,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.337e+02 7.152e+02 1.055e+03 1.533e+03 3.030e+03, threshold=2.111e+03, percent-clipped=0.0 2023-06-26 06:27:05,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-26 06:27:24,926 INFO [train.py:996] (3/4) Epoch 12, batch 24800, loss[loss=0.2455, simple_loss=0.3116, pruned_loss=0.08968, over 21823.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.292, pruned_loss=0.07434, over 4272660.40 frames. ], batch size: 391, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 06:27:40,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2161506.0, ans=0.125 2023-06-26 06:27:51,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161506.0, ans=0.1 2023-06-26 06:27:58,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2161506.0, ans=0.2 2023-06-26 06:28:38,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-26 06:28:40,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2161626.0, ans=0.125 2023-06-26 06:28:49,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2161686.0, ans=0.0 2023-06-26 06:29:05,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2161686.0, ans=0.125 2023-06-26 06:29:09,406 INFO [train.py:996] (3/4) Epoch 12, batch 24850, loss[loss=0.2354, simple_loss=0.3186, pruned_loss=0.07609, over 21692.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2925, pruned_loss=0.07613, over 4280496.86 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:29:10,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2161746.0, ans=15.0 2023-06-26 06:29:23,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.10 vs. 
limit=12.0 2023-06-26 06:29:27,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2161746.0, ans=0.025 2023-06-26 06:29:29,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2161746.0, ans=0.04949747468305833 2023-06-26 06:30:31,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2161926.0, ans=0.125 2023-06-26 06:30:45,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.468e+02 9.714e+02 1.514e+03 2.212e+03 4.137e+03, threshold=3.028e+03, percent-clipped=27.0 2023-06-26 06:30:47,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2161986.0, ans=0.0 2023-06-26 06:31:07,650 INFO [train.py:996] (3/4) Epoch 12, batch 24900, loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.0802, over 21459.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2965, pruned_loss=0.07731, over 4281766.17 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:31:30,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162046.0, ans=0.1 2023-06-26 06:31:45,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2162106.0, ans=0.07 2023-06-26 06:31:54,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2162166.0, ans=0.125 2023-06-26 06:32:49,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=2162286.0, ans=15.0 2023-06-26 06:33:07,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-26 06:33:08,346 INFO [train.py:996] (3/4) Epoch 12, batch 24950, loss[loss=0.2512, simple_loss=0.327, pruned_loss=0.08769, over 21609.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3035, pruned_loss=0.0807, over 4278668.07 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:34:43,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.936e+02 9.070e+02 1.289e+03 1.993e+03 4.042e+03, threshold=2.579e+03, percent-clipped=8.0 2023-06-26 06:34:59,503 INFO [train.py:996] (3/4) Epoch 12, batch 25000, loss[loss=0.186, simple_loss=0.2391, pruned_loss=0.06639, over 20300.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3084, pruned_loss=0.0825, over 4278124.25 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:35:29,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2162706.0, ans=0.125 2023-06-26 06:36:15,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=22.5 2023-06-26 06:36:17,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. 
limit=10.0 2023-06-26 06:36:28,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2162886.0, ans=10.0 2023-06-26 06:36:49,044 INFO [train.py:996] (3/4) Epoch 12, batch 25050, loss[loss=0.1948, simple_loss=0.2524, pruned_loss=0.06861, over 20656.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3016, pruned_loss=0.08093, over 4275462.89 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:36:58,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-26 06:38:01,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2163126.0, ans=0.2 2023-06-26 06:38:14,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2163126.0, ans=0.0 2023-06-26 06:38:26,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 8.095e+02 1.236e+03 1.556e+03 3.258e+03, threshold=2.471e+03, percent-clipped=4.0 2023-06-26 06:38:31,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2163186.0, ans=0.1 2023-06-26 06:38:41,997 INFO [train.py:996] (3/4) Epoch 12, batch 25100, loss[loss=0.207, simple_loss=0.2841, pruned_loss=0.06494, over 21681.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.296, pruned_loss=0.07961, over 4275088.09 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:38:50,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2163246.0, ans=0.0 2023-06-26 06:39:10,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2163306.0, ans=0.125 2023-06-26 06:39:35,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2163366.0, ans=0.0 2023-06-26 06:40:23,730 INFO [train.py:996] (3/4) Epoch 12, batch 25150, loss[loss=0.2403, simple_loss=0.3159, pruned_loss=0.08234, over 21621.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3004, pruned_loss=0.07794, over 4258884.32 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 06:40:35,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2163546.0, ans=0.125 2023-06-26 06:41:05,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-26 06:41:10,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2163666.0, ans=0.0 2023-06-26 06:41:26,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-26 06:41:48,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.684e+02 1.120e+03 1.527e+03 4.461e+03, threshold=2.241e+03, percent-clipped=8.0 2023-06-26 06:42:06,938 INFO [train.py:996] (3/4) Epoch 12, batch 25200, loss[loss=0.1896, simple_loss=0.2723, pruned_loss=0.05344, over 21455.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3003, pruned_loss=0.07598, over 4260965.50 frames. 
], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:42:10,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2163846.0, ans=0.0 2023-06-26 06:42:23,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-26 06:42:51,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2163966.0, ans=0.2 2023-06-26 06:42:58,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2163966.0, ans=10.0 2023-06-26 06:43:33,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2164086.0, ans=10.0 2023-06-26 06:43:56,872 INFO [train.py:996] (3/4) Epoch 12, batch 25250, loss[loss=0.2556, simple_loss=0.297, pruned_loss=0.1071, over 21326.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2997, pruned_loss=0.07443, over 4257995.51 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:43:59,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-26 06:45:25,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.134e+02 9.127e+02 1.634e+03 5.272e+03, threshold=1.825e+03, percent-clipped=13.0 2023-06-26 06:45:37,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164386.0, ans=0.1 2023-06-26 06:45:45,401 INFO [train.py:996] (3/4) Epoch 12, batch 25300, loss[loss=0.2345, simple_loss=0.3169, pruned_loss=0.07604, over 21314.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2975, pruned_loss=0.07458, over 4253316.94 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:45:54,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2164446.0, ans=0.1 2023-06-26 06:46:22,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2164506.0, ans=0.0 2023-06-26 06:46:36,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2164566.0, ans=0.0 2023-06-26 06:46:40,109 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-26 06:46:56,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2164626.0, ans=0.125 2023-06-26 06:47:29,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2164686.0, ans=0.0 2023-06-26 06:47:35,396 INFO [train.py:996] (3/4) Epoch 12, batch 25350, loss[loss=0.2153, simple_loss=0.2981, pruned_loss=0.06624, over 21600.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3001, pruned_loss=0.07443, over 4255896.36 frames. 
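
The learning rate printed with each train.py:996 record decays very slowly at this point, ticking from lr: 2.39e-03 down to 2.38e-03 at batch 25250 above. That behaviour matches an Eden-style schedule, in which the rate depends on both batch and epoch counts and flattens out late in training. The formula below is a sketch from that family; base_lr, lr_batches, and lr_epochs are placeholders, not values read from this excerpt:

    # Hedged sketch of an Eden-style learning-rate schedule consistent with
    # the slow 2.39e-03 -> 2.38e-03 drift across this span. All three
    # parameters are placeholders.
    def eden_lr(base_lr: float, batch: float, epoch: float,
                lr_batches: float, lr_epochs: float) -> float:
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

Because both factors go as an inverse fourth root, a 50-batch step this far into epoch 12 moves the rate by well under 1%, matching the near-constant values logged here.
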
], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:47:35,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2164746.0, ans=0.07 2023-06-26 06:47:55,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2164806.0, ans=0.125 2023-06-26 06:48:22,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2164866.0, ans=0.0 2023-06-26 06:48:29,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2164866.0, ans=0.0 2023-06-26 06:48:51,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2164926.0, ans=0.125 2023-06-26 06:48:57,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-26 06:49:04,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.341e+02 9.859e+02 1.527e+03 2.457e+03 4.731e+03, threshold=3.054e+03, percent-clipped=38.0 2023-06-26 06:49:17,657 INFO [train.py:996] (3/4) Epoch 12, batch 25400, loss[loss=0.265, simple_loss=0.3103, pruned_loss=0.1099, over 21242.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2948, pruned_loss=0.07321, over 4256493.30 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:49:20,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 06:49:26,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2165046.0, ans=0.125 2023-06-26 06:49:35,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2165046.0, ans=0.1 2023-06-26 06:50:01,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-26 06:50:11,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2165166.0, ans=0.125 2023-06-26 06:51:05,416 INFO [train.py:996] (3/4) Epoch 12, batch 25450, loss[loss=0.2461, simple_loss=0.3046, pruned_loss=0.09377, over 15470.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2953, pruned_loss=0.07487, over 4254054.32 frames. ], batch size: 62, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:51:11,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2165346.0, ans=0.125 2023-06-26 06:51:12,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2165346.0, ans=0.0 2023-06-26 06:51:48,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2165466.0, ans=0.04949747468305833 2023-06-26 06:52:05,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. 
limit=6.0 2023-06-26 06:52:43,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.438e+02 1.024e+03 1.496e+03 3.770e+03, threshold=2.047e+03, percent-clipped=1.0 2023-06-26 06:52:55,754 INFO [train.py:996] (3/4) Epoch 12, batch 25500, loss[loss=0.1349, simple_loss=0.203, pruned_loss=0.03341, over 16859.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2941, pruned_loss=0.07126, over 4249165.84 frames. ], batch size: 60, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:53:08,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2165646.0, ans=0.125 2023-06-26 06:53:40,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2165766.0, ans=0.2 2023-06-26 06:54:02,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2165826.0, ans=0.125 2023-06-26 06:54:15,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-26 06:54:35,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=15.0 2023-06-26 06:54:38,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-26 06:54:53,031 INFO [train.py:996] (3/4) Epoch 12, batch 25550, loss[loss=0.2577, simple_loss=0.3664, pruned_loss=0.07447, over 20768.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.302, pruned_loss=0.0713, over 4254466.16 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:55:03,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.54 vs. limit=10.0 2023-06-26 06:55:30,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2166066.0, ans=0.125 2023-06-26 06:55:43,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-26 06:56:31,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.571e+02 9.605e+02 1.487e+03 2.203e+03 4.525e+03, threshold=2.973e+03, percent-clipped=31.0 2023-06-26 06:56:37,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2166186.0, ans=0.0 2023-06-26 06:56:43,668 INFO [train.py:996] (3/4) Epoch 12, batch 25600, loss[loss=0.2075, simple_loss=0.2807, pruned_loss=0.06715, over 19981.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3066, pruned_loss=0.07269, over 4253464.48 frames. 
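
Alongside the single-batch loss[...], each train.py:996 record prints tot_loss[... over N frames.], a frame-weighted aggregate over many recent batches (N ≈ 4.25e6 frames here versus roughly 2e4 for one batch). A minimal sketch of that bookkeeping; the exact windowing/decay used by train.py is an assumption here:

    # Sketch of the frame-weighted aggregate behind "tot_loss[... over N frames.]":
    # each batch contributes loss * frames, and dividing the running sums gives
    # the reported average. The real train.py also decays or resets this window
    # periodically, which is assumed rather than shown.
    class FrameWeightedLoss:
        def __init__(self, decay: float = 1.0):
            self.loss_sum = 0.0
            self.frames = 0.0
            self.decay = decay           # <1.0 would forget old batches

        def update(self, loss: float, num_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames

        @property
        def avg(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)
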
], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:57:07,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2166306.0, ans=0.1 2023-06-26 06:58:05,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2166426.0, ans=0.125 2023-06-26 06:58:14,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2166486.0, ans=0.125 2023-06-26 06:58:27,607 INFO [train.py:996] (3/4) Epoch 12, batch 25650, loss[loss=0.2051, simple_loss=0.2792, pruned_loss=0.06554, over 21757.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3067, pruned_loss=0.07448, over 4249509.47 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:58:51,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2166606.0, ans=0.0 2023-06-26 07:00:02,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2166786.0, ans=0.0 2023-06-26 07:00:05,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.624e+02 9.508e+02 1.344e+03 1.911e+03 3.919e+03, threshold=2.688e+03, percent-clipped=6.0 2023-06-26 07:00:11,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2166786.0, ans=0.125 2023-06-26 07:00:16,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2166846.0, ans=0.1 2023-06-26 07:00:17,951 INFO [train.py:996] (3/4) Epoch 12, batch 25700, loss[loss=0.2621, simple_loss=0.3194, pruned_loss=0.1024, over 21810.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3042, pruned_loss=0.07619, over 4259752.25 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:01:31,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2167026.0, ans=0.125 2023-06-26 07:01:43,121 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:01:49,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2167086.0, ans=0.035 2023-06-26 07:01:49,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2167086.0, ans=0.0 2023-06-26 07:02:06,687 INFO [train.py:996] (3/4) Epoch 12, batch 25750, loss[loss=0.2957, simple_loss=0.3836, pruned_loss=0.1039, over 21856.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3095, pruned_loss=0.07902, over 4266265.25 frames. 
], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:02:12,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2167146.0, ans=0.0 2023-06-26 07:02:53,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2167206.0, ans=0.125 2023-06-26 07:03:34,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2167326.0, ans=0.0 2023-06-26 07:03:36,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-26 07:03:47,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.983e+02 9.621e+02 1.292e+03 1.996e+03 6.312e+03, threshold=2.583e+03, percent-clipped=12.0 2023-06-26 07:03:50,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-26 07:04:10,185 INFO [train.py:996] (3/4) Epoch 12, batch 25800, loss[loss=0.2577, simple_loss=0.3382, pruned_loss=0.08861, over 21510.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3202, pruned_loss=0.08239, over 4268680.60 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:05:11,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2167626.0, ans=0.2 2023-06-26 07:05:20,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2167626.0, ans=0.025 2023-06-26 07:06:05,632 INFO [train.py:996] (3/4) Epoch 12, batch 25850, loss[loss=0.2151, simple_loss=0.2917, pruned_loss=0.06923, over 21895.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3224, pruned_loss=0.08204, over 4272744.82 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:06:11,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2167746.0, ans=0.125 2023-06-26 07:06:30,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2167806.0, ans=0.125 2023-06-26 07:07:48,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 9.971e+02 1.375e+03 1.928e+03 5.111e+03, threshold=2.750e+03, percent-clipped=7.0 2023-06-26 07:07:50,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2167986.0, ans=0.125 2023-06-26 07:07:51,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2167986.0, ans=0.025 2023-06-26 07:08:04,003 INFO [train.py:996] (3/4) Epoch 12, batch 25900, loss[loss=0.2345, simple_loss=0.3212, pruned_loss=0.07396, over 21317.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3232, pruned_loss=0.08221, over 4279535.62 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:08:34,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2168106.0, ans=0.0 2023-06-26 07:09:55,076 INFO [train.py:996] (3/4) Epoch 12, batch 25950, loss[loss=0.282, simple_loss=0.3596, pruned_loss=0.1022, over 21576.00 frames. 
], tot_loss[loss=0.2505, simple_loss=0.3295, pruned_loss=0.08572, over 4279124.01 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:10:16,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168406.0, ans=0.1 2023-06-26 07:10:52,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168466.0, ans=0.1 2023-06-26 07:11:09,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-26 07:11:27,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.61 vs. limit=22.5 2023-06-26 07:11:35,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.821e+02 8.955e+02 1.246e+03 1.921e+03 4.535e+03, threshold=2.491e+03, percent-clipped=11.0 2023-06-26 07:11:45,157 INFO [train.py:996] (3/4) Epoch 12, batch 26000, loss[loss=0.2449, simple_loss=0.3202, pruned_loss=0.08487, over 21439.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3291, pruned_loss=0.08426, over 4280537.32 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:12:35,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168766.0, ans=0.1 2023-06-26 07:13:30,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2168886.0, ans=0.125 2023-06-26 07:13:34,685 INFO [train.py:996] (3/4) Epoch 12, batch 26050, loss[loss=0.2211, simple_loss=0.2913, pruned_loss=0.07545, over 21890.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3294, pruned_loss=0.08487, over 4278976.61 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:13:59,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-26 07:14:27,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169066.0, ans=0.1 2023-06-26 07:14:47,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2169126.0, ans=10.0 2023-06-26 07:15:00,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2169126.0, ans=0.125 2023-06-26 07:15:11,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 1.039e+03 1.480e+03 2.039e+03 5.924e+03, threshold=2.960e+03, percent-clipped=13.0 2023-06-26 07:15:22,443 INFO [train.py:996] (3/4) Epoch 12, batch 26100, loss[loss=0.1915, simple_loss=0.2664, pruned_loss=0.05832, over 20972.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3228, pruned_loss=0.08423, over 4280375.12 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:15:36,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2169246.0, ans=0.0 2023-06-26 07:16:02,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=12.0 2023-06-26 07:16:14,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2169366.0, ans=0.125 2023-06-26 07:17:13,269 INFO [train.py:996] (3/4) Epoch 12, batch 26150, loss[loss=0.2552, simple_loss=0.3164, pruned_loss=0.09696, over 20063.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3194, pruned_loss=0.085, over 4277190.00 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:17:56,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2169606.0, ans=0.0 2023-06-26 07:18:44,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2169786.0, ans=0.125 2023-06-26 07:18:51,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2169786.0, ans=0.0 2023-06-26 07:18:52,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 8.636e+02 1.260e+03 1.715e+03 2.544e+03, threshold=2.520e+03, percent-clipped=0.0 2023-06-26 07:19:03,296 INFO [train.py:996] (3/4) Epoch 12, batch 26200, loss[loss=0.2337, simple_loss=0.3362, pruned_loss=0.06562, over 21868.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3212, pruned_loss=0.08316, over 4275510.11 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:19:43,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 07:20:16,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-26 07:20:24,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-26 07:20:52,518 INFO [train.py:996] (3/4) Epoch 12, batch 26250, loss[loss=0.2356, simple_loss=0.3057, pruned_loss=0.08272, over 21831.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3242, pruned_loss=0.08158, over 4273079.47 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:21:28,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2170206.0, ans=0.125 2023-06-26 07:22:09,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2170326.0, ans=0.125 2023-06-26 07:22:20,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-26 07:22:31,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.498e+02 1.014e+03 1.360e+03 1.996e+03 4.754e+03, threshold=2.720e+03, percent-clipped=15.0 2023-06-26 07:22:42,508 INFO [train.py:996] (3/4) Epoch 12, batch 26300, loss[loss=0.2389, simple_loss=0.3057, pruned_loss=0.08607, over 22017.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3208, pruned_loss=0.0818, over 4275113.07 frames. 
], batch size: 416, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:23:10,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2170446.0, ans=0.0 2023-06-26 07:23:13,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170506.0, ans=0.1 2023-06-26 07:23:15,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2170506.0, ans=0.125 2023-06-26 07:23:43,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2170566.0, ans=22.5 2023-06-26 07:24:10,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2170626.0, ans=0.125 2023-06-26 07:24:24,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2170686.0, ans=0.125 2023-06-26 07:24:40,553 INFO [train.py:996] (3/4) Epoch 12, batch 26350, loss[loss=0.3036, simple_loss=0.3636, pruned_loss=0.1218, over 21263.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3192, pruned_loss=0.08278, over 4282666.07 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:26:10,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.697e+02 8.952e+02 1.093e+03 1.459e+03 3.186e+03, threshold=2.186e+03, percent-clipped=0.0 2023-06-26 07:26:25,955 INFO [train.py:996] (3/4) Epoch 12, batch 26400, loss[loss=0.187, simple_loss=0.2509, pruned_loss=0.06157, over 21623.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3129, pruned_loss=0.08287, over 4275619.02 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:27:09,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-26 07:27:23,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2171166.0, ans=0.125 2023-06-26 07:27:32,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2171166.0, ans=0.0 2023-06-26 07:27:47,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2171226.0, ans=0.0 2023-06-26 07:28:30,888 INFO [train.py:996] (3/4) Epoch 12, batch 26450, loss[loss=0.2351, simple_loss=0.3447, pruned_loss=0.06273, over 20743.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3114, pruned_loss=0.08198, over 4247641.72 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:28:48,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2171406.0, ans=15.0 2023-06-26 07:29:06,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.15 vs. limit=22.5 2023-06-26 07:30:11,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.179e+02 1.055e+03 2.037e+03 2.907e+03 6.247e+03, threshold=4.074e+03, percent-clipped=46.0 2023-06-26 07:30:20,908 INFO [train.py:996] (3/4) Epoch 12, batch 26500, loss[loss=0.1966, simple_loss=0.2659, pruned_loss=0.06365, over 21398.00 frames. 
], tot_loss[loss=0.2373, simple_loss=0.3133, pruned_loss=0.08061, over 4241100.86 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:30:21,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-26 07:31:16,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2171766.0, ans=0.125 2023-06-26 07:31:31,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-26 07:32:07,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2171886.0, ans=0.0 2023-06-26 07:32:13,330 INFO [train.py:996] (3/4) Epoch 12, batch 26550, loss[loss=0.1842, simple_loss=0.2725, pruned_loss=0.04789, over 21609.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3114, pruned_loss=0.07746, over 4248595.12 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:32:18,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2171946.0, ans=0.125 2023-06-26 07:32:23,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2171946.0, ans=0.025 2023-06-26 07:33:46,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-26 07:33:48,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.879e+02 9.613e+02 1.344e+03 2.252e+03 5.344e+03, threshold=2.687e+03, percent-clipped=4.0 2023-06-26 07:33:57,141 INFO [train.py:996] (3/4) Epoch 12, batch 26600, loss[loss=0.2524, simple_loss=0.3124, pruned_loss=0.09626, over 21388.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3121, pruned_loss=0.07534, over 4253142.81 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:34:17,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-06-26 07:34:18,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2172306.0, ans=0.07 2023-06-26 07:34:19,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2172306.0, ans=0.125 2023-06-26 07:35:33,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2172486.0, ans=0.125 2023-06-26 07:35:45,152 INFO [train.py:996] (3/4) Epoch 12, batch 26650, loss[loss=0.1582, simple_loss=0.2441, pruned_loss=0.03618, over 21749.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3047, pruned_loss=0.07403, over 4250618.26 frames. 
], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:35:57,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2172546.0, ans=0.07 2023-06-26 07:36:01,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2172606.0, ans=0.125 2023-06-26 07:37:06,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2172726.0, ans=0.125 2023-06-26 07:37:20,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.879e+02 7.426e+02 9.644e+02 1.380e+03 2.955e+03, threshold=1.929e+03, percent-clipped=1.0 2023-06-26 07:37:24,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2172786.0, ans=0.07 2023-06-26 07:37:27,387 INFO [train.py:996] (3/4) Epoch 12, batch 26700, loss[loss=0.235, simple_loss=0.3063, pruned_loss=0.08181, over 21873.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2977, pruned_loss=0.07136, over 4251399.70 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:37:34,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-26 07:37:42,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172846.0, ans=0.1 2023-06-26 07:38:20,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-26 07:38:41,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-26 07:39:00,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2173086.0, ans=0.125 2023-06-26 07:39:09,499 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:39:17,597 INFO [train.py:996] (3/4) Epoch 12, batch 26750, loss[loss=0.2825, simple_loss=0.359, pruned_loss=0.103, over 21811.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2984, pruned_loss=0.07059, over 4256196.41 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:40:39,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2173326.0, ans=0.0 2023-06-26 07:41:00,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.608e+02 7.962e+02 9.923e+02 1.450e+03 3.695e+03, threshold=1.985e+03, percent-clipped=8.0 2023-06-26 07:41:18,471 INFO [train.py:996] (3/4) Epoch 12, batch 26800, loss[loss=0.236, simple_loss=0.3094, pruned_loss=0.08136, over 21763.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.306, pruned_loss=0.07528, over 4269084.99 frames. 
], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:41:33,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2173446.0, ans=0.125 2023-06-26 07:41:42,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2173506.0, ans=0.0 2023-06-26 07:41:57,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=15.0 2023-06-26 07:42:07,960 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:42:21,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2173626.0, ans=0.125 2023-06-26 07:43:09,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-26 07:43:10,544 INFO [train.py:996] (3/4) Epoch 12, batch 26850, loss[loss=0.1933, simple_loss=0.2601, pruned_loss=0.06319, over 21587.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.307, pruned_loss=0.07809, over 4267616.43 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:43:25,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.51 vs. limit=12.0 2023-06-26 07:44:06,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2173926.0, ans=0.0 2023-06-26 07:44:06,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2173926.0, ans=0.1 2023-06-26 07:44:32,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2173986.0, ans=0.1 2023-06-26 07:44:40,216 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 7.931e+02 1.142e+03 1.522e+03 3.683e+03, threshold=2.283e+03, percent-clipped=9.0 2023-06-26 07:44:51,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2174046.0, ans=0.0 2023-06-26 07:44:53,079 INFO [train.py:996] (3/4) Epoch 12, batch 26900, loss[loss=0.201, simple_loss=0.2829, pruned_loss=0.05957, over 19998.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2989, pruned_loss=0.07734, over 4267629.87 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:45:01,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2174046.0, ans=0.0 2023-06-26 07:45:37,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2174166.0, ans=0.125 2023-06-26 07:46:08,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.48 vs. limit=10.0 2023-06-26 07:46:41,252 INFO [train.py:996] (3/4) Epoch 12, batch 26950, loss[loss=0.255, simple_loss=0.3322, pruned_loss=0.08889, over 21324.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2996, pruned_loss=0.07752, over 4268367.96 frames. 
], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:47:17,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 07:47:50,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2174586.0, ans=0.125 2023-06-26 07:48:17,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.465e+02 7.709e+02 1.449e+03 2.208e+03 6.166e+03, threshold=2.897e+03, percent-clipped=23.0 2023-06-26 07:48:17,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2174586.0, ans=0.125 2023-06-26 07:48:19,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2174586.0, ans=0.2 2023-06-26 07:48:22,398 INFO [train.py:996] (3/4) Epoch 12, batch 27000, loss[loss=0.2135, simple_loss=0.3256, pruned_loss=0.0507, over 20845.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2995, pruned_loss=0.07478, over 4252940.18 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:48:22,398 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 07:48:40,346 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2401, simple_loss=0.3367, pruned_loss=0.07176, over 1796401.00 frames. 2023-06-26 07:48:40,347 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB 2023-06-26 07:48:49,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2174646.0, ans=0.2 2023-06-26 07:48:58,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2174706.0, ans=0.125 2023-06-26 07:50:06,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2174826.0, ans=0.125 2023-06-26 07:50:26,712 INFO [train.py:996] (3/4) Epoch 12, batch 27050, loss[loss=0.2069, simple_loss=0.3001, pruned_loss=0.05687, over 21643.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3025, pruned_loss=0.0724, over 4252978.05 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:50:47,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-26 07:51:56,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2175186.0, ans=0.0 2023-06-26 07:52:06,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.688e+02 1.022e+03 1.429e+03 2.477e+03 5.037e+03, threshold=2.858e+03, percent-clipped=17.0 2023-06-26 07:52:12,146 INFO [train.py:996] (3/4) Epoch 12, batch 27100, loss[loss=0.2495, simple_loss=0.3396, pruned_loss=0.07972, over 21607.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3035, pruned_loss=0.0731, over 4267616.82 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:52:30,180 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:52:32,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=22.5 2023-06-26 07:52:37,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-26 07:52:40,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2175306.0, ans=0.125 2023-06-26 07:52:45,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-26 07:52:47,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2175366.0, ans=0.125 2023-06-26 07:53:11,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2175366.0, ans=0.125 2023-06-26 07:53:35,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2175426.0, ans=0.035 2023-06-26 07:53:41,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2175426.0, ans=0.0 2023-06-26 07:54:02,039 INFO [train.py:996] (3/4) Epoch 12, batch 27150, loss[loss=0.2602, simple_loss=0.3509, pruned_loss=0.08474, over 21829.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3169, pruned_loss=0.07701, over 4272954.77 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:54:02,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2175546.0, ans=0.0 2023-06-26 07:54:04,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175546.0, ans=0.125 2023-06-26 07:54:05,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175546.0, ans=0.125 2023-06-26 07:54:16,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175546.0, ans=0.125 2023-06-26 07:54:17,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2175606.0, ans=0.0 2023-06-26 07:54:17,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2175606.0, ans=0.2 2023-06-26 07:54:30,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2175606.0, ans=0.125 2023-06-26 07:54:30,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2175606.0, ans=0.125 2023-06-26 07:55:22,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=15.0 2023-06-26 07:55:40,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.180e+02 9.529e+02 1.614e+03 2.368e+03 4.440e+03, threshold=3.227e+03, percent-clipped=12.0 2023-06-26 07:55:45,318 INFO [train.py:996] (3/4) Epoch 12, batch 27200, loss[loss=0.2185, simple_loss=0.2998, pruned_loss=0.06858, over 20053.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3245, pruned_loss=0.07943, over 4269165.60 frames. 
], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:55:55,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2175846.0, ans=22.5 2023-06-26 07:56:12,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2175906.0, ans=0.125 2023-06-26 07:57:33,880 INFO [train.py:996] (3/4) Epoch 12, batch 27250, loss[loss=0.2609, simple_loss=0.3317, pruned_loss=0.09511, over 21403.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3253, pruned_loss=0.08246, over 4263753.88 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:57:34,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2176146.0, ans=0.1 2023-06-26 07:57:48,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2176146.0, ans=0.125 2023-06-26 07:59:13,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 07:59:24,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.268e+02 8.548e+02 1.157e+03 1.722e+03 4.031e+03, threshold=2.315e+03, percent-clipped=3.0 2023-06-26 07:59:32,586 INFO [train.py:996] (3/4) Epoch 12, batch 27300, loss[loss=0.2181, simple_loss=0.2813, pruned_loss=0.07746, over 20034.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3263, pruned_loss=0.0835, over 4265487.41 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:59:54,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-26 08:00:28,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2176626.0, ans=0.0 2023-06-26 08:01:13,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2176686.0, ans=0.09899494936611666 2023-06-26 08:01:21,474 INFO [train.py:996] (3/4) Epoch 12, batch 27350, loss[loss=0.221, simple_loss=0.3047, pruned_loss=0.06862, over 21777.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3291, pruned_loss=0.08393, over 4269982.37 frames. 
], batch size: 332, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:01:23,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-26 08:01:37,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-26 08:02:03,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2176866.0, ans=0.125 2023-06-26 08:02:14,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2176926.0, ans=0.125 2023-06-26 08:02:34,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2176986.0, ans=0.125 2023-06-26 08:02:59,806 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.844e+02 9.303e+02 1.156e+03 1.696e+03 4.537e+03, threshold=2.312e+03, percent-clipped=11.0 2023-06-26 08:03:08,438 INFO [train.py:996] (3/4) Epoch 12, batch 27400, loss[loss=0.2254, simple_loss=0.2796, pruned_loss=0.08564, over 21631.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3245, pruned_loss=0.08362, over 4277286.12 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:03:40,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2177106.0, ans=0.125 2023-06-26 08:04:31,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2177226.0, ans=0.125 2023-06-26 08:04:56,333 INFO [train.py:996] (3/4) Epoch 12, batch 27450, loss[loss=0.2442, simple_loss=0.3237, pruned_loss=0.08231, over 21438.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3201, pruned_loss=0.08243, over 4279809.46 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:05:03,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177346.0, ans=0.1 2023-06-26 08:05:29,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2177466.0, ans=0.125 2023-06-26 08:05:59,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2177526.0, ans=0.125 2023-06-26 08:06:19,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2177586.0, ans=0.04949747468305833 2023-06-26 08:06:24,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2177586.0, ans=0.125 2023-06-26 08:06:33,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2177586.0, ans=0.2 2023-06-26 08:06:35,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.529e+02 8.908e+02 1.196e+03 1.781e+03 4.590e+03, threshold=2.391e+03, percent-clipped=13.0 2023-06-26 08:06:38,332 INFO [train.py:996] (3/4) Epoch 12, batch 27500, loss[loss=0.2349, simple_loss=0.3001, pruned_loss=0.08489, over 21509.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3181, pruned_loss=0.08336, over 4282806.88 frames. 
], batch size: 194, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:06:57,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-26 08:07:04,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-06-26 08:07:19,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2177766.0, ans=0.125 2023-06-26 08:08:03,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177886.0, ans=0.1 2023-06-26 08:08:17,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2177886.0, ans=0.125 2023-06-26 08:08:27,986 INFO [train.py:996] (3/4) Epoch 12, batch 27550, loss[loss=0.2691, simple_loss=0.3257, pruned_loss=0.1062, over 21354.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3122, pruned_loss=0.08054, over 4282753.09 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:08:41,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2177946.0, ans=0.0 2023-06-26 08:08:55,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2178006.0, ans=0.0 2023-06-26 08:10:04,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2178186.0, ans=0.2 2023-06-26 08:10:12,146 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.726e+02 8.167e+02 1.305e+03 2.054e+03 4.575e+03, threshold=2.609e+03, percent-clipped=17.0 2023-06-26 08:10:12,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2178186.0, ans=0.125 2023-06-26 08:10:15,771 INFO [train.py:996] (3/4) Epoch 12, batch 27600, loss[loss=0.2278, simple_loss=0.2869, pruned_loss=0.0843, over 21796.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3052, pruned_loss=0.07951, over 4276684.21 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:10:33,829 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-26 08:10:45,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. 
limit=6.0 2023-06-26 08:11:06,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2178366.0, ans=0.125 2023-06-26 08:11:10,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2178426.0, ans=0.125 2023-06-26 08:11:14,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2178426.0, ans=15.0 2023-06-26 08:11:15,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178426.0, ans=0.0 2023-06-26 08:11:51,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178486.0, ans=0.1 2023-06-26 08:12:00,885 INFO [train.py:996] (3/4) Epoch 12, batch 27650, loss[loss=0.2082, simple_loss=0.2723, pruned_loss=0.07202, over 21288.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3001, pruned_loss=0.07923, over 4278124.74 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:12:18,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2178606.0, ans=0.0 2023-06-26 08:12:25,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2178606.0, ans=0.125 2023-06-26 08:12:29,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2178606.0, ans=0.2 2023-06-26 08:13:43,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2178786.0, ans=0.125 2023-06-26 08:13:46,677 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.067e+02 8.445e+02 1.173e+03 2.476e+03 5.396e+03, threshold=2.346e+03, percent-clipped=23.0 2023-06-26 08:13:49,037 INFO [train.py:996] (3/4) Epoch 12, batch 27700, loss[loss=0.1912, simple_loss=0.2691, pruned_loss=0.05663, over 21847.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2997, pruned_loss=0.07696, over 4276916.68 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:14:00,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.71 vs. 
limit=15.0 2023-06-26 08:14:14,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2178906.0, ans=0.125 2023-06-26 08:14:25,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2178966.0, ans=0.2 2023-06-26 08:15:14,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179026.0, ans=0.125 2023-06-26 08:15:17,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2179086.0, ans=0.125 2023-06-26 08:15:19,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=2179086.0, ans=0.2 2023-06-26 08:15:31,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2179086.0, ans=0.0 2023-06-26 08:15:33,961 INFO [train.py:996] (3/4) Epoch 12, batch 27750, loss[loss=0.2228, simple_loss=0.3184, pruned_loss=0.06354, over 21292.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3043, pruned_loss=0.07688, over 4275502.90 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:16:03,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2179206.0, ans=0.1 2023-06-26 08:17:17,680 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.501e+02 1.486e+03 2.223e+03 3.680e+03, threshold=2.972e+03, percent-clipped=21.0 2023-06-26 08:17:18,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179446.0, ans=0.0 2023-06-26 08:17:19,378 INFO [train.py:996] (3/4) Epoch 12, batch 27800, loss[loss=0.2412, simple_loss=0.3091, pruned_loss=0.08667, over 21642.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3038, pruned_loss=0.07766, over 4277450.43 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:17:31,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179446.0, ans=0.0 2023-06-26 08:17:39,717 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:17:44,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2179506.0, ans=0.2 2023-06-26 08:17:47,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2179506.0, ans=0.125 2023-06-26 08:18:10,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179566.0, ans=0.125 2023-06-26 08:18:18,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2179626.0, ans=0.125 2023-06-26 08:19:10,308 INFO [train.py:996] (3/4) Epoch 12, batch 27850, loss[loss=0.2591, simple_loss=0.3336, pruned_loss=0.09234, over 21591.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3026, pruned_loss=0.07822, over 4285905.82 frames. 
], batch size: 471, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:19:31,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2179806.0, ans=0.0 2023-06-26 08:19:38,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179806.0, ans=0.1 2023-06-26 08:20:36,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2179986.0, ans=0.2 2023-06-26 08:20:49,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2179986.0, ans=0.0 2023-06-26 08:20:53,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-26 08:20:53,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 9.151e+02 1.351e+03 2.039e+03 4.854e+03, threshold=2.703e+03, percent-clipped=13.0 2023-06-26 08:20:55,762 INFO [train.py:996] (3/4) Epoch 12, batch 27900, loss[loss=0.2412, simple_loss=0.3409, pruned_loss=0.07079, over 21651.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3138, pruned_loss=0.07964, over 4291193.70 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:21:58,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2180166.0, ans=0.0 2023-06-26 08:22:13,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2180226.0, ans=0.125 2023-06-26 08:22:47,160 INFO [train.py:996] (3/4) Epoch 12, batch 27950, loss[loss=0.1977, simple_loss=0.2932, pruned_loss=0.05114, over 21838.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3128, pruned_loss=0.07573, over 4285703.47 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:22:55,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2180346.0, ans=0.2 2023-06-26 08:22:58,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-26 08:24:12,887 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:24:16,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2180586.0, ans=0.125 2023-06-26 08:24:33,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 7.857e+02 1.156e+03 1.726e+03 4.008e+03, threshold=2.312e+03, percent-clipped=7.0 2023-06-26 08:24:34,661 INFO [train.py:996] (3/4) Epoch 12, batch 28000, loss[loss=0.2113, simple_loss=0.3178, pruned_loss=0.05239, over 21286.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3097, pruned_loss=0.07297, over 4288965.18 frames. 
], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:25:39,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2180766.0, ans=0.035 2023-06-26 08:25:43,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2180766.0, ans=0.05 2023-06-26 08:25:58,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2180826.0, ans=0.0 2023-06-26 08:26:04,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2180886.0, ans=0.2 2023-06-26 08:26:22,250 INFO [train.py:996] (3/4) Epoch 12, batch 28050, loss[loss=0.1891, simple_loss=0.2525, pruned_loss=0.06287, over 21442.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3068, pruned_loss=0.07411, over 4285138.59 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:26:33,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-26 08:26:37,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-06-26 08:27:26,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-26 08:28:11,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.128e+02 1.251e+03 1.568e+03 2.445e+03 4.428e+03, threshold=3.136e+03, percent-clipped=24.0 2023-06-26 08:28:13,342 INFO [train.py:996] (3/4) Epoch 12, batch 28100, loss[loss=0.2107, simple_loss=0.277, pruned_loss=0.07226, over 21579.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3042, pruned_loss=0.07406, over 4284260.83 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:30:00,918 INFO [train.py:996] (3/4) Epoch 12, batch 28150, loss[loss=0.2011, simple_loss=0.2597, pruned_loss=0.07128, over 21602.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2965, pruned_loss=0.07395, over 4277290.35 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:30:38,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2181606.0, ans=0.0 2023-06-26 08:31:00,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2181666.0, ans=0.5 2023-06-26 08:31:20,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2181726.0, ans=0.0 2023-06-26 08:31:50,384 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.207e+02 9.538e+02 1.303e+03 2.099e+03 3.954e+03, threshold=2.605e+03, percent-clipped=5.0 2023-06-26 08:31:52,075 INFO [train.py:996] (3/4) Epoch 12, batch 28200, loss[loss=0.2682, simple_loss=0.3359, pruned_loss=0.1002, over 21922.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2961, pruned_loss=0.07575, over 4266099.87 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:33:53,893 INFO [train.py:996] (3/4) Epoch 12, batch 28250, loss[loss=0.2139, simple_loss=0.2927, pruned_loss=0.06755, over 20665.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3, pruned_loss=0.07841, over 4260670.11 frames. 
], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:34:15,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-26 08:34:16,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2182206.0, ans=0.09899494936611666 2023-06-26 08:34:42,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182266.0, ans=0.1 2023-06-26 08:35:01,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2182326.0, ans=0.0 2023-06-26 08:35:24,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2182386.0, ans=0.125 2023-06-26 08:35:42,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.099e+02 9.506e+02 1.674e+03 2.975e+03 5.994e+03, threshold=3.348e+03, percent-clipped=30.0 2023-06-26 08:35:44,308 INFO [train.py:996] (3/4) Epoch 12, batch 28300, loss[loss=0.1953, simple_loss=0.2522, pruned_loss=0.06916, over 20727.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2978, pruned_loss=0.07658, over 4255260.06 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:36:29,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2182566.0, ans=15.0 2023-06-26 08:36:34,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-26 08:37:41,265 INFO [train.py:996] (3/4) Epoch 12, batch 28350, loss[loss=0.2089, simple_loss=0.3032, pruned_loss=0.05735, over 21184.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2961, pruned_loss=0.07131, over 4260589.17 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:37:41,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2182746.0, ans=0.125 2023-06-26 08:38:03,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2182806.0, ans=0.125 2023-06-26 08:39:27,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 9.118e+02 1.283e+03 2.054e+03 3.751e+03, threshold=2.567e+03, percent-clipped=3.0 2023-06-26 08:39:29,413 INFO [train.py:996] (3/4) Epoch 12, batch 28400, loss[loss=0.2074, simple_loss=0.282, pruned_loss=0.06644, over 21672.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2922, pruned_loss=0.07103, over 4258329.34 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 08:39:37,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-26 08:41:19,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-26 08:41:22,183 INFO [train.py:996] (3/4) Epoch 12, batch 28450, loss[loss=0.2773, simple_loss=0.3335, pruned_loss=0.1106, over 21660.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2994, pruned_loss=0.07608, over 4261936.22 frames. 
], batch size: 507, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:42:49,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-26 08:43:10,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2183646.0, ans=0.125 2023-06-26 08:43:12,211 INFO [train.py:996] (3/4) Epoch 12, batch 28500, loss[loss=0.2396, simple_loss=0.3151, pruned_loss=0.08209, over 21789.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3019, pruned_loss=0.07811, over 4272857.16 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:43:13,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.783e+02 7.825e+02 1.127e+03 1.628e+03 3.596e+03, threshold=2.254e+03, percent-clipped=4.0 2023-06-26 08:43:21,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2183646.0, ans=0.125 2023-06-26 08:43:24,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2183646.0, ans=0.125 2023-06-26 08:43:25,200 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-26 08:43:31,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2183706.0, ans=0.2 2023-06-26 08:43:31,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2183706.0, ans=0.125 2023-06-26 08:43:36,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2183706.0, ans=0.0 2023-06-26 08:43:38,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2183706.0, ans=0.1 2023-06-26 08:43:54,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-26 08:44:38,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2183886.0, ans=0.125 2023-06-26 08:44:55,183 INFO [train.py:996] (3/4) Epoch 12, batch 28550, loss[loss=0.297, simple_loss=0.3904, pruned_loss=0.1018, over 21635.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3091, pruned_loss=0.08013, over 4267604.99 frames. 
2023-06-26 08:45:04,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2183946.0, ans=0.125
2023-06-26 08:45:07,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2183946.0, ans=0.2
2023-06-26 08:45:08,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2183946.0, ans=0.0
2023-06-26 08:45:46,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2184066.0, ans=0.0
2023-06-26 08:46:18,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2184126.0, ans=0.125
2023-06-26 08:46:45,251 INFO [train.py:996] (3/4) Epoch 12, batch 28600, loss[loss=0.2196, simple_loss=0.2985, pruned_loss=0.07037, over 21793.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3161, pruned_loss=0.08261, over 4266806.60 frames. ], batch size: 352, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:46:47,097 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.084e+02 9.095e+02 1.267e+03 2.017e+03 3.986e+03, threshold=2.534e+03, percent-clipped=14.0
2023-06-26 08:46:56,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2184246.0, ans=0.125
2023-06-26 08:46:58,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2184246.0, ans=0.1
2023-06-26 08:47:57,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2184366.0, ans=0.1
2023-06-26 08:48:05,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2184426.0, ans=0.2
2023-06-26 08:48:39,271 INFO [train.py:996] (3/4) Epoch 12, batch 28650, loss[loss=0.207, simple_loss=0.2701, pruned_loss=0.07193, over 21583.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3109, pruned_loss=0.08202, over 4266273.92 frames. ], batch size: 415, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:48:39,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2184546.0, ans=0.125
2023-06-26 08:48:54,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2184546.0, ans=0.125
2023-06-26 08:49:04,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184606.0, ans=0.1
2023-06-26 08:49:31,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2184666.0, ans=0.0
2023-06-26 08:49:57,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2184726.0, ans=0.07
2023-06-26 08:50:00,538 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-26 08:50:36,095 INFO [train.py:996] (3/4) Epoch 12, batch 28700, loss[loss=0.2486, simple_loss=0.3149, pruned_loss=0.09113, over 21752.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3091, pruned_loss=0.08266, over 4263117.13 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:50:37,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.894e+02 1.316e+03 2.106e+03 4.413e+03, threshold=2.633e+03, percent-clipped=13.0
2023-06-26 08:51:31,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2184966.0, ans=0.125
2023-06-26 08:51:37,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2185026.0, ans=0.125
2023-06-26 08:51:58,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2185086.0, ans=0.1
2023-06-26 08:52:24,662 INFO [train.py:996] (3/4) Epoch 12, batch 28750, loss[loss=0.2226, simple_loss=0.2858, pruned_loss=0.07966, over 21361.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3104, pruned_loss=0.08335, over 4273083.50 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:52:44,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0
2023-06-26 08:53:04,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185206.0, ans=0.1
2023-06-26 08:53:37,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0
2023-06-26 08:54:04,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2185386.0, ans=0.1
2023-06-26 08:54:13,700 INFO [train.py:996] (3/4) Epoch 12, batch 28800, loss[loss=0.3021, simple_loss=0.3652, pruned_loss=0.1195, over 21800.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3128, pruned_loss=0.08269, over 4266980.00 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 08:54:15,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.541e+02 8.442e+02 1.126e+03 1.600e+03 2.728e+03, threshold=2.251e+03, percent-clipped=1.0
2023-06-26 08:54:50,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5
2023-06-26 08:54:58,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2185566.0, ans=0.125
2023-06-26 08:55:03,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2185566.0, ans=0.125
2023-06-26 08:55:07,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0
2023-06-26 08:55:14,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0
2023-06-26 08:55:27,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2185626.0, ans=0.04949747468305833
2023-06-26 08:55:50,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2185686.0, ans=0.07
2023-06-26 08:56:00,418 INFO [train.py:996] (3/4) Epoch 12, batch 28850, loss[loss=0.2265, simple_loss=0.3032, pruned_loss=0.07493, over 21672.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3148, pruned_loss=0.08524, over 4276293.22 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:56:37,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185806.0, ans=0.1
2023-06-26 08:56:55,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2185866.0, ans=0.125
2023-06-26 08:57:06,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. limit=10.0
2023-06-26 08:57:26,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2185926.0, ans=0.125
2023-06-26 08:57:57,012 INFO [train.py:996] (3/4) Epoch 12, batch 28900, loss[loss=0.2622, simple_loss=0.3394, pruned_loss=0.09245, over 21790.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3179, pruned_loss=0.08665, over 4283830.38 frames. ], batch size: 118, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:58:06,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.674e+02 9.203e+02 1.221e+03 1.627e+03 4.762e+03, threshold=2.441e+03, percent-clipped=10.0
2023-06-26 08:58:58,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0
2023-06-26 08:59:49,848 INFO [train.py:996] (3/4) Epoch 12, batch 28950, loss[loss=0.2435, simple_loss=0.3376, pruned_loss=0.07473, over 21252.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3187, pruned_loss=0.08591, over 4277345.88 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 08:59:51,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2186346.0, ans=0.125
2023-06-26 09:00:03,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2186346.0, ans=0.0
2023-06-26 09:00:04,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5
2023-06-26 09:00:27,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0
2023-06-26 09:00:54,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0
2023-06-26 09:01:30,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2186586.0, ans=0.125
2023-06-26 09:01:34,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2186586.0, ans=0.2
2023-06-26 09:01:39,448 INFO [train.py:996] (3/4) Epoch 12, batch 29000, loss[loss=0.2605, simple_loss=0.3342, pruned_loss=0.09343, over 21784.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3211, pruned_loss=0.08475, over 4273843.22 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:01:42,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.937e+02 9.875e+02 1.439e+03 2.075e+03 4.824e+03, threshold=2.879e+03, percent-clipped=20.0
2023-06-26 09:03:23,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0
2023-06-26 09:03:27,696 INFO [train.py:996] (3/4) Epoch 12, batch 29050, loss[loss=0.2344, simple_loss=0.2995, pruned_loss=0.08468, over 21870.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3187, pruned_loss=0.08496, over 4281403.68 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:03:33,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=8.0
2023-06-26 09:03:41,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2186946.0, ans=0.04949747468305833
2023-06-26 09:04:41,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2187126.0, ans=0.125
2023-06-26 09:05:05,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2187186.0, ans=0.09899494936611666
2023-06-26 09:05:05,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2187186.0, ans=0.2
2023-06-26 09:05:17,240 INFO [train.py:996] (3/4) Epoch 12, batch 29100, loss[loss=0.1884, simple_loss=0.2514, pruned_loss=0.06275, over 21516.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08324, over 4278810.22 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:05:20,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.010e+02 8.838e+02 1.250e+03 1.736e+03 3.044e+03, threshold=2.501e+03, percent-clipped=1.0
2023-06-26 09:05:42,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2187306.0, ans=0.125
2023-06-26 09:06:00,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0
2023-06-26 09:06:54,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2187486.0, ans=0.2
2023-06-26 09:07:04,348 INFO [train.py:996] (3/4) Epoch 12, batch 29150, loss[loss=0.2536, simple_loss=0.3452, pruned_loss=0.08096, over 21820.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3088, pruned_loss=0.08122, over 4283471.69 frames. ], batch size: 316, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:07:13,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2187546.0, ans=0.125
2023-06-26 09:07:13,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2187546.0, ans=0.09899494936611666
2023-06-26 09:07:39,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2187606.0, ans=0.125
2023-06-26 09:08:14,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2187666.0, ans=0.125
2023-06-26 09:08:51,314 INFO [train.py:996] (3/4) Epoch 12, batch 29200, loss[loss=0.213, simple_loss=0.2769, pruned_loss=0.07448, over 15149.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3037, pruned_loss=0.07962, over 4281903.70 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:08:54,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.344e+02 9.543e+02 1.197e+03 1.931e+03 4.156e+03, threshold=2.395e+03, percent-clipped=13.0
2023-06-26 09:09:25,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2187906.0, ans=0.125
2023-06-26 09:09:39,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187966.0, ans=0.1
2023-06-26 09:09:48,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2187966.0, ans=0.125
2023-06-26 09:10:20,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0
2023-06-26 09:10:36,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2188086.0, ans=0.2
2023-06-26 09:10:39,254 INFO [train.py:996] (3/4) Epoch 12, batch 29250, loss[loss=0.1993, simple_loss=0.2596, pruned_loss=0.06945, over 20265.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3033, pruned_loss=0.07789, over 4281636.64 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:11:04,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2188206.0, ans=0.0
2023-06-26 09:11:56,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2188326.0, ans=0.125
2023-06-26 09:12:24,369 INFO [train.py:996] (3/4) Epoch 12, batch 29300, loss[loss=0.1885, simple_loss=0.2693, pruned_loss=0.05385, over 20712.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3045, pruned_loss=0.07666, over 4287336.66 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:12:27,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.498e+02 8.933e+02 1.350e+03 1.847e+03 4.429e+03, threshold=2.700e+03, percent-clipped=13.0
2023-06-26 09:12:54,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0
2023-06-26 09:13:02,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2188506.0, ans=0.1
2023-06-26 09:14:12,327 INFO [train.py:996] (3/4) Epoch 12, batch 29350, loss[loss=0.2275, simple_loss=0.3218, pruned_loss=0.06658, over 21651.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3, pruned_loss=0.07573, over 4279305.86 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:14:13,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=22.5
2023-06-26 09:14:29,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2188746.0, ans=0.2
2023-06-26 09:14:38,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2188806.0, ans=0.125
2023-06-26 09:15:17,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2188866.0, ans=0.125
2023-06-26 09:15:21,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2188866.0, ans=0.125
2023-06-26 09:15:32,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0
2023-06-26 09:16:14,802 INFO [train.py:996] (3/4) Epoch 12, batch 29400, loss[loss=0.192, simple_loss=0.2712, pruned_loss=0.05641, over 21737.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3009, pruned_loss=0.07402, over 4273215.25 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:16:18,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 9.109e+02 1.324e+03 1.907e+03 3.871e+03, threshold=2.647e+03, percent-clipped=5.0
2023-06-26 09:17:36,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2189286.0, ans=0.125
2023-06-26 09:18:06,989 INFO [train.py:996] (3/4) Epoch 12, batch 29450, loss[loss=0.2249, simple_loss=0.3005, pruned_loss=0.07465, over 21629.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.298, pruned_loss=0.07303, over 4265841.08 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:18:48,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2189406.0, ans=0.2
2023-06-26 09:18:53,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2189466.0, ans=0.125
2023-06-26 09:19:18,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2189526.0, ans=0.0
2023-06-26 09:19:20,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2189526.0, ans=0.125
2023-06-26 09:19:43,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2189586.0, ans=0.2
2023-06-26 09:19:55,239 INFO [train.py:996] (3/4) Epoch 12, batch 29500, loss[loss=0.2269, simple_loss=0.3002, pruned_loss=0.07682, over 21294.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3023, pruned_loss=0.07606, over 4272284.44 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:20:06,870 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 8.502e+02 1.287e+03 1.918e+03 4.991e+03, threshold=2.573e+03, percent-clipped=9.0
2023-06-26 09:20:24,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2189706.0, ans=0.125
2023-06-26 09:20:36,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2189706.0, ans=0.95
2023-06-26 09:20:48,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0
2023-06-26 09:20:54,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189766.0, ans=0.1
2023-06-26 09:21:00,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5
2023-06-26 09:21:04,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2189826.0, ans=0.0
2023-06-26 09:21:18,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189886.0, ans=0.1
2023-06-26 09:21:45,245 INFO [train.py:996] (3/4) Epoch 12, batch 29550, loss[loss=0.2535, simple_loss=0.3574, pruned_loss=0.07485, over 19853.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3035, pruned_loss=0.07787, over 4279093.70 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:22:14,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2190006.0, ans=0.125
2023-06-26 09:22:47,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0
2023-06-26 09:23:49,832 INFO [train.py:996] (3/4) Epoch 12, batch 29600, loss[loss=0.3393, simple_loss=0.4188, pruned_loss=0.1299, over 21510.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.311, pruned_loss=0.08081, over 4286980.33 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:23:55,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 9.952e+02 1.397e+03 2.179e+03 6.553e+03, threshold=2.795e+03, percent-clipped=14.0
2023-06-26 09:24:02,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0
2023-06-26 09:24:24,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2190366.0, ans=0.1
2023-06-26 09:25:06,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2190486.0, ans=0.0
2023-06-26 09:25:12,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2190486.0, ans=0.125
2023-06-26 09:25:35,912 INFO [train.py:996] (3/4) Epoch 12, batch 29650, loss[loss=0.2057, simple_loss=0.2713, pruned_loss=0.07006, over 21644.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3076, pruned_loss=0.07689, over 4284197.08 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:25:48,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0
2023-06-26 09:25:51,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2190606.0, ans=0.125
2023-06-26 09:26:48,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5
2023-06-26 09:27:04,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0
2023-06-26 09:27:22,320 INFO [train.py:996] (3/4) Epoch 12, batch 29700, loss[loss=0.2522, simple_loss=0.3223, pruned_loss=0.09103, over 15856.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3093, pruned_loss=0.07749, over 4285910.08 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:27:27,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 1.083e+03 1.930e+03 2.745e+03 6.322e+03, threshold=3.861e+03, percent-clipped=21.0
2023-06-26 09:27:41,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2190906.0, ans=0.125
2023-06-26 09:28:34,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2191026.0, ans=0.2
2023-06-26 09:28:36,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2191026.0, ans=0.025
2023-06-26 09:28:39,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2191086.0, ans=0.09899494936611666
2023-06-26 09:29:06,899 INFO [train.py:996] (3/4) Epoch 12, batch 29750, loss[loss=0.2491, simple_loss=0.3369, pruned_loss=0.08065, over 21754.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3128, pruned_loss=0.07658, over 4273321.69 frames. ], batch size: 247, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:30:13,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2191326.0, ans=0.125
2023-06-26 09:30:53,142 INFO [train.py:996] (3/4) Epoch 12, batch 29800, loss[loss=0.22, simple_loss=0.2942, pruned_loss=0.07286, over 21501.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3139, pruned_loss=0.07767, over 4278039.58 frames. ], batch size: 212, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:30:55,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0
2023-06-26 09:30:58,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 9.303e+02 1.235e+03 1.702e+03 3.129e+03, threshold=2.469e+03, percent-clipped=0.0
2023-06-26 09:31:20,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2191506.0, ans=0.125
2023-06-26 09:31:45,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2191566.0, ans=0.125
2023-06-26 09:31:55,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2191626.0, ans=15.0
2023-06-26 09:32:29,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0
2023-06-26 09:32:31,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.36 vs. limit=22.5
2023-06-26 09:32:37,761 INFO [train.py:996] (3/4) Epoch 12, batch 29850, loss[loss=0.2114, simple_loss=0.2804, pruned_loss=0.07121, over 21792.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3079, pruned_loss=0.07514, over 4272425.62 frames. ], batch size: 112, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:32:40,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2191746.0, ans=0.07
2023-06-26 09:32:51,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0
2023-06-26 09:34:29,917 INFO [train.py:996] (3/4) Epoch 12, batch 29900, loss[loss=0.2493, simple_loss=0.3085, pruned_loss=0.09506, over 21496.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3082, pruned_loss=0.07659, over 4274322.29 frames. ], batch size: 211, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:34:35,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.187e+02 8.891e+02 1.162e+03 1.804e+03 4.059e+03, threshold=2.324e+03, percent-clipped=12.0
2023-06-26 09:34:55,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2192106.0, ans=0.1
2023-06-26 09:35:15,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0
2023-06-26 09:36:03,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0
2023-06-26 09:36:08,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.86 vs. limit=10.0
2023-06-26 09:36:19,837 INFO [train.py:996] (3/4) Epoch 12, batch 29950, loss[loss=0.2798, simple_loss=0.3449, pruned_loss=0.1074, over 21675.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3122, pruned_loss=0.08087, over 4275487.65 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:36:37,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2192406.0, ans=0.125
2023-06-26 09:36:56,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2192406.0, ans=0.125
2023-06-26 09:37:37,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2192526.0, ans=0.0
2023-06-26 09:37:39,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192526.0, ans=0.1
2023-06-26 09:38:05,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192586.0, ans=0.1
2023-06-26 09:38:08,261 INFO [train.py:996] (3/4) Epoch 12, batch 30000, loss[loss=0.2002, simple_loss=0.3177, pruned_loss=0.04138, over 20766.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3148, pruned_loss=0.08115, over 4275218.25 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 32.0
2023-06-26 09:38:08,261 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-26 09:38:26,470 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2465, simple_loss=0.3441, pruned_loss=0.07444, over 1796401.00 frames.
2023-06-26 09:38:26,471 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 24550MB
2023-06-26 09:38:34,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2192646.0, ans=0.0
2023-06-26 09:38:37,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.580e+02 9.008e+02 1.239e+03 1.696e+03 3.151e+03, threshold=2.479e+03, percent-clipped=8.0
2023-06-26 09:38:40,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5
2023-06-26 09:39:10,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=2192706.0, ans=15.0
2023-06-26 09:39:34,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0
2023-06-26 09:39:58,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2192826.0, ans=0.125
2023-06-26 09:40:17,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5
2023-06-26 09:40:32,782 INFO [train.py:996] (3/4) Epoch 12, batch 30050, loss[loss=0.2397, simple_loss=0.3775, pruned_loss=0.05094, over 20713.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3195, pruned_loss=0.07829, over 4272873.80 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:40:56,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0
2023-06-26 09:41:08,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2193006.0, ans=0.0
2023-06-26 09:41:35,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2193066.0, ans=0.0
2023-06-26 09:42:30,766 INFO [train.py:996] (3/4) Epoch 12, batch 30100, loss[loss=0.202, simple_loss=0.2654, pruned_loss=0.06933, over 21473.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.317, pruned_loss=0.07772, over 4268554.35 frames. ], batch size: 195, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:42:42,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.763e+02 1.437e+03 2.326e+03 3.330e+03 7.267e+03, threshold=4.652e+03, percent-clipped=46.0
2023-06-26 09:42:48,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2193246.0, ans=0.0
2023-06-26 09:43:02,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0
2023-06-26 09:43:23,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2193366.0, ans=0.0
2023-06-26 09:44:04,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2193486.0, ans=0.0
2023-06-26 09:44:19,952 INFO [train.py:996] (3/4) Epoch 12, batch 30150, loss[loss=0.2639, simple_loss=0.3337, pruned_loss=0.09711, over 21131.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3123, pruned_loss=0.07877, over 4266404.61 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:45:35,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2193726.0, ans=0.015
2023-06-26 09:45:55,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5
2023-06-26 09:46:06,634 INFO [train.py:996] (3/4) Epoch 12, batch 30200, loss[loss=0.2336, simple_loss=0.3584, pruned_loss=0.0544, over 21190.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3153, pruned_loss=0.0782, over 4260896.88 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:46:14,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2193846.0, ans=0.09899494936611666
2023-06-26 09:46:15,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.538e+02 9.559e+02 1.334e+03 1.956e+03 4.030e+03, threshold=2.668e+03, percent-clipped=0.0
2023-06-26 09:46:26,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2193906.0, ans=0.2
2023-06-26 09:47:06,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2193966.0, ans=0.125
2023-06-26 09:47:25,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0
2023-06-26 09:47:51,627 INFO [train.py:996] (3/4) Epoch 12, batch 30250, loss[loss=0.3994, simple_loss=0.4705, pruned_loss=0.1641, over 21472.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3229, pruned_loss=0.08113, over 4262688.05 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:49:07,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0
2023-06-26 09:49:31,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2194386.0, ans=0.125
2023-06-26 09:49:39,816 INFO [train.py:996] (3/4) Epoch 12, batch 30300, loss[loss=0.2338, simple_loss=0.2877, pruned_loss=0.08995, over 21863.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3206, pruned_loss=0.08161, over 4257762.17 frames. ], batch size: 107, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:49:48,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.599e+02 1.166e+03 1.867e+03 3.678e+03, threshold=2.332e+03, percent-clipped=6.0
2023-06-26 09:50:40,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2194566.0, ans=0.125
2023-06-26 09:50:46,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2194566.0, ans=0.035
2023-06-26 09:51:09,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2194626.0, ans=0.0
2023-06-26 09:51:34,358 INFO [train.py:996] (3/4) Epoch 12, batch 30350, loss[loss=0.2991, simple_loss=0.3968, pruned_loss=0.1007, over 21685.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3207, pruned_loss=0.08261, over 4257079.54 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:51:35,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0
2023-06-26 09:51:44,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2194746.0, ans=0.125
2023-06-26 09:51:56,910 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-26 09:51:57,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.67 vs. limit=10.0
2023-06-26 09:52:57,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0
2023-06-26 09:52:59,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2194986.0, ans=0.125
2023-06-26 09:53:10,572 INFO [train.py:996] (3/4) Epoch 12, batch 30400, loss[loss=0.2093, simple_loss=0.2697, pruned_loss=0.07443, over 20232.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3171, pruned_loss=0.08179, over 4249863.39 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:53:18,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.851e+02 1.020e+03 1.381e+03 2.107e+03 4.576e+03, threshold=2.761e+03, percent-clipped=19.0
2023-06-26 09:53:34,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0
2023-06-26 09:53:34,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.83 vs. limit=15.0
2023-06-26 09:54:28,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2195286.0, ans=0.125
2023-06-26 09:54:39,875 INFO [train.py:996] (3/4) Epoch 12, batch 30450, loss[loss=0.2648, simple_loss=0.3894, pruned_loss=0.07005, over 19848.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3181, pruned_loss=0.08049, over 4192921.89 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:54:51,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2195346.0, ans=0.125
2023-06-26 09:55:22,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2195466.0, ans=0.1
2023-06-26 09:55:32,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2195526.0, ans=0.125
2023-06-26 09:55:33,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2195526.0, ans=0.125
2023-06-26 09:55:52,819 INFO [train.py:1249] (3/4) Done!